Single cell RNA sequencing of pediatric high-grade gliomas, SCPCP000001 #722

georginaalbadri · 2024-08-16T13:33:07Z

Proposed analysis

I propose data preprocessing (filtering and doublet detection), followed by dimensionality reduction and k-means clustering. Clusters will be annotated by combining marker gene analysis, differential expression, and cell label transfer using logistic regression.

Scientific goals

The data will be cleaned of low-quality cells, and two to three levels of cell labels will be provided. This will include a top layer of malignant vs non-malignant cells, and a second layer classifying the non-malignant cells into, e.g., neurons, astrocytes, and oligodendrocytes. The tumour cells can be classified further into OPC-like, AC-like, NPC-like, and mesenchymal-like.

Methods or approach

The analysis will be done in Python, primarily using scanpy. Preprocessing and labelling will follow best practices https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html. Filtering will be done by median absolute deviation filtering, and doublet removal using doubletdetection.
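As a rough illustration of the median absolute deviation (MAD) filtering mentioned above, here is a minimal sketch using only the Python standard library. The function name `mad_outliers`, the threshold of 5 MADs, and the example counts are assumptions made for illustration; in practice the metric (and whether it is log-transformed first) would follow whichever QC guidelines are adopted.

```python
from statistics import median


def mad_outliers(values, n_mads=5.0):
    """Flag values lying more than n_mads MADs away from the median.

    n_mads=5.0 is an assumed threshold, not a value fixed by this proposal.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [abs(v - med) > n_mads * mad for v in values]


# Hypothetical per-cell total counts; the last cell is an obvious outlier.
total_counts = [1000, 1100, 950, 1050, 980, 1020, 20000]
flags = mad_outliers(total_counts)
kept = [c for c, is_outlier in zip(total_counts, flags) if not is_outlier]
print(kept)
```

Note that if the MAD is zero (more than half the values are identical), every value that differs from the median is flagged, so the metric should be checked before filtering.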

Dimensionality reduction will be done using UMAP, followed by Leiden clustering. Built-in scanpy functions will be used to assess marker genes and differential gene expression for annotation. CellTypist, a cell label transfer method based on logistic regression, will be used to complement annotation by marker genes.

Existing modules

There is potential to collaborate with other projects containing glioblastoma samples.

Input data

Cell label transfer will utilise the GBMap dataset https://www.biorxiv.org/content/10.1101/2022.08.27.505439v1

Scientific literature

Reference dataset GBMap https://www.biorxiv.org/content/10.1101/2022.08.27.505439v1

GBM subtypes with markers https://doi.org/10.1016/j.cell.2019.06.024

Other details

Resources: Local machine and university HPC.

Timeline: I anticipate a first draft of cell labels will be available at the end of September.

Jen-OMalley · 2024-08-16T13:58:17Z

Maintainer

Hi @georginaalbadri! I'm Jen, the Scientific Community Manager at the Data Lab. Thank you for sharing your proposed analysis!

Have you filled out the contributor form yet? On this form, you will provide the name and email address that will be associated with the AWS account that we'll create for you. We also need this form returned to ensure you have agreed to the OpenScPCA terms and conditions and other policies. Once we receive this, our team will review your proposed analyses and get back to you with next steps within 3 business days!

In the meantime, please let us know if you have any questions about OpenScPCA. We look forward to discussing more with you soon!

0 replies

Jen-OMalley · 2024-08-16T14:25:02Z

Maintainer

Thank you for submitting the form @georginaalbadri!

I am realizing that some of our team members will be traveling next week, so it may take longer than 3 business days to review your proposed analysis. My apologies! You can expect to hear more feedback from our team during the week of August 26. But I will get back to you shortly with information about your AWS account set up.

2 replies

georginaalbadri Aug 16, 2024
Author

No problem, thank you!

Jen-OMalley Aug 16, 2024
Maintainer

@georginaalbadri your AWS account has been created, and you should receive an email to complete setup. Here are instructions for setting up AWS!

sjspielman · 2024-08-27T14:32:21Z

Maintainer

Hi @georginaalbadri! I'm Stephanie, one of the Data Scientists in the Data Lab. We're looking forward to having you on board as an OpenScPCA contributor!

Before you get started, I wanted to provide some feedback about your proposed analysis and offer additional guidance on how you can get started contributing.

Proposal questions

First, I have some feedback and questions about your proposed analysis specifically:

Data pre-processing

You wrote,

The analysis will be done in Python, primarily using scanpy. Preprocessing and labelling will follow best practices https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html. Filtering will be done by median absolute deviation filtering, and doublet removal using doubletdetection.

First, a quick implementation note: when you create your analysis module, you will likely want to use one of the "Python" flags to make it a Python module. I recommend (see below) that this be your first pull request: establishing your Python-based analysis module.

Second, it's great to see that you're planning to follow a set of "best practices" for this kind of analysis! However, for OpenScPCA contributions in general, we actually recommend that you start your analyses with the processed ScPCA matrices (i.e., the _processed_rna.h5ad files), because this ensures that all analyses in OpenScPCA have the same starting point. These objects have already undergone removal of empty droplets, filtering of low-quality cells, normalization, and dimension reduction with PCA and UMAP. Please refer to the documentation on the ScPCA Portal for more information about how these objects were processed.

In addition, if you would like to filter out doublets, we have already processed all datasets with scDblFinder. Results are available for use as described here. Again, in an effort to keep data processing consistent across OpenScPCA analysis modules, we recommend that you use these results to filter out doublets rather than running, for example, DoubletDetection from scratch.
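To illustrate how precomputed doublet calls could be used to filter cells, here is a minimal sketch using only the standard library. The table layout is an assumption: it supposes a tab-separated file with a barcode column and a class column holding "singlet" or "doublet", and the column names `barcodes` and `class` are hypothetical defaults that should be checked against the actual results files.

```python
import csv
import io


def singlet_barcodes(tsv_text, barcode_col="barcodes", class_col="class"):
    """Return the set of barcodes labelled 'singlet' in a doublet-call table.

    The column names are assumptions about the table layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[barcode_col] for row in reader if row[class_col] == "singlet"}


# Hypothetical doublet-call table for three cells.
example_tsv = (
    "barcodes\tscore\tclass\n"
    "AAAC\t0.02\tsinglet\n"
    "AAAG\t0.97\tdoublet\n"
    "AAAT\t0.10\tsinglet\n"
)
keep = singlet_barcodes(example_tsv)

# Barcodes present in the expression object; "AACA" has no doublet call.
cells = ["AAAC", "AAAG", "AAAT", "AACA"]
filtered = [bc for bc in cells if bc in keep]
print(filtered)
```

In this sketch, cells without any doublet call are dropped along with the doublets; whether that is the desired behaviour is a design choice to confirm.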

All that said, if there is a compelling reason why your pre-processing pipeline might be preferable, please let us know and we can discuss this particular circumstance.

Analysis methods

You wrote,

Built-in scanpy functions will be used to assess marker genes and differential gene expression for annotation. CellTypist, a cell label transfer method based on logistic regression, will be used to complement annotation by marker genes.

Thanks for providing some initial details about your proposed approach! To make sure we're on the same page for all of this, let's clarify a couple details:

For the CellTypist approach with label transfer, I looked into the GBMap dataset and found that the fully annotated data is available from CZI. Is this the same source of data you are planning to use here? If so, a couple things:
1. Do you plan to use the "Core GBmap" or the "Extended GBmap," or both? The answer to this may also depend on available computational resources, and you might start with one and extend to the other later if you decide to.
2. I see when downloading these datasets, the download button provides a stable link to the specific version of the dataset downloaded. For reproducibility purposes, we're going to need to save this link so the exact version can be re-downloaded in the future. Note that, ideally, you might also perform this download step using a command-line tool like wget or curl to fully automate it, but as long as you document the specific link that will definitely be good enough!!
3. A general comment - please note that we don't commit large result files to GitHub, but instead we generally share them via a researcher-specific results AWS bucket. We can chat more about how to organize your files when we get to planning next steps.
You mention that you will use built-in scanpy functions for cell type annotation with insights from DE and marker genes. Can you please clarify further which functions you plan to use? Will this be exploratory analysis or a complementary approach for performing cell type annotation? To my knowledge (which might be lacking!), scanpy itself doesn't support cell type annotation, but I know there are other packages within the broader scverse suite that do! Any additional details here about the specific approach you plan to take (including the source of the marker genes you plan to use) would be great! We'll need to nail down the specific packages/software you'll use and what additional data/metadata, if any, is needed before you dive into analysis. This information will also help me recommend the best next steps for you to take.
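For point 2 above (saving the stable link and automating the reference download), one possible shape for the download step is sketched below. It uses Python's standard library in place of wget or curl, and additionally records a SHA-256 checksum so a later re-download can be compared against the first one. The function name `download_with_checksum`, the placeholder URL, and the output path are all hypothetical.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def download_with_checksum(url, dest):
    """Download url to dest and return the SHA-256 hex digest of the file."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    sha256 = hashlib.sha256()
    with open(dest, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


# Usage (placeholder values, not real links or paths):
# REFERENCE_URL = "<stable link copied from the download dialog>"
# digest = download_with_checksum(REFERENCE_URL, "scratch/core_gbmap.h5ad")
# print(digest)  # record this alongside the link in the module's README
```

Documenting the link together with the checksum makes it easy to detect whether the file behind the link has changed.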

Recommended next steps

First, let's take a bit of time to discuss your analysis in this Discussion post so we're on the same page for the exact analyses you'll be performing. Then, once I have a clearer sense of the specific steps you're going to take, I can recommend an "order" of issues & pull requests for you to file that will help get you across the finish line more efficiently. This is essentially "scoping your work" to ensure slow-and-steady modular progress towards the final cell type annotations. Remember that the more focused a given pull request is, the faster it will move through review.

After we discuss, you'll be ready to start your analysis! Please follow the below steps to start contributing to the project:

Follow the technical setup instructions found in the OpenScPCA documentation.

File an issue to track the initiation of your analysis module.

This is not meant to be an issue representing your entire cell type annotation analysis. It is meant to just track the single step of creating the analysis module. In general, this is what we are aiming for: focused issues that lead to focused pull requests, building up the analysis module code iteratively.

Initiate your module and file a pull request to establish the module without any analysis code yet

For example, see this PR, which was made solely to initialize a new module.

After this PR has been reviewed and accepted, you will be ready to continue with the rest of the analysis! You'll file issues as you go, with one (or more, if needed) pull requests to complete each issue.

Thanks again for your interest in OpenScPCA, and I'm looking forward to working with you! One more quick note - you might be interested in joining our Childhood Cancer Data Science Slack, which you can use to communicate with other OpenScPCA contributors, the broader pediatric cancer research community, as well as to directly ask us in the Data Lab questions about your analysis module!

2 replies

georginaalbadri Sep 10, 2024
Author

Hi Stephanie, thanks for your comprehensive message and sorry for the delay in getting back to you. Let's clarify some of the points.

Data preprocessing
It's great to hear the preprocessing has already been done on the data; this will save a lot of time. There's no reason to follow slightly different guidelines, so I'm happy to use the processed data files.

Analysis methods

I will start with the Core GBMap, I have found in the past that this produced confident results when using it for cell transfer; I can use the additional data perhaps if the confidence levels in the model are low. I'll make sure to use wget and keep the exact link.
I'll use scanpy complementary to the CellTypist cell transfer approach, by clustering on the UMAP using leiden clustering. Particularly where two cell types meet on the UMAP plot, it can be really useful to do 'overclustering' with 100+ clusters and each one can be analysed to get an accurate border between cell types. This can use scanpy's rank_genes_groups for differential expression analysis with knowledge of markers to identify which cell type the cluster is likely to be. As you say, it's not direct cell typing, but you can manually interrogate clusters to check the label transfer results and make sure the boundaries between cell types are accurate on the UMAP plot. Given a list of known markers, scanpy's score_genes is also very useful to see which clusters have the marker list upregulated.

Let me know if that makes sense and sounds good!

sjspielman Sep 10, 2024
Maintainer

Hi @georginaalbadri, no worries about any delay, it's all good!

I will start with the Core GBMap, I have found in the past that this produced confident results when using it for cell transfer; I can use the additional data perhaps if the confidence levels in the model are low. I'll make sure to use wget and keep the exact link.

Sounds great, thanks for clarifying this!

I'll use scanpy complementary to the CellTypist cell transfer approach, by clustering on the UMAP using leiden clustering.

If I understand correctly then, it sounds like you will use CellTypist as the primary method of annotation, and you will use approaches within scanpy to perform some "validation" on the labels. This makes sense to me! The only thing I would caution is that we don't rely too much on the specific UMAP "layout" itself for any of this validation, because UMAP dimensions and distances between points are (disappointingly, perhaps..) not actually very meaningful.

In terms of next steps, here's how I would recommend proceeding, where each step below corresponds to 1 issue, each with 1 or more PRs depending on the size:

First, you can write an issue and file a PR to establish your analysis module. I included some links to our documentation on this in my previous comment. We recommend doing this (seemingly small!) step first, before writing any code, to make sure you have some experience with PRs, code review, and syncing your repository after the PR is merged before getting too far into the analysis.
- Specifically, you will want to establish a Python-based module, and just be aware that for next steps, you will be using conda and conda-lock to manage software dependencies.
You can then proceed to write code that performs cell typing and validation on just one sample. Then, we'll ultimately have this code run across all samples of interest. We should take the following steps for this:
- First, you'll want a script (or Jupyter notebook, whichever you prefer!) that runs CellTypist on one sample and exports a table of annotations. These can be saved in your module's results directory (note that contents of results will not be included in version control but should be uploaded to your researcher bucket for review).
  - As part of this, you'll likely also want a separate script that can download Core GBMap and save it to your module's scratch directory. This step could probably go in the overall script that runs the module. This can either be a PR on its own, or I imagine the code will be small enough that it should be able to go in the same PR as the CellTypist code.
- Second, you'll want most likely a Jupyter notebook (vs a python script) to explore the data along with the CellTypist results for validation. We can probably chat more about this once we have the initial CellTypist code up and running.
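As one possible starting point for the validation notebook, the sketch below cross-tabulates cluster assignments against transferred labels and reports, for each cluster, the most common label and the fraction of cells carrying it. This is only an illustrative check, not a method prescribed in this thread; the function name `cluster_label_summary` and the example clusters and labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_label_summary(clusters, labels):
    """Map each cluster to (most common label, fraction of cells with it)."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    summary = {}
    for cluster, label_counts in counts.items():
        top_label, top_n = label_counts.most_common(1)[0]
        summary[cluster] = (top_label, top_n / sum(label_counts.values()))
    return summary


# Hypothetical per-cell cluster IDs and transferred labels.
clusters = ["0", "0", "0", "1", "1", "1", "1"]
labels = ["Neuron", "Neuron", "Astrocyte",
          "OPC-like", "OPC-like", "OPC-like", "OPC-like"]

summary = cluster_label_summary(clusters, labels)
for cluster, (label, fraction) in sorted(summary.items()):
    print(f"cluster {cluster}: {label} ({fraction:.0%} of cells)")
```

Clusters with a low top-label fraction are natural candidates for closer inspection with marker genes, without relying on the UMAP layout itself.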

I know there are a lot of moving parts here, so please let me know what I can clarify or provide more information about, or any other ways that I can support you as you begin working on this analysis!

I do also recommend having a look at least through some of these specific areas of the OpenScPCA documentation before beginning, since this will help you learn about OpenScPCA procedures and expectations before getting too deep into analysis:

This section on analysis modules
This section on code review
This section on working with conda, as well as reporting dependencies
