celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #788

UTSouthwesternDSSR · 2024-10-01T23:03:49Z

Purpose/implementation Section

The goal of this PR is to annotate cell types with ScType and identify tumor cells with CopyKat, using the annotated B cells in the same sample (if there are any) as the normal reference.

Please link to the GitHub issue that this pull request addresses.

Issue: celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #787
Discussion: Non-early T-cell precursor T-cell acute lymphoblastic leukemia (non-ETP ALL) annotation (SCPCP000003) #630

What is the goal of this pull request?

Annotate cell types with ScType and identify tumor cells with CopyKat, using the annotated B cells in the same sample (if there are any) as the normal reference.

Briefly describe the general approach you took to achieve this goal.

Create a marker gene list: most cell types come from the Azimuth human bone marrow reference (level 1: B, CD4 T, CD8 T, DC, HSPC, Mono, NK, Other T; level 2: Macrophage, Early Eryth, Late Eryth, Plasma, Platelet, Stromal), blast cells from Bhasin et al., and erythroid precursors and cancer cells from the ScType database.
Annotate cell types by running ScType with the above marker gene list.
Identify tumor cells by running CopyKat, using B cells from the same sample as the normal cells.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Of the 11 samples, 7 have annotated B cells (all except SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100). Although SCPCS000099 contains B cells, they are not useful for identifying tumor cells, as shown by the extremely large number of "not.defined" cells returned with default parameters.

I may try merging the samples and using the annotated B cells as the normal cells when running CopyKat, in the hope of identifying tumor cells in the samples without B cells. If it works, this could be another pull request.

Results

What is the name of your results bucket on S3?

rds objects are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
metadata files and sctype results are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/
umap and dot plots are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots

What types of results does your code produce (e.g., table, figure)?

rds objects
two text files for each sample: _metadata.txt (cell ID, leiden clusters, cell type annotation, low confidence cell type annotation, CopyKat prediction [for the 7 samples]) and _sctype_top10_celltypes_perCluster.txt (top 10 possible cell types with their respective sctype score in each cluster)
umap plots showing leiden clustering, cell type, and CopyKat prediction respectively
dot plots showing the average expression of the group of markers for each cell type, computed with AddModuleScore()

What is your summary of the results?

With the default threshold, which requires a cluster's ScType score to exceed 25% of the number of cells in the cluster (sctype_classification), a large number of cells are annotated as "Unknown" in each sample, reaching as high as ~45% (disregarding SCPCS000099 at 48% and SCPCS000100 at 52%, which have a much lower number of features detected).

Therefore, I relaxed the threshold from 25% to 10% (lowConfidence_annot). The highest percentage of "Unknown" cells then dropped to 30% (disregarding SCPCS000099 at 33% and SCPCS000100 at 48%), with 5 samples having no "Unknown" cells at all.
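To make the two cutoffs concrete, here is a small sketch of the rule as described above: a cluster keeps its top-scoring label only if that score reaches the given fraction of the cluster's cell count, and is otherwise labeled "Unknown". The cluster rows are invented for illustration and `label_clusters` is a hypothetical helper, not code from this module (which does this step in R). Treating a score exactly at the cutoff as passing is also an assumption.

```shell
# Hypothetical helper: label_clusters FRACTION < table
# Input columns (tab-separated): cluster, top cell type, ScType score, ncells.
# Output columns: cluster, final label.
label_clusters() {
  awk -F'\t' -v frac="$1" 'BEGIN { OFS = "\t" }
    { label = ($3 < frac * $4) ? "Unknown" : $2; print $1, label }'
}

# Made-up example clusters, not results from this module.
clusters=$(printf '0\tB\t400\t1000\n1\tCD4 T\t150\t1000\n2\tBlast\t50\t1000\n')

echo "cutoff 25%:"
printf '%s\n' "$clusters" | label_clusters 0.25
echo "cutoff 10%:"
printf '%s\n' "$clusters" | label_clusters 0.10
```

In this toy input, lowering the cutoff lets a cluster with a moderate score keep its label, while a cluster with a very low score stays "Unknown" under both settings.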

I only ran CopyKat for 7 samples, excluding SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100. However, tumor prediction does not work for SCPCS000099: 36% of its cells are "not.defined", while the other samples have less than 1%. Even when I relax the ngene.chr cutoff from 5 to 1, the number of "not.defined" cells decreases, but they all get annotated as "diploid".

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

The required packages are recorded in renv.lock and conda.lock.
The analysis can be run on a Standard-4XL virtual machine via AWS Lightsail for Research, but CopyKat runs quite slowly on that machine with one core. Therefore, I ran CopyKat on our lab server with 50 cores (the longest run for a single sample took ~1.5 hr).

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

I am wondering will you take a look at the results part during PR, or it will be reviewed after the deadline?
For the cells that are labeled "Unknown", do we need to dig further (for example, manually) to try to identify their cell types?

Thank you so much for any suggestions/feedback!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

jaclyn-taroni · 2024-10-02T12:00:34Z

Hi @UTSouthwesternDSSR, I wanted to answer your question before a full review:

I am wondering will you take a look at the results part during PR, or it will be reviewed after the deadline?

We will look at the results during the review of the PR.

jaclyn-taroni · 2024-10-02T14:25:33Z

@UTSouthwesternDSSR, it looks like we're in an excellent position to start testing this module. I will commit some changes to this branch to get it up and running so we'll be aware of any errors during the review.

jaclyn-taroni · 2024-10-02T17:50:02Z

Hi @UTSouthwesternDSSR, it looks like you initialized a new module starting at 4396373. My assumption here is that you accidentally committed to this branch instead of a new one. Can we revert back to cd02ec1 so we can keep the review focused on the non-ETP-ALL module? One way would be to start a new branch at that commit and open a new PR. Please let me know if you have questions or if there's any way we can help! Thank you.

UTSouthwesternDSSR · 2024-10-02T19:02:43Z

Sorry about this! Yeah, I meant to create a new module for ETP-ALL. Could you please guide me on how to revert back to "minor update on script"? I will try to open a new branch now.

jaclyn-taroni · 2024-10-02T19:11:35Z

Happy to help! You should be able to create a new branch from that commit with the following command:

git branch {new branch name} cd02ec127f6c5bd3a3772c3f1e2f264eb936b86b

Replace {new branch name}, including the curly brackets, with your chosen branch name. Then you'll need to push the branch:

# Checkout the branch
git checkout {new branch name}
# Push it to origin
git push -u origin HEAD

You could then create a new PR and close this one. You can just copy and paste exactly what you wrote in your initial comment here.

UTSouthwesternDSSR · 2024-10-02T19:37:02Z

This is what I did:

git branch UTSouthwesternDSSR/nonETP cd02ec127f6c5bd3a3772c3f1e2f264eb936b86b
git checkout UTSouthwesternDSSR/nonETP
git push -u origin origin/HEAD

But it gives the error above. I am not sure if I am doing it wrong?
Just to clarify, so the branch "UTSouthwesternDSSR/nonETP" will have everything for nonETP module, and I will create another branch for ETP module?

UTSouthwesternDSSR · 2024-10-02T19:56:36Z

Actually I am a bit confused. It seems like it did work. On my github webpage, there are two branches main and UTSouthwesternDSSR/jwl. If I switch it to main, the last commit is "minor update on script", while UTSouthwesternDSSR/jwl has the last commit of "update gitignore". I am not sure where is my third branch UTSouthwesternDSSR/nonETP?

I have another question. I am trying to download the rds object from S3 bucket with ETP module to my local server, but when I did git clone, my ETP module is not there. I am wondering how could I have my ETP module there when doing git clone?

jaclyn-taroni · 2024-10-02T20:36:30Z

On my github webpage, there are two branches main and UTSouthwesternDSSR/jwl. If I switch it to main, the last commit is "minor update on script", while UTSouthwesternDSSR/jwl has the last commit of "update gitignore". I am not sure where is my third branch UTSouthwesternDSSR/nonETP?

Since you got an error when pushing to GitHub, I would not expect the UTSouthwesternDSSR/nonETP branch to be on GitHub: the failure happens before the push completes.

If you're using GitKraken, you can check out the UTSouthwesternDSSR/nonETP branch and then hit the push button. It will ask you what remote branch you want, and you can stick with the defaults.

If you want to use the command line, you'll need to generate a Personal Access Token (GitHub docs: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) and use it in your password field instead of your actual password. This is because multifactor authentication is on.

I have another question. I am trying to download the rds object from S3 bucket with ETP module to my local server, but when I did git clone, my ETP module is not there. I am wondering how could I have my ETP module there when doing git clone?

When cloning, you'll first be on the main branch, which does not currently contain the ETP module. To get your ETP module, checkout UTSouthwesternDSSR/jwl.

You can continue to develop the ETP module in this branch (UTSouthwesternDSSR/jwl) if you'd like, but what I would probably personally do is create a new branch off of main and cherry-pick the commits 4396373 and 0da0ec5 (here's a GitKraken tutorial: https://www.gitkraken.com/learn/git/cherry-pick). If you're not comfortable cherry-picking, no worries – I thought it might make it a little easier to get everything reviewed and merged down the line.
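For the command-line route, a minimal sketch of that cherry-pick approach follows. It runs entirely in a throwaway repository so it is safe to try; the branch names `mixed` and `etp-module`, the file names, and the "base commit" and "nonETP change" commits are placeholders rather than anything from this repository. In the real repository, the two hashes passed to `git cherry-pick` would be 4396373 and 0da0ec5.

```shell
set -eu
# Throwaway repository so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch "main"
git config user.name "Demo User"
git config user.email "demo@example.com"
git config commit.gpgsign false

echo "base" > README.md
git add README.md
git commit -q -m "base commit"

# A branch that mixes work from two modules.
git checkout -q -b mixed
echo "nonETP work" > nonetp.txt
git add nonetp.txt
git commit -q -m "nonETP change"
echo "skeleton" > module.txt
git add module.txt
git commit -q -m "init module skeleton"
first=$(git rev-parse HEAD)
echo "*.rds" > .gitignore
git add .gitignore
git commit -q -m "update gitignore"
second=$(git rev-parse HEAD)

# New branch off main, then copy over only the two module commits, oldest first.
git checkout -q -b etp-module main
git cherry-pick "$first" "$second" > /dev/null

# Lists the commits on etp-module that are not on main.
git log --format=%s main..etp-module
```

The new branch ends up with only the two module commits on top of main, so a PR opened from it would not include the unrelated change.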

Please let me know if you have any more questions.

UTSouthwesternDSSR · 2024-10-02T22:17:01Z

I tried what you suggested. I think this is what you meant, but after pressing the push button, my GitHub page still has only two branches, main and UTSouthwesternDSSR/jwl. I guess for now main will be for the nonETP module and UTSouthwesternDSSR/jwl will be for the ETP module.

I will create a new PR from the main branch for the nonETP module and close the current PR.

update for cell annotation

f2aa909

UTSouthwesternDSSR requested a review from jaclyn-taroni as a code owner October 1, 2024 23:03

jaclyn-taroni and others added 14 commits October 2, 2024 10:28

Uncomment out pull request and workflow call triggers

610a7d8

Download relevant test data in CI

c3f162e

Fill out Dockerfile using recommendations for renv + Conda

fb68ef2

Uncomment pull request and push steps for nonETP-ALL Docker builder

4a67535

Add step for installing system dependencies

38750f5

Add conda setup steps to workflow

b7b9052

Add running Rscripts to workflow

249f0ec

Newline in Dockerfile

e905fbe

Try activating conda environment

69849ac

Remove activation step and add defaults for job

5d0162a

reticulate code depends on conda env name

f0db47e

minor update on script

cd02ec1

init module skeleton

4396373

update gitignore

0da0ec5

added marker file

fd023eb

UTSouthwesternDSSR closed this Oct 3, 2024
