celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #788


UTSouthwesternDSSR
Contributor

Purpose/implementation Section

The goal of this PR is to annotate cell types with ScType and to identify tumor cells with CopyKat, using the annotated B cells in the same sample (if there are any) as the normal cells.

Please link to the GitHub issue that this pull request addresses.

What is the goal of this pull request?

Annotate cell types with ScType and identify tumor cells with CopyKat, using the annotated B cells in the same sample (if there are any) as the normal cells.

Briefly describe the general approach you took to achieve this goal.

  1. Create a marker gene list: most cell types come from the Azimuth human bone marrow reference (level 1: B, CD4 T, CD8 T, DC, HSPC, Mono, NK, Other T; level 2: Macrophage, Early Eryth, Late Eryth, Plasma, Platelet, Stromal), blast cells from Bhasin et al., and erythroid precursor and cancer cells from the ScType database.
  2. Annotate cell types by running ScType with the marker gene list above.
  3. Identify tumor cells by running CopyKat, using B cells from the same sample as normal cells.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Of the 11 samples, 7 have annotated B cells (all except SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100). Although SCPCS000099 contains B cells, they are not useful for identifying tumor cells, as shown by the extremely large number of not.defined cells with default parameters.

I may try merging the samples and using the annotated B cells as the normal cells when running CopyKat, in the hope of identifying tumor cells in the samples without B cells. This could be another pull request, if it works.

Results

What is the name of your results bucket on S3?

  • rds objects are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
  • metadata files and sctype results are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/
  • umap and dot plots are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots

What types of results does your code produce (e.g., table, figure)?

  • rds objects
  • two text files for each sample: _metadata.txt (cell ID, leiden clusters, cell type annotation, low confidence cell type annotation, CopyKat prediction [for the 7 samples]) and _sctype_top10_celltypes_perCluster.txt (top 10 possible cell types with their respective sctype score in each cluster)
  • umap plots showing leiden clustering, cell type, and CopyKat prediction respectively
  • dot plots showing the average expression of the group of markers for each cell type using AddModuleScore()

What is your summary of the results?

With the default threshold of requiring an sctype score > 25% of ncells in a cluster (sctype_classification), a large number of cells are annotated as "Unknown" in each sample, reaching as high as ~45% (disregarding SCPCS000099: 48% and SCPCS000100: 52%, which have far fewer features detected).
Screenshot 2024-10-01 at 5 31 35 PM
Screenshot 2024-10-01 at 5 37 40 PM

Therefore, I relaxed the threshold from 25% to 10% (lowConfidence_annot), and the highest percentage of "Unknown" cells dropped to 30% (disregarding SCPCS000099: 33% and SCPCS000100: 48%), with 5 samples having no "Unknown" cells.
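The thresholding described above can be sketched as follows. This is only an illustration of the rule as stated in this comment (a cluster keeps its top cell type unless its score falls below a fraction of its cell count), not the module's actual R code. The `label_clusters` helper, the column layout, and the numbers are all hypothetical and made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read tab-separated rows of
#   cluster <TAB> top cell type <TAB> sctype score <TAB> ncells
# and print the cluster with its final label. A cluster is labeled
# "Unknown" when its score is below FRACTION * ncells.
label_clusters() {
  awk -F'\t' -v frac="$1" '{
    label = ($3 < $4 * frac) ? "Unknown" : $2
    print $1 "\t" label
  }'
}

# Made-up scores for three clusters of 1000 cells each (illustration only).
clusters=$'0\tB\t300\t1000\n1\tCD4 T\t150\t1000\n2\tNK\t50\t1000'

# Default threshold (25% of ncells) versus the relaxed one (10% of ncells).
strict=$(printf '%s\n' "$clusters" | label_clusters 0.25)
relaxed=$(printf '%s\n' "$clusters" | label_clusters 0.10)

echo "25% threshold:"
printf '%s\n' "$strict"
echo "10% threshold:"
printf '%s\n' "$relaxed"
```

With these toy values, clusters 1 and 2 are "Unknown" at 25%, while only cluster 2 stays "Unknown" at 10%, which mirrors how relaxing the cutoff reduces the "Unknown" fraction.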

I only ran CopyKat for 7 samples, excluding SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100. However, tumor prediction does not work for SCPCS000099: 36% of its cells are "not.defined", while the other samples have less than 1%. When I relaxed the ngene.chr cutoff from 5 to 1, the number of "not.defined" cells decreased, but they were all annotated as "diploid".

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

  • The packages are installed and updated in renv.lock and conda.lock
  • The analysis can be run on a Standard-4XL virtual machine via AWS Lightsail for Research, but CopyKat runs quite slowly on this machine with one core. Therefore, I ran CopyKat on our lab server with 50 cores (the longest run for a single sample took ~1.5 hr).

Are there particular areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

  • Will you look at the results during the PR review, or will they be reviewed after the deadline?
  • For the cells labeled as "Unknown", do we need to dig further (e.g., manually) to identify their cell types?

Thank you so much for any suggestions/feedback!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@jaclyn-taroni
Member

Hi @UTSouthwesternDSSR, I wanted to answer your question before a full review:

Will you look at the results during the PR review, or will they be reviewed after the deadline?

We will look at the results during the review of the PR.

@jaclyn-taroni
Member

@UTSouthwesternDSSR, it looks like we're in an excellent position to start testing this module. I will commit some changes to this branch to get it up and running so we'll be aware of any errors during the review.

@jaclyn-taroni
Member

Hi @UTSouthwesternDSSR, it looks like you initialized a new module starting at 4396373. My assumption here is that you accidentally committed to this branch instead of a new one. Can we revert back to cd02ec1 so we can keep the review focused on the non-ETP-ALL module? One way would be to start a new branch at that commit and open a new PR. Please let me know if you have questions or if there's any way we can help! Thank you.

@UTSouthwesternDSSR
Contributor Author

Sorry about this! Yeah, I meant to create a new module for ETP-ALL. Could you please guide me on how to revert to "minor update on script"? I will try to open a new branch now.

@jaclyn-taroni
Member

Happy to help! You should be able to create a new branch from that commit with the following command:

git branch {new branch name} cd02ec127f6c5bd3a3772c3f1e2f264eb936b86b

Replace {new branch name}, including the curly brackets, with your chosen branch name. Then you'll need to push the branch:

# Checkout the branch
git checkout {new branch name}
# Push it to origin
git push -u origin HEAD

You could then create a new PR and close this one. You can just copy and paste exactly what you wrote in your initial comment here.

@UTSouthwesternDSSR
Contributor Author

This is what I did:

git branch UTSouthwesternDSSR/nonETP cd02ec127f6c5bd3a3772c3f1e2f264eb936b86b
git checkout UTSouthwesternDSSR/nonETP
git push -u origin origin/HEAD

Screenshot 2024-10-02 at 2 34 20 PM

But it gives the error above. I am not sure if I am doing something wrong.
Just to clarify: the branch "UTSouthwesternDSSR/nonETP" will have everything for the nonETP module, and I will create another branch for the ETP module?

@UTSouthwesternDSSR
Contributor Author

Actually I am a bit confused. It seems like it did work. On my GitHub page, there are two branches, main and UTSouthwesternDSSR/jwl. If I switch to main, the last commit is "minor update on script", while UTSouthwesternDSSR/jwl has "update gitignore" as its last commit. I am not sure where my third branch, UTSouthwesternDSSR/nonETP, is.

I have another question. I am trying to download the rds object from the S3 bucket with the ETP module to my local server, but when I ran git clone, my ETP module was not there. How can I get my ETP module when running git clone?

@jaclyn-taroni
Member

On my GitHub page, there are two branches, main and UTSouthwesternDSSR/jwl. If I switch to main, the last commit is "minor update on script", while UTSouthwesternDSSR/jwl has "update gitignore" as its last commit. I am not sure where my third branch, UTSouthwesternDSSR/nonETP, is.

Since you got this error when pushing to GitHub:

Screenshot 2024-10-02 at 2 34 20 PM

I would not expect the UTSouthwesternDSSR/nonETP branch to be on GitHub because the failure happens before the push is successful.

If you're using GitKraken, you can check out the UTSouthwesternDSSR/nonETP branch and then hit the push button. It will ask you what remote branch you want, and you can stick with the defaults.

If you want to use the command line, you'll need to generate a Personal Access Token (GitHub docs: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) and use it in your password field instead of your actual password. This is because multifactor authentication is on.

I have another question. I am trying to download the rds object from the S3 bucket with the ETP module to my local server, but when I ran git clone, my ETP module was not there. How can I get my ETP module when running git clone?

When cloning, you'll first be on the main branch, which does not currently contain the ETP module. To get your ETP module, check out UTSouthwesternDSSR/jwl.
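As a rough command-line sketch of that step: the demo below builds a throwaway stand-in for the fork in a temporary directory (the `fork` and `clone` directory names and the `etp-module/notes.txt` file are placeholders, not the real repository layout), clones it, and then checks out the second branch. In a real clone, only the last two git commands would be needed.

```shell
#!/usr/bin/env bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"

# Stand-in for the fork: main plus a second branch that holds the module.
git init -q fork
git -C fork symbolic-ref HEAD refs/heads/main
git -C fork config user.email "demo@example.com"
git -C fork config user.name "Demo"
echo "base" > fork/README.md
git -C fork add README.md
git -C fork commit -q -m "base commit"
git -C fork checkout -q -b UTSouthwesternDSSR/jwl
mkdir fork/etp-module
echo "placeholder" > fork/etp-module/notes.txt
git -C fork add etp-module
git -C fork commit -q -m "add module"
git -C fork checkout -q main

# A fresh clone starts on the default branch (main here).
git clone -q fork clone
cd clone

# Other branches exist only as remote-tracking refs until checked out.
git branch -r

# Checking out by name creates a local branch tracking origin's copy.
git checkout -q UTSouthwesternDSSR/jwl
current=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $current"
```

After the checkout, files that exist only on that branch (here the placeholder `etp-module/` directory) appear in the working tree.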

You can continue to develop the ETP module in this branch (UTSouthwesternDSSR/jwl) if you'd like, but what I would probably personally do is create a new branch off of main and cherry-pick the commits 4396373 and 0da0ec5 (here's a GitKraken tutorial: https://www.gitkraken.com/learn/git/cherry-pick). If you're not comfortable cherry-picking, no worries – I thought it might make it a little easier to get everything reviewed and merged down the line.
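For anyone who prefers the command line to GitKraken, the cherry-pick route might look like the sketch below. It runs in a throwaway repository so that it is safe to try: the two commits on the `side` branch are stand-ins for 4396373 and 0da0ec5, and the `etp-module` branch name is a placeholder. In the real repository, the equivalent would be creating a branch from main and passing those two hashes to `git cherry-pick`, oldest first.

```shell
#!/usr/bin/env bash
set -euo pipefail

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"
git config user.name "Demo"

# One commit on main, standing in for the existing module.
echo "existing module" > nonetp.txt
git add nonetp.txt
git commit -q -m "stand-in commit on main"

# Two commits on a side branch, standing in for the commits to move.
git checkout -q -b side
echo "new module" > etp.txt
git add etp.txt
git commit -q -m "stand-in commit 1"
first=$(git rev-parse HEAD)
echo "*.rds" > .gitignore
git add .gitignore
git commit -q -m "stand-in commit 2"
second=$(git rev-parse HEAD)

# The technique itself: start a new branch at main, then replay the
# chosen commits onto it in their original order.
git checkout -q -b etp-module main
git cherry-pick "$first" "$second" > /dev/null

# Count the commits that are on the new branch but not on main.
picked=$(git log --format=%s main..etp-module | wc -l | tr -d ' ')
echo "commits on etp-module that are not on main: $picked"
```

Cherry-picking copies the changes as new commits, so the new branch contains only main plus the selected work, which keeps a later PR focused on one module.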

Please let me know if you have any more questions.

@UTSouthwesternDSSR
Contributor Author

I tried what you suggested. I think this is what you meant, but after pressing the push button, my GitHub page still has only two branches, main and UTSouthwesternDSSR/jwl. I guess for now main will be for the nonETP module and UTSouthwesternDSSR/jwl will be for the ETP module.
Screenshot 2024-10-02 at 4 46 51 PM

I will create a new PR from the main branch for the nonETP module and close the current PR.
