-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to identify tumor cells in those non-ETP samples without B cells (SCPCP000003) #815
Conversation
As a side note, to address the questions regarding how well the clusters are separated and their stability/purity, I wrote a script I also included a script I guess it is a bit confusing. Sorry about putting them in a single PR! I just thought that these are the exploratory works, and it may not get accepted. So I just want to let you to take a look at the results (not sure if we want to merge it into the repository.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for filing this @UTSouthwesternDSSR! I wanted to comment on the results you’ve included and discuss potential next steps before diving into the code in detail.
First, including negative results in the repository is a good thing. So, let’s get this merged after the review, even if we change directions.
Second, I share your concern about the potential overcorrection by scMerge
.
I wonder if we might want to set the samples that don’t have B cells aside for a moment and aim for something where we don’t have to merge.
I am also considering this analysis in light of the discussion in #808. I’ll summarize what is most salient:
- inferCNV seems to perform better than CopyKAT there (and in a lot of our prior experience at the Data Lab)
- The choice of reference normal cells seems to matter a lot! (See: Exploratory CNV results and seeking general feedback on Wilms tumor annotation (SCPCP000014) #808 (reply in thread) and Exploratory CNV results and seeking general feedback on Wilms tumor annotation (SCPCP000014) #808 (reply in thread))
- We may want to look at the specific alterations being called rather than diploid vs. aneuploid. Here are a few references I identified:
Stepping back a bit — if we use B cells as the normal reference, we want to ensure those calls are solid. Because ScType assigns the same label to all cells in a cluster if there are 25% or more called as that cell type, I’d feel more confident about B cell calls where the majority of cells in that cluster have B cell as the highest scoring cell type, and if that cluster was of high quality (probably using purity for that evaluation). I sketched out some plots I might make for this analysis – a ridge plot looking at ScType scores, grouped by cell type, just for the cells called B cells, and a scatter plot comparing purity to B cell ScType score:
If we have solid B cell calls and detect expected alterations with inferCNV, we could consider leveraging samples with B cells to annotate those without B cells, e.g., by deriving marker genes.
Looking forward to hearing your thoughts!
gene_sets_prepare <- function(path_to_db_file, cell_type){ | ||
cell_markers = read.csv(path_to_db_file, header = T) | ||
cell_markers = cell_markers[cell_markers$tissueType == cell_type,] | ||
cell_markers$ensembl_id_positive_marker = gsub(" ","",cell_markers$ensembl_id_positive_marker); cell_markers$ensembl_id_negative_marker = gsub(" ","",cell_markers$ensembl_id_negative_marker) | ||
|
||
# correct gene symbols from the given DB (up-genes) | ||
cell_markers$ensembl_id_positive_marker = sapply(1:nrow(cell_markers), function(i){ | ||
markers_all = gsub(" ", "", unlist(strsplit(cell_markers$ensembl_id_positive_marker[i],","))) | ||
markers_all = toupper(markers_all[markers_all != "NA" & markers_all != ""]) | ||
markers_all = sort(markers_all) | ||
|
||
if(length(markers_all) > 0){ | ||
suppressMessages({markers_all = unique(na.omit(markers_all))}) #since the markers are provided in Ensembl ID, I removed checkGeneSymbols function here | ||
paste0(markers_all, collapse=",") | ||
} else { | ||
"" | ||
} | ||
}) | ||
|
||
# correct gene symbols from the given DB (down-genes) | ||
cell_markers$ensembl_id_negative_marker = sapply(1:nrow(cell_markers), function(i){ | ||
markers_all = gsub(" ", "", unlist(strsplit(cell_markers$ensembl_id_negative_marker[i],","))) | ||
markers_all = toupper(markers_all[markers_all != "NA" & markers_all != ""]) | ||
markers_all = sort(markers_all) | ||
|
||
if(length(markers_all) > 0){ | ||
suppressMessages({markers_all = unique(na.omit(markers_all))}) #since the markers are provided in Ensembl ID, I removed checkGeneSymbols function here | ||
paste0(markers_all, collapse=",") | ||
} else { | ||
"" | ||
} | ||
}) | ||
|
||
cell_markers$ensembl_id_positive_marker = gsub("///",",",cell_markers$ensembl_id_positive_marker);cell_markers$ensembl_id_positive_marker = gsub(" ","",cell_markers$ensembl_id_positive_marker) | ||
cell_markers$ensembl_id_negative_marker = gsub("///",",",cell_markers$ensembl_id_negative_marker);cell_markers$ensembl_id_negative_marker = gsub(" ","",cell_markers$ensembl_id_negative_marker) | ||
|
||
gs = lapply(1:nrow(cell_markers), function(j) gsub(" ","",unlist(strsplit(toString(cell_markers$ensembl_id_positive_marker[j]),",")))); names(gs) = cell_markers$cellName | ||
gs2 = lapply(1:nrow(cell_markers), function(j) gsub(" ","",unlist(strsplit(toString(cell_markers$ensembl_id_negative_marker[j]),",")))); names(gs2) = cell_markers$cellName | ||
|
||
list(gs_positive = gs, gs_negative = gs2) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the same as the function in 02-03_annotation.R
. Can you move this function to a separate file, perhaps scripts/util/gene-set-functions.R
, and then source it in this script and 02-03_annotation.R
:
source(file.path({the module directory}, "scripts/util/gene-set-functions.R"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can install the package this is from and then include it in renv.lock
:
# First, install rOpenScPCA
renv::install("AlexsLemonade/OpenScPCA-analysis:packages/rOpenScPCA")
# Second, run snapshot to add the package to renv.lock
renv::snapshot()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is close to ready to go in -- I'd just like to see some rearrangements to the structure first. Specifically, I'd like for us to move the merging and CopyKat steps into a subdirectory scripts/exploratory-analyses
and to renumber scripts that are in the "main workflow." The numbered steps should be run in CI.
Please let me know if you have any questions. Thanks again!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this into a directory called exploratory-analyses
:
analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/merging.multipanel_plot.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this into a directory called exploratory-analyses
:
analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/copykat_exploration.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this into a directory called exploratory-analyses
and rename to remove the number:
analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/merging.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to me to keep this as one of the numbered steps in case you use these results in refining the B cell calls:
analyses/cell-type-nonETP-ALL-03/scripts/05_cluster_evaluation.R
Giving it a more descriptive name as well 👍🏻
And then I'd change scripts/multipanel_plot.R
to scripts/04_multipanel_plot.R
, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also update the workflow that runs this module in CI?
Rscript scripts/multipanel_plot.R |
Would become:
Rscript scripts/04_multipanel_plot.R
Rscript scripts/05_cluster_evaluation.R
I can also make this change if you prefer!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just updated it. I am wondering if you get a chance to look at how I call the stability function (as shown below)
stability_df <- rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 1.0, seed = 10)
I am not sure if I do it right, because all ARI values are 0 for 20 replicates for all 11 samples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've asked @sjspielman to take a look!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @UTSouthwesternDSSR, I've had a look into your code and the rOpenScPCA::calculate_stability
function. Indeed, it seems that the result you are getting with all 0 values for ari here is correct. This result is, in part, emerging from your parameterization - with a resolution of 1 on this dataset, leiden clustering is actually giving each cell its own cluster, and as a consequence there is never any overlap between your input clusters and the bootstrapped ones.
I tried a few different things here while exploring (note that I only did 3 replicates here, just for speed). First, I tried running this function with a resolution of 0.01, which indeed gave some more sensible ari.
> rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 0.01, replicates = 3, seed = 10)
replicate ari algorithm weighting nn resolution objective_function
1 1 0.2089952 leiden jaccard 10 0.01 CPM
2 2 0.2016440 leiden jaccard 10 0.01 CPM
3 3 0.1910005 leiden jaccard 10 0.01 CPM
I also checked with a resolution of 1, but with the alternative objective function, which is "modularity" (see docs for rOpenScPCA::calculate_clusters()
or the igraph docs for leiden clustering):
stability_df <- rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 1, objective_function="modularity", replicates = 3, seed = 10)
replicate ari algorithm weighting nn resolution objective_function
1 1 0.3576057 leiden jaccard 10 1 modularity
2 2 0.3924719 leiden jaccard 10 1 modularity
3 3 0.3644018 leiden jaccard 10 1 modularity
So, the function is working, and the results you were getting were driven by the parameter space you specified.
I also dug a little bit into the method I believe you were using to generate these original clusters, and it seems that the default resolution parameter for that method is also 1 and they are indeed using the "modularity" (not "CPM") approach to calculate clusters: https://github.com/atarashansky/self-assembling-manifold/blob/73c6f425f6eb04452c4ef5aca2e42253111092c8/samalg/__init__.py#L1520.
Therefore, you want to be running the calculate_stability()
function specifying modularity and a resolution of 1 to match the original algorithm you used in the first place. This should give you more sensible ari values! Your code will look something like this:
rOpenScPCA::calculate_stability(
x = seu,
clusters = clusID.df$cluster,
pc_name = "Xpca_",
algorithm = "leiden",
resolution = 1,
objective_function = "modularity",
seed = 10
)
It's also very much worth noting that this original clustering method appears to hardcode the random seed to equal 0, so their clustering is effectively deterministic because you cannot vary the seed. You don't necessarily need to change anything, just be aware that's happening behind the scenes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the detailed explanation! If I were to set seed = 0
in rOpenScPCA::calculate_stability
, will it be better, or it does not matter, since it is not related to the leiden clustering method (as in I cannot replicate what I got from the leiden clustering)? Actually I could not remember why I put seed = 10 (sorry about that!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't worry about the seed for this - was just pointing it out! What you are doing so far is totally fine; the specific seed you use for the rOpenScPCA
functions doesn't matter, as long as you set something, and 10 is something :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please run renv::snapshot()
and commit and push the results? rOpenScPCA
has not been added to the lock file yet, causing the CI failure. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just updated the renv.lock
and pushed it. Sorry about it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I have been reviewing the results you posted and noticed that was failing 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more stringent cutoff for calling B cells with ScType is a reasonable thing to try 👍 I would maybe add some caution around treating all non-B cell scores equally — I would expect (and sometimes I see this in your results!) overlap in signature with plasma cells. I think it’s okay to move forward and keep this in the back of our minds.
Since you pulled in the results from SingleR and CellAssign, I was looking through the cell type reports we have for this project, which you can also obtain with the download script using --include-reports
:
./download-data.py --project SCPCP000003 --include-reports --profile openscpca
And it seems like the libraries often contain some population that should be normal that SingleR and CellAssign agree on, and that doesn’t always map cleanly to a cluster (considerable caveat: clusters are derived differently than what you have with SAM).
For example, in SCPCL000706
, the methods seem to agree on monocytes, plasma cells, and maybe some other B cell subtypes:
So, I wonder if looking at the overlap between multiple methods for several cell type populations we’d expect to be unaffected – where it varies from library to library – and using those cells as a reference with inferCNV could be a potentially viable alternative if the stringent ScType B cell strategy doesn’t pan out.
Purpose/implementation Section
This is an attempt to identify tumor cells in the sample without annotated B cells (including
SCPCL000706
, which has too manynot.defined
cells from CopyKat)Please link to the GitHub issue that this pull request addresses.
#811
What is the goal of this pull request?
Trying to identify the tumor cells in the 5 samples without B cells annotated (
SCPCL000710
,SCPCL000077
,SCPCL000704
,SCPCL000082
) or too manynot.defined
cells from CopyKat prediction (SCPCL000706
).Briefly describe the general approach you took to achieve this goal.
Merging these samples with
SCPCL000703
(the best annotated sample with B cells) respectively and correcting the batch effect withHarmony
, trying to deduce the tumor/normal status by the clustering pattern, if there is any.If known, do you anticipate filing additional pull requests to complete this analysis module?
Results
What is the name of your results bucket on S3?
rds
objects are found ins3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots
What types of results does your code produce (e.g., table, figure)?
rds
objectsumap
plots (multipanels_
) showing seurat cluster ID, cell type, library ID, and copykat prediction results of merged object. The cell type and copykat prediction results are obtained from individual library run. For those samples that do not have normal B cells, the copykat prediction will be NA (grey color).umap
plots (_splitPlot.png
) showing the leiden clusters and cell type from respective library in the merged object.What is your summary of the results?
SCPCL000710
, when it is merged/Harmony-corrected withSCPCL000703
, only cluster 10 (from leiden cluster annotation inSCPCL000710
) grouped with those of diploid cells (NK and CD4 T) inSCPCL000703
. My guess is that these cells from cluster 10 are diploids, but there aren't much proof for it, other than the clustering pattern.Actually, I did try to call CopyKat after merging/Harmony-corrected
SCPCL000703
,SCPCL000704
,SCPCL000710
, andSCPCL000077
, but only those that are originally called as aneuploid inSCPCL000703
remain as aneuploid, and the rest are all predicted as diploid.Then, I tried with
scMerge
, which returns a batch effect corrected gene expression matrix. I used their semi-supervised approach 1, which employs partial cell type information to create pseudo-replicates. I excluded theUnknown
in the annotation when providing the list. It shows a good clustering pattern betweenSCPCL000703
andSCPCL000710
, but now only theB cells
are called as diploid, and not even theCD4 T
andNK
cells inSCPCL000703
(those originally called asdiploid
in individual run) [I slightly tweaked the code in CopyKat to take in this corrected gene expression matrix, instead of the raw count]. It seems like scMerge is over-correcting the differences between two libraries, since the separation observed on PC1/PC2 umap plot is based on B cells vs non-B cells and that's how the diploid/aneuploid separation.Therefore, I think
Harmony
may be better in the sense that it is not over-correcting the differences between libraries.SCPCL000077
, when it is merged/Harmony-corrected withSCPCL000703
, only cluster 12 (from leiden cluster annotation inSCPCL000077
) grouped with those of diploid cells (B, Pre-Eryth, NK, and CD4 T) inSCPCL000703
. My guess is that these cells from cluster 12 are diploids, but again just based on the clustering pattern.SCPCL000704
, my guess is cluster 14 (from leiden cluster annotation inSCPCL000704
) is diploid.SCPCL000706
andSCPCL000082
, they don't really work in this way.Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Actually the guess is a bit far-fetched, because technically we could have mixture of diploids and aneuploids in a cluster, but I am not really sure how to call the tumor annotation without B cells annotated. When I ran CopyKat without providing normal, it gives
WARNING! NOT CONVERGENT
and it loops through every single cell. That's why I did not run CopyKat for those without normal B. Much appreciation for any suggestion!Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.