Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to identify tumor cells in those non-ETP samples without B cells (SCPCP000003) #815

Merged
merged 19 commits into from
Oct 18, 2024

Conversation

UTSouthwesternDSSR
Copy link
Contributor

Purpose/implementation Section

This is an attempt to identify tumor cells in the sample without annotated B cells (including SCPCL000706, which has too many not.defined cells from CopyKat)

Please link to the GitHub issue that this pull request addresses.

#811

What is the goal of this pull request?

Trying to identify the tumor cells in the 5 samples without B cells annotated (SCPCL000710, SCPCL000077, SCPCL000704, SCPCL000082) or too many not.defined cells from CopyKat prediction (SCPCL000706).

Briefly describe the general approach you took to achieve this goal.

Merging these samples with SCPCL000703 (the best annotated sample with B cells) respectively and correcting the batch effect with Harmony, trying to deduce the tumor/normal status by the clustering pattern, if there is any.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Results

What is the name of your results bucket on S3?

  • merged rds objects are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
  • umap plots are found in s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots

What types of results does your code produce (e.g., table, figure)?

  • merged rds objects
  • umap plots (multipanels_) showing seurat cluster ID, cell type, library ID, and copykat prediction results of merged object. The cell type and copykat prediction results are obtained from individual library run. For those samples that do not have normal B cells, the copykat prediction will be NA (grey color).
  • umap plots (_splitPlot.png) showing the leiden clusters and cell type from respective library in the merged object.

What is your summary of the results?

  • For SCPCL000710, when it is merged/Harmony-corrected with SCPCL000703, only cluster 10 (from leiden cluster annotation in SCPCL000710) grouped with those of diploid cells (NK and CD4 T) in SCPCL000703. My guess is that these cells from cluster 10 are diploids, but there aren't much proof for it, other than the clustering pattern.
Screenshot 2024-10-15 at 10 19 45 AM

Actually, I did try to call CopyKat after merging/Harmony-corrected SCPCL000703,SCPCL000704,SCPCL000710, and SCPCL000077, but only those that are originally called as aneuploid in SCPCL000703 remain as aneuploid, and the rest are all predicted as diploid.
Screenshot 2024-10-15 at 10 29 19 AM

Then, I tried with scMerge, which returns a batch effect corrected gene expression matrix. I used their semi-supervised approach 1, which employs partial cell type information to create pseudo-replicates. I excluded the Unknown in the annotation when providing the list. It shows a good clustering pattern between SCPCL000703 and SCPCL000710, but now only the B cells are called as diploid, and not even the CD4 T and NK cells in SCPCL000703 (those originally called as diploid in individual run) [I slightly tweaked the code in CopyKat to take in this corrected gene expression matrix, instead of the raw count]. It seems like scMerge is over-correcting the differences between two libraries, since the separation observed on PC1/PC2 umap plot is based on B cells vs non-B cells and that's how the diploid/aneuploid separation.
Screenshot 2024-10-15 at 10 46 47 AM

Therefore, I think Harmony may be better in the sense that it is not over-correcting the differences between libraries.

  • For SCPCL000077, when it is merged/Harmony-corrected with SCPCL000703, only cluster 12 (from leiden cluster annotation in SCPCL000077) grouped with those of diploid cells (B, Pre-Eryth, NK, and CD4 T) in SCPCL000703. My guess is that these cells from cluster 12 are diploids, but again just based on the clustering pattern.
Screenshot 2024-10-15 at 10 11 09 AM
  • For SCPCL000704, my guess is cluster 14 (from leiden cluster annotation in SCPCL000704) is diploid.
Screenshot 2024-10-15 at 3 34 40 PM
  • For SCPCL000706 and SCPCL000082, they don't really work in this way.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

  • The packages are installed and updated in renv.lock and conda.lock.
  • The analysis is executed on a Standard-4XL virtual machine via AWS Lightsail for Research.

Are there particularly areas you'd like reviewers to have a close look at?

Is there anything that you want to discuss further?

Actually the guess is a bit far-fetched, because technically we could have mixture of diploids and aneuploids in a cluster, but I am not really sure how to call the tumor annotation without B cells annotated. When I ran CopyKat without providing normal, it gives WARNING! NOT CONVERGENT and it loops through every single cell. That's why I did not run CopyKat for those without normal B. Much appreciation for any suggestion!

Author checklists

Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@UTSouthwesternDSSR
Copy link
Contributor Author

As a side note, to address the questions regarding how well the clusters are separated and their stability/purity, I wrote a script run_evalClus.R to run the three functions that are provided in evaluate-clusters.R. (I am not sure how to call it, so I copy the whole functions into my own script). It seems that the average purity per cluster is not bad (>50%) for all samples, except SCPCL000706 and SCPCL000082. The average silhouette score per cluster is equally bad for all samples, and those clusters that have average silhouette score ~0.2 or higher are usually located outside of the main "island" and exist in a tight cluster, although cluster 7 of SCPCL000706 (score: 0.35) does not follow the trend. Surprisingly, the cluster 2 and 3 of SCPCL000082 have relatively good average silhouette score (~0.2).

I also included a script copykat_exploration.R to show (1) the stacked bar plots of cell type count and (2) the blast module score (in violin plots) among aneuploid, diploid, and not.defined cells from the CopyKat prediction results. Although the expectation is that aneuploid will have the highest blast module score, but it is highly dependent on how many blasts cells are annotated as diploid or aneuploid.

I guess it is a bit confusing. Sorry about putting them in a single PR! I just thought that these are the exploratory works, and it may not get accepted. So I just want to let you to take a look at the results (not sure if we want to merge it into the repository.)

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for filing this @UTSouthwesternDSSR! I wanted to comment on the results you’ve included and discuss potential next steps before diving into the code in detail.

First, including negative results in the repository is a good thing. So, let’s get this merged after the review, even if we change directions.

Second, I share your concern about the potential overcorrection by scMerge.

I wonder if we might want to set the samples that don’t have B cells aside for a moment and aim for something where we don’t have to merge.

I am also considering this analysis in light of the discussion in #808. I’ll summarize what is most salient:

Stepping back a bit — if we use B cells as the normal reference, we want to ensure those calls are solid. Because ScType assigns the same label to all cells in a cluster if there are 25% or more called as that cell type, I’d feel more confident about B cell calls where the majority of cells in that cluster have B cell as the highest scoring cell type, and if that cluster was of high quality (probably using purity for that evaluation). I sketched out some plots I might make for this analysis – a ridge plot looking at ScType scores, grouped by cell type, just for the cells called B cells, and a scatter plot comparing purity to B cell ScType score:

If we have solid B cell calls and detect expected alterations with inferCNV, we could consider leveraging samples with B cells to annotate those without B cells, e.g., by deriving marker genes.

Looking forward to hearing your thoughts!

Comment on lines 11 to 51
gene_sets_prepare <- function(path_to_db_file, cell_type){
cell_markers = read.csv(path_to_db_file, header = T)
cell_markers = cell_markers[cell_markers$tissueType == cell_type,]
cell_markers$ensembl_id_positive_marker = gsub(" ","",cell_markers$ensembl_id_positive_marker); cell_markers$ensembl_id_negative_marker = gsub(" ","",cell_markers$ensembl_id_negative_marker)

# correct gene symbols from the given DB (up-genes)
cell_markers$ensembl_id_positive_marker = sapply(1:nrow(cell_markers), function(i){
markers_all = gsub(" ", "", unlist(strsplit(cell_markers$ensembl_id_positive_marker[i],",")))
markers_all = toupper(markers_all[markers_all != "NA" & markers_all != ""])
markers_all = sort(markers_all)

if(length(markers_all) > 0){
suppressMessages({markers_all = unique(na.omit(markers_all))}) #since the markers are provided in Ensembl ID, I removed checkGeneSymbols function here
paste0(markers_all, collapse=",")
} else {
""
}
})

# correct gene symbols from the given DB (down-genes)
cell_markers$ensembl_id_negative_marker = sapply(1:nrow(cell_markers), function(i){
markers_all = gsub(" ", "", unlist(strsplit(cell_markers$ensembl_id_negative_marker[i],",")))
markers_all = toupper(markers_all[markers_all != "NA" & markers_all != ""])
markers_all = sort(markers_all)

if(length(markers_all) > 0){
suppressMessages({markers_all = unique(na.omit(markers_all))}) #since the markers are provided in Ensembl ID, I removed checkGeneSymbols function here
paste0(markers_all, collapse=",")
} else {
""
}
})

cell_markers$ensembl_id_positive_marker = gsub("///",",",cell_markers$ensembl_id_positive_marker);cell_markers$ensembl_id_positive_marker = gsub(" ","",cell_markers$ensembl_id_positive_marker)
cell_markers$ensembl_id_negative_marker = gsub("///",",",cell_markers$ensembl_id_negative_marker);cell_markers$ensembl_id_negative_marker = gsub(" ","",cell_markers$ensembl_id_negative_marker)

gs = lapply(1:nrow(cell_markers), function(j) gsub(" ","",unlist(strsplit(toString(cell_markers$ensembl_id_positive_marker[j]),",")))); names(gs) = cell_markers$cellName
gs2 = lapply(1:nrow(cell_markers), function(j) gsub(" ","",unlist(strsplit(toString(cell_markers$ensembl_id_negative_marker[j]),",")))); names(gs2) = cell_markers$cellName

list(gs_positive = gs, gs_negative = gs2)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the same as the function in 02-03_annotation.R. Can you move this function to a separate file, perhaps scripts/util/gene-set-functions.R, and then source it in this script and 02-03_annotation.R:

source(file.path({the module directory}, "scripts/util/gene-set-functions.R"))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can install the package this is from and then include it in renv.lock:

# First, install rOpenScPCA
renv::install("AlexsLemonade/OpenScPCA-analysis:packages/rOpenScPCA")

# Second, run snapshot to add the package to renv.lock
renv::snapshot()

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is close to ready to go in -- I'd just like to see some rearrangements to the structure first. Specifically, I'd like for us to move the merging and CopyKat steps into a subdirectory scripts/exploratory-analyses and to renumber scripts that are in the "main workflow." The numbered steps should be run in CI.

Please let me know if you have any questions. Thanks again!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this into a directory called exploratory-analyses:

analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/merging.multipanel_plot.R

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this into a directory called exploratory-analyses:

analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/copykat_exploration.R

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this into a directory called exploratory-analyses and rename to remove the number:

analyses/cell-type-nonETP-ALL-03/scripts/exploratory-analyses/merging.R

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to me to keep this as one of the numbered steps in case you use these results in refining the B cell calls:

analyses/cell-type-nonETP-ALL-03/scripts/05_cluster_evaluation.R

Giving it a more descriptive name as well 👍🏻

And then I'd change scripts/multipanel_plot.R to scripts/04_multipanel_plot.R, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also update the workflow that runs this module in CI?

Would become:

Rscript scripts/04_multipanel_plot.R
Rscript scripts/05_cluster_evaluation.R

I can also make this change if you prefer!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I just updated it. I am wondering if you get a chance to look at how I call the stability function (as shown below)
stability_df <- rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 1.0, seed = 10)
I am not sure if I do it right, because all ARI values are 0 for 20 replicates for all 11 samples.
Screenshot 2024-10-17 at 3 20 09 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've asked @sjspielman to take a look!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @UTSouthwesternDSSR, I've had a look into your code and the rOpenScPCA::calculate_stability function. Indeed, it seems that the result you are getting with all 0 values for ari here is correct. This result is, in part, emerging from your parameterization - with a resolution of 1 on this dataset, leiden clustering is actually giving each cell its own cluster, and as a consequence there is never any overlap between your input clusters and the bootstrapped ones.

I tried a few different things here while exploring (note that I only did 3 replicates here, just for speed). First, I tried running this function with a resolution of 0.01, which indeed gave some more sensible ari.

>   rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 0.01, replicates = 3, seed = 10)
  replicate       ari algorithm weighting nn resolution objective_function
1         1 0.2089952    leiden   jaccard 10       0.01                CPM
2         2 0.2016440    leiden   jaccard 10       0.01                CPM
3         3 0.1910005    leiden   jaccard 10       0.01                CPM

I also checked with a resolution of 1, but with the alternative objective function, which is "modularity" (see docs for rOpenScPCA::calculate_clusters() or the igraph docs for leiden clustering):

stability_df <- rOpenScPCA::calculate_stability(x = seu, clusters = clusID.df$cluster, pc_name = "Xpca_", algorithm = "leiden", resolution = 1, objective_function="modularity", replicates = 3, seed = 10)
  replicate       ari algorithm weighting nn resolution objective_function
1         1 0.3576057    leiden   jaccard 10          1         modularity
2         2 0.3924719    leiden   jaccard 10          1         modularity
3         3 0.3644018    leiden   jaccard 10          1         modularity

So, the function is working, and the results you were getting were driven by the parameter space you specified.

I also dug a little bit into the method I believe you were using to generate these original clusters, and it seems that the default resolution parameter for that method is also 1 and they are indeed using the "modularity" (not "CPM") approach to calculate clusters: https://github.com/atarashansky/self-assembling-manifold/blob/73c6f425f6eb04452c4ef5aca2e42253111092c8/samalg/__init__.py#L1520.

Therefore, you want to be running the calculate_stability() function specifying modularity and a resolution of 1 to match the original algorithm you used in the first place. This should give you more sensible ari values! Your code will look something like this:

 rOpenScPCA::calculate_stability(
  x = seu, 
  clusters = clusID.df$cluster, 
  pc_name = "Xpca_", 
  algorithm = "leiden",
  resolution = 1, 
  objective_function = "modularity", 
  seed = 10
)

It's also very much worth noting that this original clustering method appears to hardcode the random seed to equal 0, so their clustering is effectively deterministic because you cannot vary the seed. You don't necessarily need to change anything, just be aware that's happening behind the scenes!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the detailed explanation! If I were to set seed = 0 in rOpenScPCA::calculate_stability, will it be better, or it does not matter, since it is not related to the leiden clustering method (as in I cannot replicate what I got from the leiden clustering)? Actually I could not remember why I put seed = 10 (sorry about that!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about the seed for this - was just pointing it out! What you are doing so far is totally fine; the specific seed you use for the rOpenScPCA functions doesn't matter, as long as you set something, and 10 is something :)

@UTSouthwesternDSSR
Copy link
Contributor Author

UTSouthwesternDSSR commented Oct 17, 2024

Based on your suggestion, I plotted ScType score of different cell types for the B cells annotated by (1) ScType, (2) SingleR, and (3) CellAssign for the left panel, and the scatter plot of B cell ScType score vs purity for the same sets of cells, colored by Leiden cluster ID for the right panel.

It seems like SingleR is pretty comparable with ScType, in the sense that the called B cells are mostly originated from the same cluster as ScType calls, although its number is the least among the 3 methods. For SCPCL000706, it is quite clear that there is no strong signal for B cells.
Screenshot 2024-10-17 at 4 04 50 PM

As for SCPCL000077, the 29 B cells called by SingleR seem to be pretty clean, and indeed cluster 12 (237 cells) shows up-regulation in B marker genes, although there is a strong signal for Other T in the same cluster [The sum of ScType score for Other T is higher than those of B, that's why it gets assigned as Other T]. I try to purify the B cells, by setting the cutoff of 99 percentile of non-B ScType score for those clusters assigned as B (in this case, it will be cluster 12), and then call the new B cells as those cells with B ScType score larger than the cutoff.

Bcell.names <- colnames(seu)[which(seu$sctype_classification=="B")]
df <- sctype.score[match(Bcell.names, rownames(sctype.score)),]
nonB.df <- df[,which(!colnames(df) %in% "B")]
cutoff <- quantile(unlist(nonB.df, use.names = F), probs = 0.99) #99 percentile of non-B ScType score for annotated B
seu$B_SctypeScore <- sctype.score$B[match(rownames(sctype.score),colnames(seu))]
new.B <- colnames(seu)[which(seu$B_SctypeScore > cutoff)]

I plot the ScType score of those cells in ridgeplot and feature plot (bottom right panel).
Screenshot 2024-10-17 at 8 18 26 PM

Similarly, I use the same idea to attempt finding B cells in SCPCL000082 by selecting clusters that showed some signal in the dotplot for B_Feature1 (as shown below), but I think the value of B ScType score is too low, as compared to other samples.
Screenshot 2024-10-17 at 8 31 00 PM

Please let me know what do you think about using 99 percentile of non-B ScType score as the cutoff for purifying B cells. Thank you!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please run renv::snapshot() and commit and push the results? rOpenScPCA has not been added to the lock file yet, causing the CI failure. Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I just updated the renv.lock and pushed it. Sorry about it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I have been reviewing the results you posted and noticed that was failing 👍🏻

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more stringent cutoff for calling B cells with ScType is a reasonable thing to try 👍 I would maybe add some caution around treating all non-B cell scores equally — I would expect (and sometimes I see this in your results!) overlap in signature with plasma cells. I think it’s okay to move forward and keep this in the back of our minds.

Since you pulled in the results from SingleR and CellAssign, I was looking through the cell type reports we have for this project, which you can also obtain with the download script using --include-reports:

./download-data.py --project SCPCP000003 --include-reports --profile openscpca

And it seems like the libraries often contain some population that should be normal that SingleR and CellAssign agree on, and that doesn’t always map cleanly to a cluster (considerable caveat: clusters are derived differently than what you have with SAM).

For example, in SCPCL000706 , the methods seem to agree on monocytes, plasma cells, and maybe some other B cell subtypes:

SCPCL000706-jaccard-index

So, I wonder if looking at the overlap between multiple methods for several cell type populations we’d expect to be unaffected – where it varies from library to library – and using those cells as a reference with inferCNV could be a potentially viable alternative if the stringent ScType B cell strategy doesn’t pan out.

@jaclyn-taroni jaclyn-taroni merged commit 1666669 into AlexsLemonade:main Oct 18, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants