Update clustering report to include leiden clustering #895

allyhawkins · 2024-11-19T16:29:58Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #878

What is the goal of this pull request?

Here I am updating the clustering report in the Ewing's module to use the functions in rOpenScPCA. While doing that, I also changed the clustering parameters tested so that we look at both Louvain and Leiden clustering.

Briefly describe the general approach you took to achieve this goal.

Clustering results are now generated with both Louvain and Leiden clustering using the rOpenScPCA::sweep_clusters() function. I varied the nearest neighbors (5-40 with increments of 5 as we were doing previously), the resolution (.5, 1, 1.5) and the objective function (CPM and modularity).
I calculated purity, silhouette width, and stability for all clustering options. I already had functions that gather these stats across a variety of parameters, so I used the rOpenScPCA functions where it made sense but mostly kept my own functions for getting results across all the parameters tested. I think I'm going to file issues to add this functionality to rOpenScPCA in the future.
I then updated the plots to display all of the results. Previously we had just been varying the number of nearest neighbors so I had to make a few changes to accommodate looking at multiple parameters. For all plots I grouped by algorithm + objective function meaning we have a separate plot for louvain, leiden-CPM, and leiden-modularity. Then for each of those I faceted by both nn and resolution.
I removed any functions that were no longer being used from the clustering-functions.R script in this module.

There are two other sections of this report that I ended up moving to a completely different report that I will update in a subsequent PR. These sections compare cluster assignments to cell type assignments and look at marker gene expression across clusters. I started by trying to create those plots across all clustering parameters tested but that resulted in a very long report and also a lot of plots that weren't very meaningful. From the metrics output it's pretty obvious which clustering is bad (looking at you leiden-CPM) and which parameters may be better. Because of that I am reframing this slightly so that the first report generates cluster results and metrics across all possible parameters. Then this report can be evaluated to identify an algorithm, nn, and resolution (or smaller range of nn and resolution) that could be used as input to the second report. This second report would help validate that these clusters are indeed separating by cell type and gene set as expected. This will also make it easy to move the first report with just the metrics here to the hello-clusters module since it is not dataset specific while the second report is a little more specific to the Ewing module.
I wanted to keep the code for this second report so I really just copied over a lot of the existing code and noted a TODO about actually finishing the report next.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Next up I'm going to do two things:

Fix the second report that looks at cell types and marker gene sets to take a specified set of clustering params as input.
Move calculating clusters to a script that outputs a table with all possible cluster assignments so that it is not inside this notebook. I think it makes sense to move this so that other modules can use the script to sweep across clustering options.

Results

What types of results does your code produce (e.g., table, figure)?

This is just a template report so no real results yet, but I'm including a rendered example report for easier review.

01-clustering-metrics.html.zip

For reference here's the previous version of the report:
01-clustering.html.zip

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I was able to generate the report locally.

Are there particular areas you'd like reviewers to have a close look at?

I think the main focus of the review here should be on two things:

Are we okay with the range of parameters used?
Are the plots easy to interpret or do you have any ideas on how the plots in 01-clustering-metrics.Rmd can be improved? Note that the only thing that changed from the previous plots is the faceting, but the content and type of plots were not changed.

Is there anything that you want to discuss further?

Just another note that you will not need to review or go line by line through 02-clustering-celltypes.Rmd. This report contains the second half of the original clustering report, just copied over, and I have not made any other adjustments to the code. I plan on fully updating that report in a later PR.

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

allyhawkins · 2024-11-19T19:48:15Z

Just noting that a74fe36 is there because I was seeing a failure in the CNV annotation workflow and CI wasn't passing. After looking into it, InferCNV was now detecting annotations on chrKT. I changed the code to include only chr1-22 when summarizing results, but that's unrelated to the other changes here.
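
As a rough illustration of the chromosome filter described above, here is a minimal base R sketch. The vector of labels and the variable names are made up for the example and are not taken from the workflow's code.

```r
# Hypothetical chromosome labels, including the unexpected chrKT
chromosomes <- c("chr1", "chr2", "chr22", "chrX", "chrKT", "chr10")

# Keep only chr1 through chr22 when summarizing
autosomes <- paste0("chr", 1:22)
kept <- chromosomes[chromosomes %in% autosomes]

print(kept)
```

Matching against an explicit allow-list means any new, unexpected label is dropped without needing to know its name in advance.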

sjspielman · 2024-11-19T21:05:03Z

I've only looked at your opening comment so far, but I wanted to note before I dive into the code - the resolution parameter for CPM should probably be way lower, like a few orders of magnitude. I'd try it in the [1e-4,1e-2] range to see if that suits you better. Along those lines, it might end up making sense to group plots by algorithm/resolution, rather than algorithm/objective function, as the objective function is only used by leiden. One annoying outcome of how the leiden parameters influence results is that it's hard to use sweep_clusters on both resolution and objective function at once, since resolution for CPM and modularity should be provided on such different scales

sjspielman

Code looks good to me overall! The main recommendation I have is, as I commented earlier, to play around with lower resolution values for CPM.

sjspielman · 2024-11-19T21:12:56Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+- Average cluster purity: This metric also evaluates cluster separation and tells us the proportion of neighboring cells that are assigned to the same cluster. 
+Purity values range from 0-1 with higher purity values indicating clusters that are well separated. 
+- Cluster stability: This evaluates how stable the clustering is to input data. 
+Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters. 


Suggested change

-Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters.
+Stability values range from 0-1 with higher values of cluster stability indicating more reproducible clusters.

sjspielman · 2024-11-19T21:13:36Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+
+- Average silhouette width: This metric evaluates cluster separation. 
+Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. 
+Higher values indicate tighter clusters.


might note it goes [-1,1]

sjspielman · 2024-11-19T21:22:18Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+                                     resolution = ~ glue::glue("{.}-res"))) +
+      theme(
+        aspect.ratio = 1,
+        legend.position = "none"


There's a lot of UMAPs here and it's hard to see where one stops and the other starts in the grid. I think adding a border will help!

Suggested change

-        legend.position = "none"
+        legend.position = "none",
+        panel.border = element_rect(color = "black", fill = NA))

allyhawkins · 2024-11-20T18:31:47Z

> I've only looked at your opening comment so far, but I wanted to note before I dive into the code - the resolution parameter for CPM should probably be way lower, like a few orders of magnitude. I'd try it in the [1e-4,1e-2] range to see if that suits you better. Along those lines, it might end up making sense to group plots by algorithm/resolution, rather than algorithm/objective function, as the objective function is only used by leiden. One annoying outcome of how the leiden parameters influence results is that it's hard to use sweep_clusters on both resolution and objective function at once, since resolution for CPM and modularity should be provided on such different scales

I went ahead and updated the resolution range used for leiden-CPM to test .001, .005, and .01. I tried .0001 and everything was just one cluster, so that didn't seem helpful, and above .01 most cells start to get assigned to their own individual clusters. I kept the plots grouped in the same way, though, so there's one plot for louvain, one plot for leiden-cpm, and one plot for leiden-modularity for each metric. To me that makes more sense than looking at two different objective functions with very different resolution ranges in the same plot.
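
One way to picture the separate resolution ranges is a parameter grid per plot group. This is only a sketch in base R: the list names and the use of `expand.grid()` are assumptions for illustration, not how the notebook or `rOpenScPCA::sweep_clusters()` is actually called, and it assumes louvain and leiden-modularity keep the original 0.5, 1, 1.5 resolutions.

```r
# Nearest neighbor values tested: 5 to 40 in increments of 5
nn_values <- seq(5, 40, 5)

# One grid per plot group (names are hypothetical); only leiden-cpm
# uses the much smaller resolution values
param_grids <- list(
  "louvain" = expand.grid(nn = nn_values, resolution = c(0.5, 1, 1.5)),
  "leiden-modularity" = expand.grid(nn = nn_values, resolution = c(0.5, 1, 1.5)),
  "leiden-cpm" = expand.grid(nn = nn_values, resolution = c(0.001, 0.005, 0.01))
)

# Number of parameter combinations in each group
n_combinations <- vapply(param_grids, nrow, integer(1))
print(n_combinations)
```

Keeping the grids separate means each group can be swept and plotted on its own resolution scale.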

Note that I also updated the UMAPs to have borders and fixed an issue in the AUCell/SingleR workflow in d7ed51b. When generating reports for that workflow, there were samples where marker gene expression was now 0 (due to the new rounding), so plots couldn't be made. I just added a message that the plots are missing and accounted for libraries with no marker gene expression.

Here's an updated report:
01-clustering-metrics.html.zip

sjspielman

This looks good to me! I left a few small comments, the main one being that it might make sense to use actual Rmd params for the algorithm parameters.

I realize that I don't think these clustering notebooks are being tested in CI right now. Remind me where we see these notebooks being slotted into the workflow eventually? It may be fine to not have in CI for this PR, but it should happen sooner rather than later, which is a thing I say from experience 🫠

sjspielman · 2024-11-20T19:31:52Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+
+- Objective function: CPM and modularity 
+- Nearest neighbors: 5, 10, 15, 20, 25, 30, 35, and 40
+- Resolution: 0.5, 1, 1.5


Needs the new values used for CPM

sjspielman · 2024-11-20T19:33:44Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+
+## Calculate clusters
+
+


I might define a chunk here for parameters, e.g. nn <- seq(5, 40, 5). Or we could actually make these actual Rmd params with the values you use here as defaults, which would allow some flexibility if it's needed when running this as a template across samples.
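
To make the second option concrete, here is a small sketch of how default values plus per-run overrides could behave. In a real notebook the `params` list is built by R Markdown from the YAML header and can be overridden through the `params` argument of `rmarkdown::render()`. Here a plain list stands in for it, and all field names are hypothetical.

```r
# Stand-in for the `params` list R Markdown builds from the YAML header;
# the field names are hypothetical
default_params <- list(nn_min = 5, nn_max = 40, nn_increment = 5)

# Values a caller might override when rendering the template for one sample
overrides <- list(nn_max = 20)
params <- utils::modifyList(default_params, overrides)

# Build the nearest neighbors range from the (possibly overridden) params
nn <- seq(params$nn_min, params$nn_max, params$nn_increment)
print(nn)
```

With no overrides, the same code gives back the default range of 5 to 40 in increments of 5.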

sjspielman · 2024-11-20T19:37:28Z

analyses/cell-type-ewings/template_notebooks/clustering-workflow/01-clustering-metrics.Rmd

+      facet_grid(rows = vars(nn),
+                 cols = vars(resolution),


I think I might actually flip these so the overall look is more horizontal? You may also need to change the r chunk fig width/height.

allyhawkins · 2024-11-20

Since there are more nn values used than resolution values, I think a wider plot would be hard to read and would make the UMAPs really small. So I'm not super inclined to change this.

allyhawkins · 2024-11-20T21:18:55Z

@sjspielman I updated this so that the parameter values being tested are provided as params to the notebook.

> I realize that I don't think these clustering notebooks are being tested in CI right now. Remind me where we see these notebooks being slotted into the workflow eventually? It may be fine to not have in CI for this PR, but it should happen sooner rather than later, which is a thing I say from experience 🫠

The plan is for this to be a workflow that renders this report on all samples in the project (see #686). My next step was to create that workflow and then add running that workflow to CI.

I didn't make the plot change only because it made some really small UMAPs that were hard to read. This should be ready for another look!

sjspielman

LGTM!

allyhawkins added 4 commits November 18, 2024 17:05

delete functions now in rOpenScPCA

5d714ce

use default stability reps

2f8b61b

add leiden clustering to report

25a0191

move cell type and gene expression plots

9dc11a9

allyhawkins requested a review from jaclyn-taroni as a code owner November 19, 2024 16:29

allyhawkins requested review from sjspielman and removed request for jaclyn-taroni November 19, 2024 16:30

This was referenced Nov 19, 2024

Update report to compare clustering results to cell types and marker gene sets in Ewing module #896

Open

Script for calculating clusters across parameters in Ewing module #897

Closed

allyhawkins added 2 commits November 19, 2024 13:44

account for weird chromosomes showing up

a74fe36

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/ew…

c980387

…ings-update-clustering

sjspielman reviewed Nov 19, 2024

allyhawkins added 4 commits November 20, 2024 11:14

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/ew…

ed46cdc

…ings-update-clustering

account for no marker gene expression

d7ed51b

adjust range and add boxes around umaps

a1f8af7

use different resolution for cpm

5d25617

allyhawkins requested a review from sjspielman November 20, 2024 18:31

sjspielman reviewed Nov 20, 2024

allyhawkins added 3 commits November 20, 2024 14:52

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/ew…

5d825f3

…ings-update-clustering

add parameters as actual parameters

2479a81

change back to current

9fd9a5e

allyhawkins requested a review from sjspielman November 20, 2024 21:18

sjspielman approved these changes Nov 21, 2024

allyhawkins merged commit f1ab752 into AlexsLemonade:main Nov 21, 2024
3 checks passed

allyhawkins deleted the allyhawkins/ewings-update-clustering branch November 21, 2024 15:21

allyhawkins mentioned this pull request Nov 22, 2024

Add workflow for evaluating clustering to Ewing's module #908

Merged

8 tasks
