Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update clustering report to include leiden clustering #895

Merged

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #878

What is the goal of this pull request?

Here I am updating the clustering report in the Ewing's module to use the functions in rOpenScPCA. While doing that I also made some changes to the clustering parameters tested to look at both louvain and leiden clustering.

Briefly describe the general approach you took to achieve this goal.

  • Clustering results are now generated with both Louvain and Leiden clustering using the rOpenScPCA::sweep_clusters() function. I varied the nearest neighbors (5-40 with increments of 5 as we were doing previously), the resolution (.5, 1, 1.5) and the objective function (CPM and modularity).
  • I calculated purity, width, and stability for all clustering options. I had functions written that help get these stats across a variety of parameters so I kept these functions and used the ones from rOpenScPCA when it made sense, but mostly I still have the functions I wrote that get the results across a variety of parameters tested. I think I'm going to file issues to add this functionality to rOpenScPCA in the future.
  • I then updated the plots to display all of the results. Previously we had just been varying the number of nearest neighbors so I had to make a few changes to accommodate looking at multiple parameters. For all plots I grouped by algorithm + objective function meaning we have a separate plot for louvain, leiden-CPM, and leiden-modularity. Then for each of those I faceted by both nn and resolution.
  • I removed any functions that were no longer being used from the clustering-functions.R script in this module.

There are two other sections of this report that I ended up moving to a completely different report that I will update in a subsequent PR. These sections compare cluster assignments to cell type assignments and look at marker gene expression across clusters. I started by trying to create those plots across all clustering parameters tested but that resulted in a very long report and also a lot of plots that weren't very meaningful. From the metrics output it's pretty obvious which clustering is bad (looking at you leiden-CPM) and which parameters may be better. Because of that I am reframing this slightly so that the first report generates cluster results and metrics across all possible parameters. Then this report can be evaluated to identify an algorithm, nn, and resolution (or smaller range of nn and resolution) that could be used as input to the second report. This second report would help validate that these clusters are indeed separating by cell type and gene set as expected. This will also make it easy to move the first report with just the metrics here to the hello-clusters module since it is not dataset specific while the second report is a little more specific to the Ewing module.
I wanted to keep the code for this second report so I really just copied over a lot of the existing code and noted a TODO about actually finishing the report next.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes! Next up I'm going to do two things:

  • Fix the second report that looks at cell types and marker gene sets to take a specified set of clustering params as input.
  • Move calculating clusters to a script that outputs a table with all possible cluster assignments so that it is not inside this notebook. I think it makes sense to move this so that other modules can use the script to sweep across clustering options.

Results

What types of results does your code produce (e.g., table, figure)?

This is just a template report so no real results yet, but I'm including a rendered example report for easier review.

01-clustering-metrics.html.zip

For reference here's the previous version of the report:
01-clustering.html.zip

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

I was able to generate the report locally.

Are there particularly areas you'd like reviewers to have a close look at?

I think the main focus of the review here should be on two things:

  • Are we okay with the range of parameters used?
  • Are the plots easy to interpret or do you have any ideas on how the plots in 01-clustering-metrics.Rmd can be improved? Note that the only thing that changed from the previous plots is the faceting, but the content and type of plots were not changed.

Is there anything that you want to discuss further?

Just another note that you will not need to review/ go line by line for 02-clustering-celltypes.Rmd. This report contains the second half of the original clustering report just copied over and I have not made any other adjustments to the code. I plan on fully updating that report in a later PR.

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins
Copy link
Member Author

Just noting that a74fe36 is because I was seeing a failure in the CNV annotation workflow and CIT wasn't passing. After looking into it, InferCNV was now detecting annotations on chrKT. I made a change to just include chr1-22 rather for summarizing results but it's unrelated to other changes here.

@sjspielman
Copy link
Member

I've only looked at your opening comment so far, but I wanted to note before I dive into the code - the resolution parameter for CPM should probably be way lower, like a few orders of magnitude. I'd try it in the [1e-4,1e-2] range to see if that suits you better. Along those lines, it might end up making sense to group plots by algorithm/resolution, rather than algorithm/objective function, as the objective function is only used by leiden. One annoying outcome of how the leiden parameters influence results is that it's hard to use sweep_clusters on both resolution and objective function at once, since resolution for CPM and modularity should be provided on such different scales

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me overall! The main recommendation I have is, as I commented earlier, to play around with lower resolution values for CPM.

- Average cluster purity: This metric also evaluates cluster separation and tells us the proportion of neighboring cells that are assigned to the same cluster.
Purity values range from 0-1 with higher purity values indicating clusters that are well separated.
- Cluster stability: This evaluates how stable the clustering is to input data.
Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters.
Stability values range from 0-1 with higher values of cluster stability indicating more reproducible clusters.


- Average silhouette width: This metric evaluates cluster separation.
Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters.
Higher values indicate tighter clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might note it goes [-1,1]

resolution = ~ glue::glue("{.}-res"))) +
theme(
aspect.ratio = 1,
legend.position = "none"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of UMAPs here and it's hard to see where one stops and the other starts in the grid. I think adding a border will help!

Suggested change
legend.position = "none"
legend.position = "none",
panel.border = element_rect(color = "black", fill = NA))

@allyhawkins
Copy link
Member Author

I've only looked at your opening comment so far, but I wanted to note before I dive into the code - the resolution parameter for CPM should probably be way lower, like a few orders of magnitude. I'd try it in the [1e-4,1e-2] range to see if that suits you better. Along those lines, it might end up making sense to group plots by algorithm/resolution, rather than algorithm/objective function, as the objective function is only used by leiden. One annoying outcome of how the leiden parameters influence results is that it's hard to use sweep_clusters on both resolution and objective function at once, since resolution for CPM and modularity should be provided on such different scales

I went ahead and updated the resolution range used for leiden-CPM to test .001, .005, and .01. I tried .0001 and everything was just one cluster so that didn't seem helpful and above .01 things start to get assigned to all individual clusters for the most part. But I kept the plots grouped in the same way so there's one plot for louvain, one plot for leiden-cpm, and one plot for leiden-modularity for each metric. To me that makes the most sense rather than looking at two different objective functions with very different resolution ranges in the same plot.

Note that I also updated the UMAPs to have borders and fixed an issue in the AUCell/SingleR workflow in d7ed51b. When generating reports for that workflow, there were samples where marker gene expression was now 0 (due to the new rounding), so plots couldn't be made. I just added a message that the plots are missing and accounted for libraries with no marker gene expression.

Here's an updated report:
01-clustering-metrics.html.zip

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I left a few small comments, the main one being that it might make sense to use actual Rmd params for the algorithm parameters.

I realize that I don't think these clustering notebooks are being tested in CI right now. Remind me where we see these notebooks being slotted into the workflow eventually? It may be fine to not have in CI for this PR, but it should happen sooner rather than later, which is a thing I say from experience 🫠


- Objective function: CPM and modularity
- Nearest neighbors: 5, 10, 15, 20, 25, 30, 35, and 40
- Resolution: 0.5, 1, 1.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs the new values used for CPM


## Calculate clusters


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might define a chunk here for parameters, e.g. nn <- seq(5, 40, 5). Or we could actually make these actual Rmd params with the values you use here as defaults, which would allow some flexibility if it's needed when running this as a template across samples.

Comment on lines +154 to +155
facet_grid(rows = vars(nn),
cols = vars(resolution),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might actually flip these so the overall look is more horizontal? You may also need to change the r chunk fig width/height.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are more nn values used than resolution I think having a wider plot would be hard and make the UMAPs really small. So I'm not super inclined to change this.

@allyhawkins
Copy link
Member Author

@sjspielman I updated this so that the parameter values being tested are provided as params to the notebook.

I realize that I don't think these clustering notebooks are being tested in CI right now. Remind me where we see these notebooks being slotted into the workflow eventually? It may be fine to not have in CI for this PR, but it should happen sooner rather than later, which is a thing I say from experience 🫠

The plan is for this to be a workflow that renders this report on all samples in the project (see #686). My next step was to create that workflow and then add running that workflow to CI.

I didn't make the plot change only because it made some really small umaps that were hard to read. This should be ready for another look!

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@allyhawkins allyhawkins merged commit f1ab752 into AlexsLemonade:main Nov 21, 2024
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/ewings-update-clustering branch November 21, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use clustering functions from rOpenScPCA in cell-type-ewings module cluster report
2 participants