hello-clusters notebook: Perform and evaluate clustering #874

sjspielman · 2024-11-12T18:12:11Z

Closes #796

This PR adds the first notebook to the hello-clusters module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!
Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip

In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.

jashapiro

High level comments:

I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.

I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example, if bootstrapping and calculating statistics on each round) you can end up causing trouble.
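
To make the looping concern concrete, here is a minimal base R sketch. The toy vector and variable names are illustrative assumptions, not code from the notebook. Resetting the seed inside each iteration makes every bootstrap round draw the identical resample, whereas setting it once keeps the whole run reproducible while still letting the rounds differ.

```r
# Toy data standing in for any per-cell statistic (hypothetical values)
values <- 1:100

# Anti-pattern: the seed is reset on every iteration, so each "bootstrap"
# round draws exactly the same resample and returns the same mean
reseeded_means <- vapply(
  1:5,
  \(i) {
    set.seed(2024)
    mean(sample(values, replace = TRUE))
  },
  numeric(1)
)

# Preferred: set the seed once, then let the global RNG advance
set.seed(2024)
single_seed_means <- vapply(
  1:5,
  \(i) mean(sample(values, replace = TRUE)),
  numeric(1)
)

# Number of distinct means in each case
length(unique(reseeded_means))
length(unique(single_seed_means))
```

The first vector collapses to a single repeated value, so any spread computed from those rounds would be meaningless.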

jashapiro · 2024-11-13T13:01:53Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+
+All functions presented in this section take the following required arguments:
+
+* An SCE or Seurat object that contains principal components


Can't we also just pass in the PCA matrix?

jashapiro · 2024-11-13T14:05:19Z

> then have separate sections about the considerations for input and output with SCE or Seurat objects.

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include.

sjspielman · 2024-11-14T18:01:20Z

> A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity.

Great call, incoming.

> My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation.

I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage?

sjspielman · 2024-11-15T19:45:44Z

This is now ready for another look! Changes broadly include:

* Code now uses a PCA matrix throughout, except for the new section towards the end that shows how to use an object
* A section for evaluating existing cluster results from Seurat or the ScPCA ones as examples
* The Seurat object used throughout the examples is now generated via a Seurat pipeline from the raw counts, which is more realistic for how contributors would be using a Seurat object (based on our experience so far). I figure in the future, we can replace the conversion code here with a function we add to rOpenScPCA for doing the conversion.
* I pitched evaluation more as "calculating QC metrics" rather than evaluating per se

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

sjspielman · 2024-11-15T19:54:10Z

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

jashapiro · 2024-11-15T19:55:10Z

> Here is the current version of the rendered notebook:
> 01_perform-evaluate-clustering.nb.html.zip

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

jashapiro · 2024-11-15T19:56:35Z

> Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

Hardcoding the seed here seems fine.

sjspielman · 2024-11-15T19:57:23Z

> This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

Boo, sorry, I'll regenerate.
That said, I did remove the plots which is tentatively what I took your review to mean (see #874 (comment)). But, can easily restore!

sjspielman · 2024-11-15T20:03:41Z

Better! 01_perform-evaluate-clustering.nb.html.zip

jashapiro

This looks pretty good.

I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.

I also suggest moving all the Seurat content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.

Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.

jashapiro · 2024-11-19T18:49:05Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+```{r set seed}
+set.seed(2024)
+```


I'd usually do this after setting the paths.

jashapiro · 2024-11-19T18:49:51Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+```{r create process seurat, message = FALSE}
+# Convert to a Seurat object
+seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA")

-# Output files
+# Process the object with a Seurat pipeline to obtain clusters
+seurat_obj <- seurat_obj |>
+  SCTransform() |>
+  RunPCA() |>
+  FindNeighbors() |>
+  FindClusters()
 ```


I'd probably do this with the Seurat portion.

jashapiro · 2024-11-19T18:51:07Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd


-Organize the remainder of your content into sections and subsections as appropriate for your analysis.
+We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.
+We'll show how to do this for both SCE and Seurat objects:


I don't know that you need to show it for Seurat?

jashapiro · 2024-11-19T18:53:38Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd


-Organize the remainder of your content into sections and subsections as appropriate for your analysis.
+We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.


Maybe something like this?

Suggested change

-We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.
+For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
+We could also use the SCE object or a Seurat object directly, which we will demonstrate later.

jashapiro · 2024-11-19T18:55:06Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+# Print resulting table
+cluster_results_df


Do we want to print the whole table? It is kind of a lot. Maybe just the first few rows?
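
A minimal sketch of what a shorter preview could look like, using a hypothetical stand-in for `cluster_results_df` (the column names and sizes below are assumptions for illustration, not the actual function output):

```r
# Hypothetical stand-in for the table returned by the clustering step;
# the column names and dimensions here are assumptions for illustration
cluster_results_df <- data.frame(
  cell_id = paste0("cell_", seq_len(1000)),
  cluster = factor(rep(1:10, each = 100))
)

# Show only the first few rows instead of the full table
preview_df <- head(cluster_results_df, n = 5)
preview_df

# A compact count of cells per cluster summarizes the rest of the table
cluster_sizes <- table(cluster_results_df$cluster)
cluster_sizes
```

Printing the head keeps the rendered notebook short, and the per-cluster counts give readers a sense of the full result without scrolling.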

jashapiro · 2024-11-19T19:11:08Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+
+
+
+##  Calculate QC metrics on existing clusters


I would move this to after the object section and demonstrate these using the objects themselves.

jashapiro · 2024-11-19T19:12:33Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+`rOpenScPCA` assumes that the PCA matrix is named `PCA` in SCE objects, and `pca` in Seurat objects (as they were in the above examples).
+If the PCA matrix you want to use in the object has a different name, you can provide the argument `pc_name`.
+We can see this below with an SCE object, for example:
+
+```{r use pc_name}
+# First, we'll rename the PCA matrix for demonstration
+reducedDimNames(sce) <- c("PCA_matrix", "UMAP")
+reducedDimNames(sce)
+
+# Calculate clusters from an SCE object using default parameters
+cluster_results_df <- calculate_clusters(
+  sce,
+  pc_name = "PCA_matrix"
+)
+cluster_results_df
+```


This feels like a detail we don't need to demonstrate. I would remove this section, or at least the code for it.

jashapiro · 2024-11-19T19:16:25Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+```{r rename columns}
+# First, rename columns in `cluster_results_df` to avoid ambiguity
+cluster_results_renamed_df <- cluster_results_df |>
+  # add the prefix openscpca_ to all columns


Suggested change

-  # add the prefix openscpca_ to all columns
+  # add the prefix "ropenscpca_" to all columns

jashapiro · 2024-11-19T19:17:24Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+# First, rename columns in `cluster_results_df` to avoid ambiguity
+cluster_results_renamed_df <- cluster_results_df |>
+  # add the prefix openscpca_ to all columns
+  dplyr::rename_with(~ paste0("ropenscpca_", .x))


Modern R

Suggested change

-  dplyr::rename_with(~ paste0("ropenscpca_", .x))
+  dplyr::rename_with(\(x) paste0("ropenscpca_", x))

jashapiro · 2024-11-19T19:23:50Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+
+```{r add to seurat}
+# Add the cluster result data frame to the Seurat object metadata
+seurat_obj <- AddMetaData(seurat_obj, cluster_results_renamed_df)


Does Seurat assume that the data frame is in the same order? If so, we might want to caution about that.

In the interests of simplicity, you might also want to show making that assumption above, in which case you can first check that it is true with `all(colnames(sce) == cluster_results_df$cell_id)` or something like that, then just add the clusters more simply with `sce$ropenscpca_clusters <- cluster_results_df$cluster`. I don't think we really need to show adding all of the cluster parameters?
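
A rough base R sketch of that check-then-assign idea follows. It uses plain stand-ins rather than a real SCE or Seurat object: `counts_mat` and `cell_metadata` are hypothetical names that play the roles of the object and its per-cell metadata, and the cluster table is deliberately shuffled to show what happens when the ordering assumption fails.

```r
# Hypothetical stand-ins: a matrix whose column names play the role of
# colnames(sce), and a cluster table stored in a different row order
counts_mat <- matrix(
  0,
  nrow = 2,
  ncol = 4,
  dimnames = list(
    c("gene_a", "gene_b"),
    c("cell_1", "cell_2", "cell_3", "cell_4")
  )
)
cluster_results_df <- data.frame(
  cell_id = c("cell_3", "cell_1", "cell_4", "cell_2"),
  cluster = c("B", "A", "B", "A")
)

# Check the ordering assumption before assigning by position
same_order <- all(colnames(counts_mat) == cluster_results_df$cell_id)

# When the order differs, align rows to the object's cell order with match()
row_order <- match(colnames(counts_mat), cluster_results_df$cell_id)
stopifnot(!anyNA(row_order)) # every cell must have a cluster assignment
aligned_clusters <- cluster_results_df$cluster[row_order]

# Stand-in for per-cell metadata; add just the single cluster column
cell_metadata <- data.frame(row.names = colnames(counts_mat))
cell_metadata$ropenscpca_clusters <- aligned_clusters
cell_metadata
```

If the check passes, the `match()` step is a no-op, so it can be kept as a cheap safeguard either way.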

sjspielman added 10 commits November 8, 2024 15:44

WIP: began sketching out notebook to eval clustering

3b2cc56

Continued WIP: begin to flesh out sections, some reorg as it comes to…

d6414e9

…gether

WIP: much progress. Code basically complete and most text complete

ecf1d09

final touchups to complete the first draft of this notebook, and left…

aa3d207

… some TODO comments to come back to

update README

25778f2

add line to run this notebook, and chmod

bd59389

Turn on GHA on PRs, update data download, and run module script

7b61e9b

samples flag is plural

adc43bf

dont need repo name with renv::update

a6623e6

bump ropenscpca for calculate_stability usage

039ad4c

sjspielman requested a review from jaclyn-taroni as a code owner November 12, 2024 18:12

sjspielman removed the request for review from jaclyn-taroni November 12, 2024 18:12

sjspielman added 3 commits November 12, 2024 15:56

update ropenscpca

a262ffc

need igraph deps

a99e962

missing a quote. sad.

464fc2a

sjspielman requested a review from jashapiro November 12, 2024 21:55

jashapiro reviewed Nov 13, 2024

sjspielman added 3 commits November 13, 2024 15:21

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

dab73e2

response to reviews: add parentheses for functions, just use backtick…

4f68ca9

…s for pca names, and add missing wording

one seed to rule them all, and fix yaml

0a7ed18

sjspielman added 7 commits November 14, 2024 14:50

WIP: rearranging notebook

26045f0

Continuing notebook reorg, added code for a seurat section for which …

6534d6d

…glmGamPoi is now needed in renv

WIP: delete extra text, and dont do stability with seurat

7b9ea6c

Finish notebook rearrangement

15f4a24

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

de4b1ce

fix header depth

32a401f

fix wording

a8c3f29

sjspielman requested a review from jashapiro November 15, 2024 19:45

sjspielman added 2 commits November 15, 2024 15:00

Add missing chunk name, and regenerate with script

cddab18

rm seed param and hardcode, and regenerate for real

3953290

jashapiro reviewed Nov 19, 2024
