Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hello-clusters notebook: Perform and evaluate clustering #874

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

sjspielman
Copy link
Member

Closes #796

This PR adds the first notebook to the hello-clusters module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!
Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip

In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level comments:

I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.

I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example if bootstrapping and calculating calculating statistics on each round) you can end up causing trouble.

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved

All functions presented in this section take the following required arguments:

* An SCE or Seurat object that contains principal components
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we also just pass in the PCA matrix?

@jashapiro
Copy link
Member

then have separate sections about the considerations for input and output with SCE or Seurat objects.

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include.

@sjspielman
Copy link
Member Author

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity.

Great call, incoming.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation.

I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage?

@sjspielman
Copy link
Member Author

This is now ready for another look! Changes broadly include:

  • Code now uses a pca matrix throughout, except for the new section towards the end that shows how to use an object
  • A section for evaluating existing cluster results from Seurat or the ScPCA ones as examples
  • The Seurat object used throughout the examples is now generated via a Seurat pipeline from the raw counts, which is more realistic for how contributors would be using a Seurat object (based on our experience so far). I figure in the future, we can replace the conversion code here with a function we add to rOpenScPCA for doing the conversion.
  • I pitched evaluation more as "calculating QC metrics" rather than evaluating per se

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

@sjspielman
Copy link
Member Author

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

@jashapiro
Copy link
Member

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

@jashapiro
Copy link
Member

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

Hardcoding the seed here seems fine.

@sjspielman
Copy link
Member Author

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

Boo, sorry, I'll regenerate.
That said, I did remove the plots which is tentatively what I took your review to mean (see #874 (comment)). But, can easily restore!

@sjspielman
Copy link
Member Author

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good.

I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.

I also suggest moving all the Seurat content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.

Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.

Comment on lines +48 to +50
```{r set seed}
set.seed(2024)
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd usually do this after setting the paths.

Comment on lines +83 to 93
```{r create process seurat, message = FALSE}
# Convert to a Seurat object
seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA")

# Output files
# Process the object with a Seurat pipeline to obtain clusters
seurat_obj <- seurat_obj |>
SCTransform() |>
RunPCA() |>
FindNeighbors() |>
FindClusters()
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably do this with the Seurat portion.


Organize the remainder of your content into sections and subsections as appropriate for your analysis.
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.
We'll show how to do this for both SCE and Seurat objects:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that you need to show it for Seurat?


Organize the remainder of your content into sections and subsections as appropriate for your analysis.
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like this?

Suggested change
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters.
For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object.
We could also use the SCE object or a Seurat object directly, which we will demonstrate later.

Comment on lines +135 to +136
# Print resulting table
cluster_results_df
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to print the whole table? It is kind of a lot. Maybe just the first few rows?




## Calculate QC metrics on existing clusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this to after the object section and demonstrate these using the objects themselves.

Comment on lines +423 to +438
`rOpenScPCA` assumes that the PCA matrix is named `PCA` in SCE objects, and `pca` in Seurat objects (as they were in the above examples).
If the PCA matrix you want to use in the object has a different name, you can provide the argument `pc_name`.
We can see this below with an SCE object, for example:

```{r use pc_name}
# First, we'll rename the PCA matrix for demonstration
reducedDimNames(sce) <- c("PCA_matrix", "UMAP")
reducedDimNames(sce)

# Calculate clusters from an SCE object using default parameters
cluster_results_df <- calculate_clusters(
sce,
pc_name = "PCA_matrix"
)
cluster_results_df
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a detail we don't need to demonstrate. I would remove this section, or at least the code for it.

```{r rename columns}
# First, rename columns in `cluster_results_df` to avoid ambiguity
cluster_results_renamed_df <- cluster_results_df |>
# add the prefix openscpca_ to all columns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# add the prefix openscpca_ to all columns
# add the prefix "ropenscpca_" to all columns

# First, rename columns in `cluster_results_df` to avoid ambiguity
cluster_results_renamed_df <- cluster_results_df |>
# add the prefix openscpca_ to all columns
dplyr::rename_with(~ paste0("ropenscpca_", .x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modern R

Suggested change
dplyr::rename_with(~ paste0("ropenscpca_", .x))
dplyr::rename_with(\(x) paste0("ropenscpca_", x))


```{r add to seurat}
# Add the cluster result data frame to the Seurat object metadata
seurat_obj <- AddMetaData(seurat_obj, cluster_results_renamed_df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Seurat assume that the data frame is in the same order? If so, we might want to caution about that.

In the interests of simplicity, you might also want to show making that assumption above, in which case you can first check that it is true with all(colnames(sce) == cluster_results_df$cell_id) or something like that, then just add the clusers more simply with scr$ropenscpca_clusters <- cluster_results_df$cluster. I don't think we really need to show adding all of the cluster parameters?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add notebook to demonstrate clustering with rOpenScPCA
2 participants