-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hello-clusters notebook: Perform and evaluate clustering #874
base: main
Are you sure you want to change the base?
hello-clusters notebook: Perform and evaluate clustering #874
Conversation
… some TODO comments to come back to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level comments:
I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.
My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.
I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example if bootstrapping and calculating calculating statistics on each round) you can end up causing trouble.
|
||
All functions presented in this section take the following required arguments: | ||
|
||
* An SCE or Seurat object that contains principal components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we also just pass in the PCA matrix?
A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include. |
…s for pca names, and add missing wording
Great call, incoming.
I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage? |
…glmGamPoi is now needed in renv
This is now ready for another look! Changes broadly include:
Here is the current version of the rendered notebook: |
Small question here: I'm on the fence for keeping |
This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo? |
Hardcoding the seed here seems fine. |
Boo, sorry, I'll regenerate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good.
I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.
I also suggest moving all the Seurat
content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.
Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.
```{r set seed} | ||
set.seed(2024) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd usually do this after setting the paths.
```{r create process seurat, message = FALSE} | ||
# Convert to a Seurat object | ||
seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA") | ||
|
||
# Output files | ||
# Process the object with a Seurat pipeline to obtain clusters | ||
seurat_obj <- seurat_obj |> | ||
SCTransform() |> | ||
RunPCA() |> | ||
FindNeighbors() |> | ||
FindClusters() | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably do this with the Seurat
portion.
|
||
Organize the remainder of your content into sections and subsections as appropriate for your analysis. | ||
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters. | ||
We'll show how to do this for both SCE and Seurat objects: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know that you need to show it for Seurat
?
|
||
Organize the remainder of your content into sections and subsections as appropriate for your analysis. | ||
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this?
We'll also extract its PCA matrix, which we'll use to demonstrate how to calculate and evaluate clusters. | |
For the initial cluster calculations and evaluations, we will use the PCA matrix extracted from the SCE object. | |
We could also use the SCE object or a Seurat object directly, which we will demonstrate later. |
# Print resulting table | ||
cluster_results_df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to print the whole table? It is kind of a lot. Maybe just the first few rows?
|
||
|
||
|
||
## Calculate QC metrics on existing clusters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this to after the object section and demonstrate these using the objects themselves.
`rOpenScPCA` assumes that the PCA matrix is named `PCA` in SCE objects, and `pca` in Seurat objects (as they were in the above examples). | ||
If the PCA matrix you want to use in the object has a different name, you can provide the argument `pc_name`. | ||
We can see this below with an SCE object, for example: | ||
|
||
```{r use pc_name} | ||
# First, we'll rename the PCA matrix for demonstration | ||
reducedDimNames(sce) <- c("PCA_matrix", "UMAP") | ||
reducedDimNames(sce) | ||
|
||
# Calculate clusters from an SCE object using default parameters | ||
cluster_results_df <- calculate_clusters( | ||
sce, | ||
pc_name = "PCA_matrix" | ||
) | ||
cluster_results_df | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a detail we don't need to demonstrate. I would remove this section, or at least the code for it.
```{r rename columns} | ||
# First, rename columns in `cluster_results_df` to avoid ambiguity | ||
cluster_results_renamed_df <- cluster_results_df |> | ||
# add the prefix openscpca_ to all columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# add the prefix openscpca_ to all columns | |
# add the prefix "ropenscpca_" to all columns |
# First, rename columns in `cluster_results_df` to avoid ambiguity | ||
cluster_results_renamed_df <- cluster_results_df |> | ||
# add the prefix openscpca_ to all columns | ||
dplyr::rename_with(~ paste0("ropenscpca_", .x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modern R
dplyr::rename_with(~ paste0("ropenscpca_", .x)) | |
dplyr::rename_with(\(x) paste0("ropenscpca_", x)) |
|
||
```{r add to seurat} | ||
# Add the cluster result data frame to the Seurat object metadata | ||
seurat_obj <- AddMetaData(seurat_obj, cluster_results_renamed_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Seurat assume that the data frame is in the same order? If so, we might want to caution about that.
In the interests of simplicity, you might also want to show making that assumption above, in which case you can first check that it is true with all(colnames(sce) == cluster_results_df$cell_id)
or something like that, then just add the clusers more simply with scr$ropenscpca_clusters <- cluster_results_df$cluster
. I don't think we really need to show adding all of the cluster parameters?
Closes #796
This PR adds the first notebook to the
hello-clusters
module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip
In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.