Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to perform graph-based clustering #749

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
5fba017
initial clustering function
sjspielman Sep 3, 2024
116c73d
add bluster and roxygen. Update column name to cell_id
sjspielman Sep 3, 2024
3ad4f49
implement test of defaults and debug
sjspielman Sep 3, 2024
8df9716
Merge branch 'main' into sjspielman/745-add-clustering-function
sjspielman Sep 3, 2024
5eff758
add testthat file
sjspielman Sep 4, 2024
3812826
add tibble
sjspielman Sep 4, 2024
f4b2337
Merge branch 'main' into sjspielman/745-add-clustering-function
sjspielman Sep 4, 2024
0305faa
some comments for future us
sjspielman Sep 4, 2024
f6b736d
Apply suggestions from code review
sjspielman Sep 9, 2024
148ae00
Merge branch 'main' into sjspielman/745-add-clustering-function
sjspielman Sep 9, 2024
5b48321
rm .gitkeep
sjspielman Sep 9, 2024
29c7868
fix quote
sjspielman Sep 9, 2024
15fea48
reverting since resolution_parameter has been superceded
sjspielman Sep 9, 2024
b168a01
use match.args for algorithm and weighting
sjspielman Sep 9, 2024
0a433b8
use data.frame and update docs
sjspielman Sep 9, 2024
288842a
rm tibble dependency
sjspielman Sep 9, 2024
ed93f85
return data frame with all parameters and update docs. update cluster…
sjspielman Sep 9, 2024
d7abb39
update test for all df columns
sjspielman Sep 9, 2024
e4bf3de
dont need to handle errors with match.args
sjspielman Sep 9, 2024
3498350
add tests to handle bad args, but not match.args ones
sjspielman Sep 9, 2024
9db71da
add examples
sjspielman Sep 9, 2024
d576509
rerun document
sjspielman Sep 9, 2024
52fd2d1
add tests for additional args
sjspielman Sep 9, 2024
c11cedc
add dplyr dep
sjspielman Sep 9, 2024
e35bdf6
update error
sjspielman Sep 9, 2024
b724ebb
document
sjspielman Sep 9, 2024
33127ae
Merge branch 'feature/ropenscpca' into sjspielman/745-add-clustering-…
jashapiro Sep 9, 2024
4d1a91f
Apply suggestions from code review
sjspielman Sep 10, 2024
4daa7e5
remove .gitkeep again, got restored with the main merge
sjspielman Sep 10, 2024
835b7ee
expect_s3_class
sjspielman Sep 10, 2024
d8f5b31
combine stopifnots and update to allNames
sjspielman Sep 10, 2024
329180d
dont check the specific error
sjspielman Sep 10, 2024
37d4973
use expect_equal instead
sjspielman Sep 10, 2024
662c33f
rm comment
sjspielman Sep 10, 2024
fe022d5
expand docs for non-default args, and add number weighting
sjspielman Sep 10, 2024
7075b17
add test for single length args
sjspielman Sep 10, 2024
c89e3b9
update DESCRIPTION and check appeasement
sjspielman Sep 10, 2024
e536f02
Update packages/rOpenScPCA/R/calculate-clusters.R
sjspielman Sep 10, 2024
bcc9254
Apply suggestions from code review
sjspielman Sep 11, 2024
d904171
docs and syntax updates
sjspielman Sep 11, 2024
0cdfe03
remove purrr dependency
sjspielman Sep 11, 2024
e7b22cc
update test to expect_error
sjspielman Sep 11, 2024
b2b4cbe
Apply suggestions from code review
sjspielman Sep 11, 2024
b0a9065
wording update to be consistent with bluster docs wording
sjspielman Sep 11, 2024
9a8a4b1
document
sjspielman Sep 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions packages/rOpenScPCA/DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -22,3 +22,7 @@ Suggests:
testthat (>= 3.0.0)
Config/testthat/edition: 3
RoxygenNote: 7.3.2
Imports:
bluster,
dplyr,
methods
Empty file removed packages/rOpenScPCA/R/.gitkeep
Empty file.
108 changes: 108 additions & 0 deletions packages/rOpenScPCA/R/calculate-clusters.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
#' Calculate graph-based clusters from a provided matrix
#'
sjspielman marked this conversation as resolved.
Show resolved Hide resolved
#' This function is provided to simplify application of bluster package clustering functions on OpenScPCA data.
#' In particular, this function runs bluster::clusterRows() with the bluster::NNGraphParam() function.
#' Note that defaults for some arguments may differ from the bluster::NNGraphParam() defaults.
#' Specifically, the clustering algorithm defaults to "louvain" and the weighting scheme to "jaccard"
#' to align with common practice in scRNA-seq analysis.
#'
#' @param mat Matrix, usually of PCs, where each row is a cell. Matrix must have rownames of cell ids (e.g., barcodes)
#' @param algorithm Clustering algorithm to use. Must be one of "louvain" (default), "walktrap", or "leiden".
#' @param weighting Weighting scheme to use. Must be one of "jaccard" (default), "rank", or "number"
#' @param nn Number of nearest neighbors. Default is 10.
#' @param resolution Resolution parameter used by louvain and leiden clustering only. Default is 1.
#' @param objective_function Leiden-specific parameter for whether to use the Constant Potts Model ("CPM"; default) or "modularity"
#' @param seed Random seed to set for clustering. Default is 2024.
#' @param cluster_args List of additional arguments to pass to the chosen clustering function.
#' Only single values for each argument are supported (no vectors or lists).
#' See igraph documentation for details on each clustering function: https://igraph.org/r/html/latest
#'
#' @return A data frame of cluster results with columns `cell_id` and `cluster`. Additional columns represent algorithm parameters
#' and include at least: `algorithm`, `weighting`, and `nn`. Louvain and leiden clustering will also include `resolution`, and
#' leiden clustering will further include `objective_function`.
#'
#' @export
#'
#' @examples
#' \dontrun{
#' # cluster using default parameters
#' cluster_df <- calculate_clusters(pca_matrix)
#'
#' # cluster using the leiden algorithm with a resolution of 0.1
#' cluster_df <- calculate_clusters(pca_matrix, algorithm = "leiden", resolution = 0.1)
#'
#' # cluster using the leiden algorithm with a non-default of 3 iterations
#' cluster_df <- calculate_clusters(
#' pca_matrix,
#' algorithm = "leiden",
#' cluster_args = list(n_iterations = 3)
#' )
#' }
calculate_clusters <- function(
mat,
algorithm = c("louvain", "walktrap", "leiden"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep walktrap as the default for consistency with the underlying package?

Suggested change
algorithm = c("louvain", "walktrap", "leiden"),
algorithm = c("walktrap", "louvain", "leiden"),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel really torn about this actually (and similarly the weighting param re-order)... I agree about consistency with bluster, but it makes me nervous since these defaults don't correspond to the way most people will want to use the functions (...aka seurat defaults) so it seems potentially less user-friendly. On the other hand this would force folks to be explicit, which is a good thing!

Please give me a further push to be convinced to accept this 😃

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a very strong preference, aside from that if you change the defaults from the underlying method that should be made explicit in the docs. In particular, I would add a description after the roxygen heading explaining the motivation for this function and the way it differs from the bluster default/underlying implementation.

weighting = c("jaccard", "rank", "number"),
nn = 10,
resolution = 1, # louvain or leiden
objective_function = c("CPM", "modularity"), # leiden only
cluster_args = list(),
seed = NULL) {
if (!is.null(seed)) {
set.seed(seed)
}

# Check input arguments
stopifnot(
"The `mat` argument must be a matrix." = any(class(mat) %in% c("matrix", "Matrix")),
"The `mat` matrix must have row names representing cell ids, e.g. barcodes." = is.character(rownames(mat)),
"`resolution` must be numeric" = is.numeric(resolution),
"`nn` must be numeric" = is.numeric(nn)
)

algorithm <- match.arg(algorithm)
weighting <- match.arg(weighting)
objective_function <- match.arg(objective_function)

if (length(cluster_args)) {
stopifnot(
"`cluster_args` must be a named list." = is.list(cluster_args) && !("" %in% methods::allNames(cluster_args)),
"`cluster_args` elements must all have only a single value" = all(sapply(cluster_args, length) == 1)
)
}

# Update cluster_args list with parameters that users can directly provide
# note that clusterRows throws an error if this list has a param not used by the chosen algorithm
if (algorithm %in% c("louvain", "leiden")) {
cluster_args$resolution <- resolution
}
if (algorithm == "leiden") {
cluster_args$objective_function <- objective_function
}


# Perform clustering
clusters <- bluster::clusterRows(
mat,
bluster::NNGraphParam(
k = nn,
type = weighting,
cluster.fun = algorithm,
cluster.args = cluster_args
)
)


# Transform results into a table and return
cluster_df <- data.frame(
cell_id = rownames(mat),
cluster = clusters,
algorithm = algorithm,
weighting = weighting,
nn = nn
) |>
dplyr::bind_cols(
data.frame(cluster_args)
)

return(cluster_df)
}
64 changes: 64 additions & 0 deletions packages/rOpenScPCA/man/calculate_clusters.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 12 additions & 0 deletions packages/rOpenScPCA/tests/testthat.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# This file is part of the standard setup for testthat.
# It is recommended that you do not modify it.
#
# Where should you do additional test configuration?
# Learn more about the roles of various files in:
# * https://r-pkgs.org/testing-design.html#sec-tests-files-overview
# * https://testthat.r-lib.org/articles/special-files.html

library(testthat)
library(rOpenScPCA)

test_check("rOpenScPCA")
78 changes: 78 additions & 0 deletions packages/rOpenScPCA/tests/testthat/test-calculate-clusters.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
test_mat <- matrix(
runif(1000, -3, 3),
nrow = 100,
ncol = 10
)
rownames(test_mat) <- as.character(1:100)

test_that("calculate_clusters runs with defaults", {
cluster_df <- calculate_clusters(test_mat)

expect_equal(
names(cluster_df),
c("cell_id", "cluster", "algorithm", "weighting", "nn", "resolution")
)
expect_equal(
cluster_df$cell_id,
as.character(1:100)
)

expect_s3_class(
cluster_df$cluster,
"factor"
)

expect_equal(
unique(cluster_df$algorithm),
"louvain"
)
expect_equal(
unique(cluster_df$weighting),
"jaccard"
)
expect_equal(
unique(cluster_df$nn),
10
)
expect_equal(
unique(cluster_df$resolution),
1
)
})


test_that("calculate_clusters runs with additional cluster_args", {
cluster_df <- calculate_clusters(
test_mat,
algorithm = "leiden",
cluster_args = list(n_iterations = 3)
)

expect_setequal(
names(cluster_df),
c("cell_id", "cluster", "algorithm", "weighting", "nn", "resolution", "objective_function", "n_iterations")
)
expect_equal(
unique(cluster_df$n_iterations),
3
)
})




test_that("calculate_clusters errors as expected", {
test_mat_nonames <- test_mat
rownames(test_mat_nonames) <- NULL

expect_error(calculate_clusters(test_mat_nonames))
expect_error(calculate_clusters("not a matrix"))
expect_error(calculate_clusters(test_mat, resolution = "string"))
expect_error(calculate_clusters(test_mat, nn = "string"))
expect_error(
calculate_clusters(
test_mat,
cluster_args = list(too_long = 1:10)
)
)
})