From 33e0a4ec8c32ca5a8a688ff2e4970f5c945b9d31 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Wed, 24 Jan 2024 16:23:49 -0800 Subject: [PATCH 01/13] Update README.md --- README.md | 193 ++---------------------------------------------------- 1 file changed, 7 insertions(+), 186 deletions(-) diff --git a/README.md b/README.md index 4941672..dfd8683 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,9 @@ # gssnng +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). + + +This package works with Scanpy AnnData objects stored as h5ad files. + **Try it out! ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) @@ -6,11 +11,10 @@ **See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) -Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). +and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) -The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, -which in turn makes gene set scoring difficult. +More info: The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. @@ -19,186 +23,3 @@ Nearest neighbor graphs (NNG) are constructed based on user defined groups (see The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be constructed within each cluster or jointly by cluster *and* sample. Smoothing can be performed using either the adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells. - -This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. -For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. -Gene sets can be provided by using .gmt files or through the OmniPath API (see below). - -Scoring functions work with ranked or unranked data (**"your mileage may vary"**): - -Method references (singscore, RBO) are below. - -Some methods have additional parameters, see below! - -``` - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. - - singscore: Normalised mean (median centered) ranks (requires ranked data) - - ssgsea: The well known single sample GSEA. - - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. - - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). - - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - - average_score: Mean ranks or counts - - median_score: Median of counts or ranks - - summed_up: Sum up the ranks or counts. - -``` - -## Installation from PyPI -``` -pip3 install gssnng -``` - - -## Installation from GitHub - -``` -# also gets you the demo data and some gene sets. -git clone https://github.com/IlyaLab/gssnng - -pip install -e gssnng -``` - -## Example script - -Copy the script out from the cloned repo and run, check the paths if you get an error. - -``` - cp gssnng/gssnng/test/example_script.py . - - python3 example_script.py -``` - - -## Usage - -See gssnng/notebooks for examples on all methods - -1. Read in an AnnData object using scanpy (an h5ad file). - -2. Get your gene sets formatted as a .gmt file. (default is undirected, can take _UP, _DN, and split gene sets _UP+_DN) - -3. Score cells, each gene set will show up as a column in adata.obs. - -``` -from gssnng import score_cells - -q = sc.datasets.pbmc3k_processed() - -score_cells.with_gene_sets(adata=adata, # AnnData object - gene_set_file='cibersort_lm22.gmt', # File path of gene sets - groupby='louvain', # Will sample neighbors within this group - smooth_mode='connectivity', # Smooths matrix using distance weights from NN graph. - recompute_neighbors=32, # Rebuild nearest neighbor graph with groups, 0 turns off function - score_method='singscore', # Method of scoring - method_params={'normalization':'theoretical'}, # Special parameters for some methods - ranked=True, # Use ranked data, True or False - cores=8) # Groups are scored in parallel. - - -sc.pl.umap(q, color=['louvain','T.cells.CD8.up'], wspace=0.35) -``` - -## Parameters - - adata: AnnData object from scanpy.read_* - AnnData containing the cells to be scored - - gene_set_file: str[path] - The gene set file with list of gene sets, gmt, one per line. See `this definition `_ . - - groupby: [str, list, dict] - either a column label in adata.obs, and all categories taken, or a dict specifies one group. - SEE DESCRIPTION BELOW - - smooth_mode: "adjacency", "connectivity", or "off" - Dictates how to use the neighborhood graph. - `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more, `off` does no smoothing. - - recompute_neighbors: int - should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N - - score_method: str - which scoring method to use - - method_params: dict - python dict with XGBoost params. - - ranked: bool - whether the gene expression counts should be rank ordered - - cores: int - number of parallel processes to work through groupby groups - -## Groupby - -The group of cells available for constructing the neighborhood graph can be controlled by using the groupby parameter. In the example -above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the -neighborhood and will available for sampling. - -Groupby specifies a column name (or list of columns) that's found in the AnnData.obs table. -In that case, cells will be grouped as the intersection of categories. For example, using groupby=['louvain','phenotype'] -will take cells that are first in a given louvain cluster and then also in a given phenotype group. By also setting -the recompute_neighbors, the nearest neighbor graph is recomputed within this subset of cells. Controlling the -neighborhood leads to more controlled smoothing of the count matrix and is more suitable for downstream comparisons. - -## Gene sets - -We are following the MSigDB nomenclature, where gene sets default to undirected, but can be marked with the suffix "_UP" -(example: CD8_signature_UP or CD8.signature.up). In this case, when data is ranked, genes with higher expression have larger ranks. If the -gene set has suffix "_DN" (example: CD8_signature_DN or CD8.signature.dn), then lowest expressed genes have largest ranks. In the -use of singscore or Z scores, the undirected case is based on absolute values, so either direction, in the extreme, will result in a large score. - -To access OmniPath signatures, see the ["decoupler-style"](https://github.com/IlyaLab/gssnng/blob/main/notebooks/gssnng_decoupler.ipynb) notebook. - -## Method options - -Some methods have some additional options. They are passed as a dictionary, method_params={param_name: param_value}. - - geneset_overlap: {'threshold': 0, 'fraction': False} - -The geneset overlap compares the smoothed value or rank of geneset genes to the threshold. If ranked=False, then ranks are used. -If smooth_mode is set to "off", this method can be used with untransformed counts. If 'percent' is set to True, then -the overlap will be given as a percentage of the total number of genes in the gene set. - - singscore: {'normalization': 'theoretical'}, {'normalization': 'standard'} - -The singscore manuscript describes the theoretical method of standarization which involves determining the theoretical max and minimum ranks for the given gene set. - - ssGSEA: {'omega': 0.75} - -The ssGSEA method uses this parameter as a exponent to the ranks. It has been strongly suggested to use 0.75. - - rank_biased_overlap: {'rbo_depth': n} (n: int) - -Here, n is the depth that we decend down the ranks, where at each step, the overlap with the gene set is measured and added to the score. - - -*The following methods do not have additional options.* - - robust_std - mean_z - average_score - median_score - summed_up - -## References - -rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf - -singscore: https://pubmed.ncbi.nlm.nih.gov/30400809/ - -ssGSEA: https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html - -anndata: https://anndata.readthedocs.io/en/latest/ - -MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/ - -decoupler: https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613 \ No newline at end of file From 46b54ef4c6a9fc936123d93bd66234f7d4763cb9 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Wed, 24 Jan 2024 16:25:29 -0800 Subject: [PATCH 02/13] Update README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index dfd8683..166bb6f 100644 --- a/README.md +++ b/README.md @@ -5,13 +5,13 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq This package works with Scanpy AnnData objects stored as h5ad files. -**Try it out! ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) + * **Try it out! ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) -**Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) + * **Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) -**See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) + * **See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) -and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) + * and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) More info: From d553a62662451ac35ddf4c2119bb5640965767c1 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Wed, 24 Jan 2024 16:26:01 -0800 Subject: [PATCH 03/13] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 166bb6f..2a2ccfa 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ This package works with Scanpy AnnData objects stored as h5ad files. * and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) -More info: + The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. From 14b37805cd28b134874b5c29c2d56937cd4b686f Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Wed, 24 Jan 2024 16:45:53 -0800 Subject: [PATCH 04/13] Create main_index.rst added main page --- docs/main_index.rst | 35 +++++++++++++++++++++++++++++++++++ 1 file changed, 35 insertions(+) create mode 100644 docs/main_index.rst diff --git a/docs/main_index.rst b/docs/main_index.rst new file mode 100644 index 0000000..1c74866 --- /dev/null +++ b/docs/main_index.rst @@ -0,0 +1,35 @@ +gssnng +====== + +Try it out! ===>>> `Open In Colab `_ + +Decoupler/Omnipath style API ===>>> `Open In Colab `_ + +See the paper ===>>> `gssnng `_ + +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). + +The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. +The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. +Nearest neighbor graphs (NNG) are constructed based on user defined groups (see the 'groupby' parameter below). The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be constructed within each cluster or jointly by cluster and sample. Smoothing can be performed using either the adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells. + +This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. Gene sets can be provided by using .gmt files or through the OmniPath API (see below). + +Scoring functions work with ranked or unranked data ("your mileage may vary") + +// `Link text `_ +// Check out the :doc:`usage` section for further information, including +// how to :ref:`installation` the project. + +.. note:: + + This project is under active development. Please consider using a named release. + +Contents +-------- + +.. toctree:: + usage + with gmt gene set files + with a decoupler/omnipath style + api From 340d2db141244324f273a0bf94e16420301338a0 Mon Sep 17 00:00:00 2001 From: gibbsdavidl Date: Fri, 2 Feb 2024 16:32:50 -0800 Subject: [PATCH 05/13] update to docs --- README.md | 4 +- docs/decoupler_api.rst | 228 +++++++++++++++++ docs/gmt_files_doc.rst | 225 +++++++++++++++++ docs/index.rst | 236 ++---------------- docs/main_index.rst | 35 --- gssnng/test/example_decoupler_omnipath_api.py | 37 +++ 6 files changed, 508 insertions(+), 257 deletions(-) create mode 100644 docs/decoupler_api.rst create mode 100644 docs/gmt_files_doc.rst delete mode 100644 docs/main_index.rst create mode 100644 gssnng/test/example_decoupler_omnipath_api.py diff --git a/README.md b/README.md index 2a2ccfa..6bc85f9 100644 --- a/README.md +++ b/README.md @@ -5,9 +5,9 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq This package works with Scanpy AnnData objects stored as h5ad files. - * **Try it out! ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) + * **Notebook using gmt files ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) - * **Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) + * **Notebook using Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) * **See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) diff --git a/docs/decoupler_api.rst b/docs/decoupler_api.rst new file mode 100644 index 0000000..820c853 --- /dev/null +++ b/docs/decoupler_api.rst @@ -0,0 +1,228 @@ +.. GSSNNG documentation master file, created by +sphinx-quickstart on Wed Apr 27 09:20:15 2022. +You can adapt this file completely to your liking, but it should at least +contain the root `toctree` directive. + +gssnng using the decoupler-omnipath api +================================== + +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). + +.. + .. toctree:: + :caption: Table of Contents + :maxdepth: 2 + + Installation + Scoring Functions + Example script + Usage + Parameters + Groupby + Gene sets + References + + +`**Notebook using gmt files** `_ + +`**Notebook using Decoupler/Omnipath style API** `_ + +`**See the paper** `_ + +This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. +For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. +Gene sets can be provided by using .gmt files or through the OmniPath API (see below). + +Scoring functions work with ranked or unranked data (**"your mileage may vary"**): + +Method references (singscore, RBO) are below. + +Some methods have additional parameters, see below! + + +Installation +============ + +Install the package using the following commands:: + + pip3 install gssnng + + + +Installation from GitHub +======================== + + git clone https://github.com/IlyaLab/gssnng + + pip install -e gssnng + + + +Scoring Functions +================= + +The list of scoring functions:: + + geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + + singscore: Normalised mean (median centered) ranks (requires ranked data) + + ssGSEA: Single sample GSEA based on ranked data. + + rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + + robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + + mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + + average_score: Mean ranks or counts + + median_score: Median of counts or ranks + + summed_up: Sum up the ranks or counts. + + + + +Example script +============== + +Copy the script out from the cloned repo and run, check the paths if you get an error.:: + + cp gssnng/gssnng/test/example_decoupler_omnipath_api.py . + + python3.10 example_decoupler_omnipath_api.py + + +Usage +====== + +See gssnng/notebooks for examples on all methods. + +1. Read in an AnnData object using scanpy (an h5ad file). + +2. Get the model from omnipath via the decoupler API. + +3. Score cells, each gene set will show up as a column in adata.obs. + +.. code-block:: + + from gssnng import score_cells + + q = sc.datasets.pbmc3k_processed() + + model = dc.get_progeny().query('weight>0') + + score_cells.run_gssnng( + adata, model, + source='source',target='target', weight='weight', + groupby="louvain", # None + smooth_mode='connectivity', + recompute_neighbors=32, + score_method="mean_z", + method_params={}, # 'normalization':'standard' + ranked=False, + cores=6 + ) + + #Extracts activities as AnnData object. + acts_gss = dc.get_acts(adata, obsm_key='gssnng_estimate') + + sc.pl.umap(acts_gss, color=sorted(acts_gss.var_names), cmap='coolwarm') + + +Parameters +========== + +These parameters are used with the "scores_cells.with_gene_sets" function.:: + + adata: AnnData object from scanpy.read_* + AnnData containing the cells to be scored + + model: str + The decoupler gene set model. See Omnipath Wrappers (https://saezlab.github.io/decoupleR/reference/index.html#omnipath-wrappers). + + groupby: [str, list, dict] + either a column label in adata.obs, and all categories taken, or a dict specifies one group. + SEE DESCRIPTION BELOW + + smooth_mode: "adjacency", "connectivity", or "off" + Dictates how to use the neighborhood graph. + `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more + + recompute_neighbors: int + should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N + + score_method: str + which scoring method to use + + method_params: dict + python dict with XGBoost params. + + ranked: bool + whether the gene expression counts should be rank ordered + + cores: int + number of parallel processes to work through groupby groups + + +Groupby +======= + +The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example +above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the +neighborhood and will available for sampling. + +Groupby specifies a column name that's found in the AnnData.obs table, and it can also take a list of column names. +In that case, cells will be grouped as the intersection of categories. For example, using groupby=['louvain','phenotype'] +will take cells that are first in a given louvain cluster and then also in a given phenotype group. By also setting +the recompute_neighbors, the nearest neighbor graph is recomputed within this subset of cells. Controlling the +neighborhood leads to more controlled smoothing of the count matrix and is more suitable for downstream comparisons. + + +Gene sets +========= + +We are following the Omnipath wrapper APIs supplied by Decoupler. +See: https://saezlab.github.io/decoupleR/reference/index.html#omnipath-wrappers for available gene sets. + +## Method options + +Some methods have some additional options. They are passed as a dictionary, method_params={param_name, param_value}.:: + + singscore: {'normalization', 'theoretical'}, {'normalization', 'standard'} + +The singscore manuscript describes the theoretical method of standarization which involves determining the theoretical max and minimum ranks for the given gene set.:: + + rank_biased_overlap: {'rbo_depth', n} (n: int) + +Here, n is the depth that is decended down the ranks, where at each step, the overlap with the gene set is measured and added to the score.:: + + ssGSEA: {'omega': 0.75} + +The ssGSEA method uses this parameter as a exponent to the ranks. It has been strongly suggested to use 0.75. + +*The following methods do not have additional options.* + + robust_std + mean_z + average_score + median_score + summed_up + +References +========== + +rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf + +singscore: https://pubmed.ncbi.nlm.nih.gov/30400809/ + +anndata: https://anndata.readthedocs.io/en/latest/ + +MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/ + +ssGSEA: https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html + +decoupler: https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613 + +omnipath: https://omnipathdb.org/ \ No newline at end of file diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst new file mode 100644 index 0000000..fc22402 --- /dev/null +++ b/docs/gmt_files_doc.rst @@ -0,0 +1,225 @@ +.. GSSNNG documentation master file, created by +sphinx-quickstart on Wed Apr 27 09:20:15 2022. +You can adapt this file completely to your liking, but it should at least +contain the root `toctree` directive. + +gssnng using gmt files +================================== + +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). + +.. + .. toctree:: + :caption: Table of Contents + :maxdepth: 2 + + Installation + Scoring Functions + Example script + Usage + Parameters + Groupby + Gene sets + References + + +`**Notebook using gmt files** `_ + +`**Notebook using Decoupler/Omnipath style API** `_ + +`**See the paper** `_ + + +This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. +For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. +Gene sets can be provided by using .gmt files or through the OmniPath API (see below). + +Scoring functions work with ranked or unranked data (**"your mileage may vary"**): + +Method references (singscore, RBO) are below. + +Some methods have additional parameters, see below! + + +Installation +============ + +Install the package using the following commands:: + + pip3 install gssnng + + + +Installation from GitHub +======================== + + git clone https://github.com/IlyaLab/gssnng + + pip install -e gssnng + + + +Scoring Functions +================= + +The list of scoring functions:: + + geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + + singscore: Normalised mean (median centered) ranks (requires ranked data) + + ssGSEA: Single sample GSEA based on ranked data. + + rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + + robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + + mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + + average_score: Mean ranks or counts + + median_score: Median of counts or ranks + + summed_up: Sum up the ranks or counts. + + + + +Example script +============== + +Copy the script out from the cloned repo and run, check the paths if you get an error.:: + + cp gssnng/gssnng/test/example_script.py . + + python3.10 example_script.py + + +Usage +====== + +See gssnng/notebooks for examples on all methods. + +1. Read in an AnnData object using scanpy (an h5ad file). + +2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN) + +3. Score cells, each gene set will show up as a column in adata.obs. + +.. code-block:: + + from gssnng import score_cells + + q = sc.datasets.pbmc3k_processed() + + scores_cells.with_gene_sets(adata=q, # AnnData object + gene_set_file='cibersort_lm22.gmt', # File path of gene sets + groupby='louvain', # Will sample neighbors within this group, can take a list + smooth_mode='connectivity', # Smooths matrix using distance weights from NN graph. + recompute_neighbors=32, # Rebuild nearest neighbor graph with groups, 0 turns off function + score_method='singscore', # Method of scoring + method_params={'normalization':'theoretical'}, # Special parameters for some methods + ranked=True, # Use ranked data, True or False + cores=8) # Groups are scored in parallel. + + sc.pl.umap(q, color=['louvain','T.cells.CD8.up'], wspace=0.35) + + +Parameters +========== + +These parameters are used with the "scores_cells.with_gene_sets" function.:: + + adata: AnnData object from scanpy.read_* + AnnData containing the cells to be scored + + gene_set_file: str[path] + The gene set file with list of gene sets, gmt, one per line. See `this definition `_ . + + groupby: [str, list, dict] + either a column label in adata.obs, and all categories taken, or a dict specifies one group. + SEE DESCRIPTION BELOW + + smooth_mode: "adjacency", "connectivity", or "off" + Dictates how to use the neighborhood graph. + `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more + + recompute_neighbors: int + should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N + + score_method: str + which scoring method to use + + method_params: dict + python dict with XGBoost params. + + ranked: bool + whether the gene expression counts should be rank ordered + + cores: int + number of parallel processes to work through groupby groups + + +Groupby +======= + +The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example +above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the +neighborhood and will available for sampling. + +Groupby specifies a column name that's found in the AnnData.obs table, and it can also take a list of column names. +In that case, cells will be grouped as the intersection of categories. For example, using groupby=['louvain','phenotype'] +will take cells that are first in a given louvain cluster and then also in a given phenotype group. By also setting +the recompute_neighbors, the nearest neighbor graph is recomputed within this subset of cells. Controlling the +neighborhood leads to more controlled smoothing of the count matrix and is more suitable for downstream comparisons. + + +Gene sets +========= + +We are following the MSigDB nomenclature, where gene sets default to up, but can have direction specified with the suffix "_UP" +(example: CD8_signature_UP or CD8.signature.up). If the gene set name has suffix "_DN" (example: CD8_signature_DN or +CD8.signature.dn), then low expressed genes will have large ranks and produce positive scores. +In the use of singscore or Z scores, the undirected case is based on absolute values, so either direction, +in the extreme, will result in a large score. + +## Method options + +Some methods have some additional options. They are passed as a dictionary, method_params={param_name, param_value}.:: + + singscore: {'normalization', 'theoretical'}, {'normalization', 'standard'} + +The singscore manuscript describes the theoretical method of standarization which involves determining the theoretical max and minimum ranks for the given gene set.:: + + rank_biased_overlap: {'rbo_depth', n} (n: int) + +Here, n is the depth that is decended down the ranks, where at each step, the overlap with the gene set is measured and added to the score.:: + + ssGSEA: {'omega': 0.75} + +The ssGSEA method uses this parameter as a exponent to the ranks. It has been strongly suggested to use 0.75. + +*The following methods do not have additional options.* + + robust_std + mean_z + average_score + median_score + summed_up + +References +========== + +rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf + +singscore: https://pubmed.ncbi.nlm.nih.gov/30400809/ + +anndata: https://anndata.readthedocs.io/en/latest/ + +MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/ + +ssGSEA: https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html + +decoupler: https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613 + +omnipath: https://omnipathdb.org/ \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index db512eb..54747c2 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,236 +1,32 @@ -.. GSSNNG documentation master file, created by - sphinx-quickstart on Wed Apr 27 09:20:15 2022. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - gssnng -================================== - -Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). - -.. toctree:: - :caption: Table of Contents - :maxdepth: 2 - - Installation - Scoring Functions - Example script - Usage - Parameters - Groupby - Gene sets - References - - -`**Try it out!** `_ - -`**Decoupler/Omnipath style API** `_ - -`**See the paper** `_ - - -Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). - -The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, -which in turn makes gene set scoring difficult. - -The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates -mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring -methods often associated with bulk RNA-seq. - -Nearest neighbor graphs (NNG) are constructed based on user defined groups (see the 'groupby' parameter below). -The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be -constructed within each cluster or jointly by cluster *and* sample. Smoothing can be performed using either the -adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells. - -This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. -For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. -Gene sets can be provided by using .gmt files or through the OmniPath API (see below). - -Scoring functions work with ranked or unranked data (**"your mileage may vary"**): - -Method references (singscore, RBO) are below. - -Some methods have additional parameters, see below! - - -Installation -============ - -Install the package using the following commands:: - - pip3 install gssnng - - - -Installation from GitHub -======================== - - git clone https://github.com/IlyaLab/gssnng - - pip install -e gssnng - - - -Scoring Functions -================= - -The list of scoring functions:: - - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. - - singscore: Normalised mean (median centered) ranks (requires ranked data) - - ssGSEA: Single sample GSEA based on ranked data. - - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. - - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). - - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - - average_score: Mean ranks or counts - - median_score: Median of counts or ranks - - summed_up: Sum up the ranks or counts. - - - - -Example script -============== - -Copy the script out from the cloned repo and run, check the paths if you get an error.:: - - cp gssnng/gssnng/test/example_script.py . - - python3.10 example_script.py - - -Usage ====== -See gssnng/notebooks for examples on all methods. - -1. Read in an AnnData object using scanpy (an h5ad file). - -2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN) - -3. Score cells, each gene set will show up as a column in adata.obs. - -.. code-block:: - - from gssnng import score_cells - - q = sc.datasets.pbmc3k_processed() - - scores_cells.with_gene_sets(adata=q, # AnnData object - gene_set_file='cibersort_lm22.gmt', # File path of gene sets - groupby='louvain', # Will sample neighbors within this group, can take a list - smooth_mode='connectivity', # Smooths matrix using distance weights from NN graph. - recompute_neighbors=32, # Rebuild nearest neighbor graph with groups, 0 turns off function - score_method='singscore', # Method of scoring - method_params={'normalization':'theoretical'}, # Special parameters for some methods - ranked=True, # Use ranked data, True or False - cores=8) # Groups are scored in parallel. - - sc.pl.umap(q, color=['louvain','T.cells.CD8.up'], wspace=0.35) - - -Parameters -========== - -These parameters are used with the "scores_cells.with_gene_sets" function.:: - - adata: AnnData object from scanpy.read_* - AnnData containing the cells to be scored - - gene_set_file: str[path] - The gene set file with list of gene sets, gmt, one per line. See `this definition `_ . - - groupby: [str, list, dict] - either a column label in adata.obs, and all categories taken, or a dict specifies one group. - SEE DESCRIPTION BELOW - - smooth_mode: "adjacency" or "connectivity", - Dictates how to use the neighborhood graph. - `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more - - recompute_neighbors: int - should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N - - score_method: str - which scoring method to use - - method_params: dict - python dict with XGBoost params. - - ranked: bool - whether the gene expression counts should be rank ordered - - cores: int - number of parallel processes to work through groupby groups - - -Groupby -======= - -The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example -above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the -neighborhood and will available for sampling. - -Groupby specifies a column name that's found in the AnnData.obs table, and it can also take a list of column names. -In that case, cells will be grouped as the intersection of categories. For example, using groupby=['louvain','phenotype'] -will take cells that are first in a given louvain cluster and then also in a given phenotype group. By also setting -the recompute_neighbors, the nearest neighbor graph is recomputed within this subset of cells. Controlling the -neighborhood leads to more controlled smoothing of the count matrix and is more suitable for downstream comparisons. - - -Gene sets -========= - -We are following the MSigDB nomenclature, where gene sets default to up, but can have direction specified with the suffix "_UP" -(example: CD8_signature_UP or CD8.signature.up). If the gene set name has suffix "_DN" (example: CD8_signature_DN or -CD8.signature.dn), then low expressed genes will have large ranks and produce positive scores. -In the use of singscore or Z scores, the undirected case is based on absolute values, so either direction, -in the extreme, will result in a large score. - -## Method options - -Some methods have some additional options. They are passed as a dictionary, method_params={param_name, param_value}.:: - - singscore: {'normalization', 'theoretical'}, {'normalization', 'standard'} - -The singscore manuscript describes the theoretical method of standarization which involves determining the theoretical max and minimum ranks for the given gene set.:: +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). - rank_biased_overlap: {'rbo_depth', n} (n: int) -Here, n is the depth that is decended down the ranks, where at each step, the overlap with the gene set is measured and added to the score.:: +Notebook using gmt gene set files ===>>> `Open In Colab `_ - ssGSEA: {'omega': 0.75} +Notebook using the Decoupler/Omnipath style API ===>>> `Open In Colab `_ -The ssGSEA method uses this parameter as a exponent to the ranks. It has been strongly suggested to use 0.75. +See the paper ===>>> `gssnng `_ -*The following methods do not have additional options.* +Contents +-------- - robust_std - mean_z - average_score - median_score - summed_up +.. toctree:: + with gmt gene set files + with a decoupler/omnipath style -References -========== -rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf +The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. +The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. +Nearest neighbor graphs (NNG) are constructed based on user defined groups (see the 'groupby' parameter below). The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be constructed within each cluster or jointly by cluster and sample. Smoothing can be performed using either the adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells. -singscore: https://pubmed.ncbi.nlm.nih.gov/30400809/ +This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. Gene sets can be provided by using .gmt files or through the OmniPath API (see below). -anndata: https://anndata.readthedocs.io/en/latest/ +Scoring functions work with ranked or unranked data ("your mileage may vary") -MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/ +.. note:: -ssGSEA: https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html + This project is under active development. Please consider using a named release. -decoupler: https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613 diff --git a/docs/main_index.rst b/docs/main_index.rst deleted file mode 100644 index 1c74866..0000000 --- a/docs/main_index.rst +++ /dev/null @@ -1,35 +0,0 @@ -gssnng -====== - -Try it out! ===>>> `Open In Colab `_ - -Decoupler/Omnipath style API ===>>> `Open In Colab `_ - -See the paper ===>>> `gssnng `_ - -Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). - -The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. -The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. -Nearest neighbor graphs (NNG) are constructed based on user defined groups (see the 'groupby' parameter below). The defined groups can be processed in parallel, speeding up the calculations. For example, a NNG could be constructed within each cluster or jointly by cluster and sample. Smoothing can be performed using either the adjacency matrix (all 1s) or the weighted graph to give less weight to more distant cells. - -This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. Gene sets can be provided by using .gmt files or through the OmniPath API (see below). - -Scoring functions work with ranked or unranked data ("your mileage may vary") - -// `Link text `_ -// Check out the :doc:`usage` section for further information, including -// how to :ref:`installation` the project. - -.. note:: - - This project is under active development. Please consider using a named release. - -Contents --------- - -.. toctree:: - usage - with gmt gene set files - with a decoupler/omnipath style - api diff --git a/gssnng/test/example_decoupler_omnipath_api.py b/gssnng/test/example_decoupler_omnipath_api.py new file mode 100644 index 0000000..f8eab67 --- /dev/null +++ b/gssnng/test/example_decoupler_omnipath_api.py @@ -0,0 +1,37 @@ +from gssnng import score_cells, gene_sets +import decoupler as dc +import scanpy as sc +import pandas as pd +import numpy as np +from scipy import sparse +import matplotlib.pyplot as plt +plt.rcParams['figure.figsize'] = [4.0, 3.0] + +# load the data +adata = sc.datasets.pbmc3k_processed() +adata.X = sparse.csr_matrix(adata.X) + +#Get the genesets +#This is a simple DataFrame, encoding the bipartite graph geneset-name-> genename. +#Note: These genesets contain genes negatively associated with the signature +# (i.e. low expression of a gene indicates the presence of a signature). +# We filter those out here as gssnng doesn't take into account the (negative) weight. +model = dc.get_progeny().query('weight>0') + + +score_cells.run_gssnng( + adata, model, + source='source',target='target', weight='weight', + groupby="louvain", # None + smooth_mode='connectivity', + recompute_neighbors=32, + score_method="mean_z", + method_params={}, # 'normalization':'standard' + ranked=False, + cores=6 +) + +acts_gss = dc.get_acts(adata, obsm_key='gssnng_estimate') + +sc.pl.umap(acts_gss, color=sorted(acts_gss.var_names), cmap='coolwarm') + From cd3b90f2c4cee9b2ca4f1def7b7ea27524a880ec Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 15:27:18 -0800 Subject: [PATCH 06/13] update to docs, examples, and smooth function --- README.md | 3 +- ...ecoupler_api.rst => decoupler_api_doc.rst} | 94 ++++++++++--------- docs/gmt_files_doc.rst | 75 +++++++-------- docs/smoothing_adatas.rst | 85 +++++++++++++++++ gssnng/{smooth_anndatas.py => nnsmooth.py} | 5 +- gssnng/score_cells.py | 5 + gssnng/test/example_decoupler_omnipath_api.py | 85 ++++++++++------- gssnng/test/example_gmt_input.py | 33 +++++++ gssnng/test/example_script.py | 58 ------------ gssnng/test/example_smoothing_counts.py | 29 ++++++ gssnng/test/test_return_smoothed.py | 2 +- 11 files changed, 292 insertions(+), 182 deletions(-) rename docs/{decoupler_api.rst => decoupler_api_doc.rst} (89%) create mode 100644 docs/smoothing_adatas.rst rename gssnng/{smooth_anndatas.py => nnsmooth.py} (93%) create mode 100644 gssnng/test/example_gmt_input.py delete mode 100644 gssnng/test/example_script.py create mode 100644 gssnng/test/example_smoothing_counts.py diff --git a/README.md b/README.md index 6bc85f9..81d6079 100644 --- a/README.md +++ b/README.md @@ -9,12 +9,13 @@ This package works with Scanpy AnnData objects stored as h5ad files. * **Notebook using Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) + * **Notebook for smoothing counts, breaks AnnData into groups. + * **See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) * and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) - The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. diff --git a/docs/decoupler_api.rst b/docs/decoupler_api_doc.rst similarity index 89% rename from docs/decoupler_api.rst rename to docs/decoupler_api_doc.rst index 820c853..04c8835 100644 --- a/docs/decoupler_api.rst +++ b/docs/decoupler_api_doc.rst @@ -27,6 +27,8 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq `**Notebook using Decoupler/Omnipath style API** `_ +`**Notebook for creating smoothed count matrices**`_ + `**See the paper** `_ This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. @@ -45,55 +47,24 @@ Installation Install the package using the following commands:: - pip3 install gssnng - - - -Installation from GitHub -======================== - - git clone https://github.com/IlyaLab/gssnng - - pip install -e gssnng - - - -Scoring Functions -================= - -The list of scoring functions:: - - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. - - singscore: Normalised mean (median centered) ranks (requires ranked data) - - ssGSEA: Single sample GSEA based on ranked data. - - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. - - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). - - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - - average_score: Mean ranks or counts - - median_score: Median of counts or ranks - - summed_up: Sum up the ranks or counts. + python3 -m pip install gssnng + # or to from github + python3 -m pip install git+https://github.com/IlyaLab/gssnng Example script ============== -Copy the script out from the cloned repo and run, check the paths if you get an error.:: +Copy the script out from the cloned repo and run, check the paths if you get an error. + +:: cp gssnng/gssnng/test/example_decoupler_omnipath_api.py . python3.10 example_decoupler_omnipath_api.py - Usage ====== @@ -101,36 +72,68 @@ See gssnng/notebooks for examples on all methods. 1. Read in an AnnData object using scanpy (an h5ad file). -2. Get the model from omnipath via the decoupler API. +2. Get the model from omnipath via the decoupler API. You may want to filter out genes negatively associated with the pathway, see the example. -3. Score cells, each gene set will show up as a column in adata.obs. +3. Score cells, each gene set will show up as a column in adata.obsm['gssnng_estimate']. -.. code-block:: +:: from gssnng import score_cells q = sc.datasets.pbmc3k_processed() + # OmniPath Model # model = dc.get_progeny().query('weight>0') score_cells.run_gssnng( adata, model, source='source',target='target', weight='weight', - groupby="louvain", # None + groupby="louvain", smooth_mode='connectivity', recompute_neighbors=32, score_method="mean_z", - method_params={}, # 'normalization':'standard' + method_params={}, ranked=False, cores=6 ) - #Extracts activities as AnnData object. + # Extracts activities as AnnData object. acts_gss = dc.get_acts(adata, obsm_key='gssnng_estimate') + # Now we can plot the gene set scores sc.pl.umap(acts_gss, color=sorted(acts_gss.var_names), cmap='coolwarm') + + +Scoring Functions +================= + +The list of scoring functions:: + + geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + + singscore: Normalised mean (median centered) ranks (requires ranked data) + + ssGSEA: Single sample GSEA based on ranked data. + + rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + + robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + + mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + + average_score: Mean ranks or counts + + median_score: Median of counts or ranks + + summed_up: Sum up the ranks or counts. + + + + + + Parameters ========== @@ -142,6 +145,11 @@ These parameters are used with the "scores_cells.with_gene_sets" function.:: model: str The decoupler gene set model. See Omnipath Wrappers (https://saezlab.github.io/decoupleR/reference/index.html#omnipath-wrappers). + source: str + weight: str + target: str + Each pathway in OmniPath is a collection of *target* genes from a *source* (i.e. pathway), where each has an interaction *weight*. + groupby: [str, list, dict] either a column label in adata.obs, and all categories taken, or a dict specifies one group. SEE DESCRIPTION BELOW diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index fc22402..e3d94e7 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -27,6 +27,8 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq `**Notebook using Decoupler/Omnipath style API** `_ +`**Notebook for creating smoothed count matrices**`_ + `**See the paper** `_ @@ -46,53 +48,23 @@ Installation Install the package using the following commands:: - pip3 install gssnng - - - -Installation from GitHub -======================== - - git clone https://github.com/IlyaLab/gssnng - - pip install -e gssnng - - - -Scoring Functions -================= - -The list of scoring functions:: - - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. - - singscore: Normalised mean (median centered) ranks (requires ranked data) - - ssGSEA: Single sample GSEA based on ranked data. - - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. - - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). - - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - - average_score: Mean ranks or counts - - median_score: Median of counts or ranks - - summed_up: Sum up the ranks or counts. + python3 -m pip install gssnng + # or to from github + python3 -m pip install git+https://github.com/IlyaLab/gssnng Example script ============== -Copy the script out from the cloned repo and run, check the paths if you get an error.:: +Copy the script out from the cloned repo and run, check the paths if you get an error. + +:: - cp gssnng/gssnng/test/example_script.py . + cp gssnng/gssnng/test/example_gmt_input.py . - python3.10 example_script.py + python3.10 example_gmt_input.py Usage @@ -102,11 +74,11 @@ See gssnng/notebooks for examples on all methods. 1. Read in an AnnData object using scanpy (an h5ad file). -2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN) +2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN), see below for more details. 3. Score cells, each gene set will show up as a column in adata.obs. -.. code-block:: +:: from gssnng import score_cells @@ -124,6 +96,29 @@ See gssnng/notebooks for examples on all methods. sc.pl.umap(q, color=['louvain','T.cells.CD8.up'], wspace=0.35) +Scoring Functions +================= + +The list of scoring functions: + + geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + + singscore: Normalised mean (median centered) ranks (requires ranked data) + + ssGSEA: Single sample GSEA based on ranked data. + + rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + + robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + + mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + + average_score: Mean ranks or counts + + median_score: Median of counts or ranks + + summed_up: Sum up the ranks or counts. + Parameters ========== diff --git a/docs/smoothing_adatas.rst b/docs/smoothing_adatas.rst new file mode 100644 index 0000000..3d4b8e8 --- /dev/null +++ b/docs/smoothing_adatas.rst @@ -0,0 +1,85 @@ +.. GSSNNG documentation master file, created by +sphinx-quickstart on Wed Apr 27 09:20:15 2022. +You can adapt this file completely to your liking, but it should at least +contain the root `toctree` directive. + +gssnng to make smoothed count matrices +====================================== + +Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). + +.. + .. toctree:: + :caption: Table of Contents + :maxdepth: 2 + + Installation + Scoring Functions + Example script + Usage + Parameters + Groupby + Gene sets + References + + +`**Notebook using gmt files** `_ + +`**Notebook using Decoupler/Omnipath style API** `_ + +`**Notebook for creating smoothed count matrices**`_ + +`**See the paper** `_ + + +This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. +For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. + + +Installation +============ + +Install the package using the following commands:: + + python3 -m pip install gssnng + + # or to from github + python3 -m pip install git+https://github.com/IlyaLab/gssnng + + + +Example script +============== + +Copy the script out from the cloned repo and run, check the paths if you get an error. + +:: + + cp gssnng/gssnng/test/example_smoothing_counts.py . + + python3.10 example_smoothing_counts.py + + +Usage +====== + +See gssnng/notebooks for examples on all methods. + +1. Read in an AnnData object using scanpy (an h5ad file). + +2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN), see below for more details. + +3. Score cells, each gene set will show up as a column in adata.obs. + +:: + + from gssnng import nnsmooth + + q = sc.datasets.pbmc3k_processed() + + q_list = nnsmooth.smooth_adata(adata=q, # AnnData object + groupby='louvain', # Will sample neighbors within this group, can take a list + smooth_mode='connectivity', # Smooths matrix using distance weights from NN graph. + recompute_neighbors=32, # Rebuild nearest neighbor graph with groups, 0 turns off function + cores=4) # Smoothed in parallel. + diff --git a/gssnng/smooth_anndatas.py b/gssnng/nnsmooth.py similarity index 93% rename from gssnng/smooth_anndatas.py rename to gssnng/nnsmooth.py index f76f34e..85b1be9 100644 --- a/gssnng/smooth_anndatas.py +++ b/gssnng/nnsmooth.py @@ -3,12 +3,11 @@ from gssnng.util import error_checking from typing import Union -def smooth_anndata( +def smooth_adata( adata: anndata.AnnData, groupby: Union[str, list, dict], smooth_mode: str, recompute_neighbors: int, - method_params: dict, cores: int ) -> anndata.AnnData: @@ -45,7 +44,7 @@ def smooth_anndata( # score each cell with the list of gene sets data_list = _proc_data(adata, None, groupby, smooth_mode, recompute_neighbors, - None, method_params, samp_neighbors, + None, None, samp_neighbors, noise_trials, None, cores, return_data) print("**done**") diff --git a/gssnng/score_cells.py b/gssnng/score_cells.py index 69eae1e..9be2636 100644 --- a/gssnng/score_cells.py +++ b/gssnng/score_cells.py @@ -11,6 +11,11 @@ from typing import Union from multiprocessing import Pool +from numba.core.errors import NumbaDeprecationWarning, NumbaPendingDeprecationWarning +import warnings + +warnings.simplefilter('ignore', category=NumbaDeprecationWarning) +warnings.simplefilter('ignore', category=NumbaPendingDeprecationWarning) def run_gssnng( mat, net, source, target, weight, diff --git a/gssnng/test/example_decoupler_omnipath_api.py b/gssnng/test/example_decoupler_omnipath_api.py index f8eab67..5a47654 100644 --- a/gssnng/test/example_decoupler_omnipath_api.py +++ b/gssnng/test/example_decoupler_omnipath_api.py @@ -1,37 +1,50 @@ -from gssnng import score_cells, gene_sets -import decoupler as dc -import scanpy as sc -import pandas as pd -import numpy as np -from scipy import sparse -import matplotlib.pyplot as plt -plt.rcParams['figure.figsize'] = [4.0, 3.0] - -# load the data -adata = sc.datasets.pbmc3k_processed() -adata.X = sparse.csr_matrix(adata.X) - -#Get the genesets -#This is a simple DataFrame, encoding the bipartite graph geneset-name-> genename. -#Note: These genesets contain genes negatively associated with the signature -# (i.e. low expression of a gene indicates the presence of a signature). -# We filter those out here as gssnng doesn't take into account the (negative) weight. -model = dc.get_progeny().query('weight>0') - - -score_cells.run_gssnng( - adata, model, - source='source',target='target', weight='weight', - groupby="louvain", # None - smooth_mode='connectivity', - recompute_neighbors=32, - score_method="mean_z", - method_params={}, # 'normalization':'standard' - ranked=False, - cores=6 -) - -acts_gss = dc.get_acts(adata, obsm_key='gssnng_estimate') - -sc.pl.umap(acts_gss, color=sorted(acts_gss.var_names), cmap='coolwarm') +if __name__ == '__main__': + import time + from gssnng import score_cells, gene_sets + import decoupler as dc + import omnipath + import scanpy as sc + #from scipy import sparse + + + # load the data + adata = sc.datasets.pbmc3k_processed() + #adata.X = sparse.csr_matrix(adata.X) # + + #Get the genesets + #This is a simple DataFrame, encoding the bipartite graph geneset-name-> genename. + #Note: These genesets contain genes negatively associated with the signature + # (i.e. low expression of a gene indicates the presence of a signature). + # We filter those out here as gssnng doesn't take into account the (negative) weight. + print("getting model") + model = dc.get_progeny().query('weight>0') + + print(model.head()) + + t0 = time.time() + print('gssnng start time: ' + str(t0)) + + score_cells.run_gssnng( + adata, model, + source='source',target='target', weight='weight', + groupby="louvain", + smooth_mode='connectivity', + recompute_neighbors=12, + score_method="summed_up", + method_params={}, + ranked=False, + cores=4 + ) + + # get the results back as an adata + acts_gss = dc.get_acts(adata, obsm_key='gssnng_estimate') + + # and we can visualize it. + #sc.pl.umap(acts_gss, color=sorted(acts_gss.var_names), cmap='coolwarm') + print(acts_gss.obsm['gssnng_estimate'].head()) + + t1 = time.time() + print('end time: ' + str(t1)) + print('TOTAL TIME: ' + str(t1 - t0)) + print("done") \ No newline at end of file diff --git a/gssnng/test/example_gmt_input.py b/gssnng/test/example_gmt_input.py new file mode 100644 index 0000000..7bf6a52 --- /dev/null +++ b/gssnng/test/example_gmt_input.py @@ -0,0 +1,33 @@ +if __name__ == '__main__': + + from gssnng import score_cells + import scanpy as sc + import time + + print("reading data") + q = sc.datasets.pbmc3k_processed() + + t0 = time.time() + print('start time: ' + str(t0)) + + print("scoring cells") + score_cells.with_gene_sets( + adata=q, + gene_set_file="data/cibersort_lm22.gmt", + groupby='louvain', + smooth_mode='connectivity', + recompute_neighbors=13, + score_method="geneset_overlap", + method_params= {'threshold': 0, 'fraction': False}, + ranked=True, + cores=4 + ) + + t1 = time.time() + + print("MEAN GENESET OVERLAP for the T.cells.CD8.up signature") + print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) + + print('end time: ' + str(t1)) + print('TOTAL TIME: ' + str(t1-t0)) + print("done") \ No newline at end of file diff --git a/gssnng/test/example_script.py b/gssnng/test/example_script.py deleted file mode 100644 index d2c3d87..0000000 --- a/gssnng/test/example_script.py +++ /dev/null @@ -1,58 +0,0 @@ -if __name__ == '__main__': - - from gssnng import score_cells - import scanpy as sc - import time - - print("reading data") - q = sc.datasets.pbmc3k_processed() - - print("computing neighborhood") - sc.pp.neighbors(q, n_neighbors=32) - - t0 = time.time() - print('start time: ' + str(t0)) - - print("scoring cells") - score_cells.with_gene_sets( - adata=q, - gene_set_file="gssnng/gssnng/test/data/cibersort_lm22.gmt", - groupby='louvain', - smooth_mode='connectivity', - recompute_neighbors=0, - score_method="geneset_overlap", - method_params= {'threshold': 0, 'fraction': False}, - ranked=True, - cores=8 - ) - - t1 = time.time() - - print("MEAN GENESET OVERLAP for the T.cells.CD8.up signature") - print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) - - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print("done") - - print("scoring cells ... again!") - score_cells.with_gene_sets( - adata=q, - gene_set_file="gssnng/gssnng/test/data/cibersort_lm22.gmt", - groupby='louvain', - smooth_mode='connectivity', - recompute_neighbors=0, - score_method="summed_up", - method_params={}, - ranked=True, - cores=8 - ) - - t2 = time.time() - - print("MEAN SUMMED_UP SCORES for the T.cells.CD8.up signature") - print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) - - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print("done") \ No newline at end of file diff --git a/gssnng/test/example_smoothing_counts.py b/gssnng/test/example_smoothing_counts.py new file mode 100644 index 0000000..4f4a529 --- /dev/null +++ b/gssnng/test/example_smoothing_counts.py @@ -0,0 +1,29 @@ +if __name__ == '__main__': + + from gssnng import nnsmooth + import scanpy as sc + import time + + print("reading data") + q = sc.datasets.pbmc3k_processed() + + t0 = time.time() + print('start time: ' + str(t0)) + + print("scoring cells") + q_list = smooth_anndatas.smooth_anndata( + adata=q, + groupby='louvain', + smooth_mode='connectivity', + recompute_neighbors=0, + cores=8 + ) + + t1 = time.time() + + print("Adata List with SMooTHed counts.") + print(len(q_list)) + + print('end time: ' + str(t1)) + print('TOTAL TIME: ' + str(t1-t0)) + print("done") diff --git a/gssnng/test/test_return_smoothed.py b/gssnng/test/test_return_smoothed.py index d75b33c..79ebd23 100644 --- a/gssnng/test/test_return_smoothed.py +++ b/gssnng/test/test_return_smoothed.py @@ -1,7 +1,7 @@ if __name__ == '__main__': import scanpy as sc - from gssnng.smooth_anndatas import smooth_anndata + from gssnng.nnsmooth import smooth_anndata import time def test_return_smoothed(adata): From 35a84e4901197e128fb6f0a157b13463f668b716 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 16:07:19 -0800 Subject: [PATCH 07/13] update to docs, examples, and smooth function --- docs/decoupler_api_doc.rst | 2 +- docs/gmt_files_doc.rst | 2 +- docs/smoothing_adatas.rst | 62 +++++++++++++++++++++++-- gssnng/nnsmooth.py | 51 -------------------- gssnng/smoothing.py | 57 +++++++++++++++++++++++ gssnng/test/example_smoothing_counts.py | 18 +++---- 6 files changed, 127 insertions(+), 65 deletions(-) delete mode 100644 gssnng/nnsmooth.py diff --git a/docs/decoupler_api_doc.rst b/docs/decoupler_api_doc.rst index 04c8835..3c993c7 100644 --- a/docs/decoupler_api_doc.rst +++ b/docs/decoupler_api_doc.rst @@ -14,9 +14,9 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :maxdepth: 2 Installation - Scoring Functions Example script Usage + Scoring Functions Parameters Groupby Gene sets diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index e3d94e7..9b950f1 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -14,9 +14,9 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :maxdepth: 2 Installation - Scoring Functions Example script Usage + Scoring Functions Parameters Groupby Gene sets diff --git a/docs/smoothing_adatas.rst b/docs/smoothing_adatas.rst index 3d4b8e8..c2f239d 100644 --- a/docs/smoothing_adatas.rst +++ b/docs/smoothing_adatas.rst @@ -14,7 +14,6 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :maxdepth: 2 Installation - Scoring Functions Example script Usage Parameters @@ -73,13 +72,68 @@ See gssnng/notebooks for examples on all methods. :: - from gssnng import nnsmooth + from gssnng import smoothing q = sc.datasets.pbmc3k_processed() - q_list = nnsmooth.smooth_adata(adata=q, # AnnData object + q_list = smoothing.smooth_adata(adata=q, # AnnData object groupby='louvain', # Will sample neighbors within this group, can take a list smooth_mode='connectivity', # Smooths matrix using distance weights from NN graph. - recompute_neighbors=32, # Rebuild nearest neighbor graph with groups, 0 turns off function + recompute_neighbors=11, # Rebuild nearest neighbor graph with groups, 0 turns off function cores=4) # Smoothed in parallel. + + +Parameters +========== + +These parameters are used with the "scores_cells.with_gene_sets" function.:: + + adata: AnnData object from scanpy.read_* + AnnData containing the cells to be scored + + groupby: [str, list, dict] + either a column label in adata.obs, and all categories taken, or a dict specifies one group. + SEE DESCRIPTION BELOW + + smooth_mode: "adjacency", "connectivity", or "off" + Dictates how to use the neighborhood graph. + `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more + + recompute_neighbors: int + should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N + + cores: int + number of parallel processes to work through groupby groups + + +Groupby +======= + +The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example +above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the +neighborhood and will available for sampling. + +Groupby specifies a column name that's found in the AnnData.obs table, and it can also take a list of column names. +In that case, cells will be grouped as the intersection of categories. For example, using groupby=['louvain','phenotype'] +will take cells that are first in a given louvain cluster and then also in a given phenotype group. By also setting +the recompute_neighbors, the nearest neighbor graph is recomputed within this subset of cells. Controlling the +neighborhood leads to more controlled smoothing of the count matrix and is more suitable for downstream comparisons. + + +References +========== + +rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf + +singscore: https://pubmed.ncbi.nlm.nih.gov/30400809/ + +anndata: https://anndata.readthedocs.io/en/latest/ + +MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/ + +ssGSEA: https://gsea-msigdb.github.io/ssGSEA-gpmodule/v10/index.html + +decoupler: https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac016/6544613 + +omnipath: https://omnipathdb.org/ diff --git a/gssnng/nnsmooth.py b/gssnng/nnsmooth.py deleted file mode 100644 index 85b1be9..0000000 --- a/gssnng/nnsmooth.py +++ /dev/null @@ -1,51 +0,0 @@ -import anndata -from gssnng.score_cells import _proc_data -from gssnng.util import error_checking -from typing import Union - -def smooth_adata( - adata: anndata.AnnData, - groupby: Union[str, list, dict], - smooth_mode: str, - recompute_neighbors: int, - cores: int - ) -> anndata.AnnData: - - """ - nearest neighbor smoothing of the expression matrix - - :param adata - anndata.AnnData containing the cells to be scored - :param groupby - either a column label in adata.obs, and all categories taken, or a dict specifies one group. - :param smooth_mode - `adjacency` or `connectivity`, which representation of the neighborhood graph to use. - `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more - :param recompute_neighbors - should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N - :param method_params - specific params for each method. - :param cores - number of parallel processes to work through groupby groups - - :returns: a list of adatas with smoothed data - """ - - return_data = 1 - noise_trials = 0 ### not used currently - samp_neighbors = None ### also not used - just_smoothing=1 - - error_checking(adata, samp_neighbors, recompute_neighbors, - None, None, None, method_params, just_smoothing) - - if method_params == None: - method_params = dict() - - # score each cell with the list of gene sets - data_list = _proc_data(adata, None, groupby, smooth_mode, recompute_neighbors, - None, None, samp_neighbors, - noise_trials, None, cores, return_data) - - print("**done**") - return(data_list) diff --git a/gssnng/smoothing.py b/gssnng/smoothing.py index 2c02a5c..31265ea 100644 --- a/gssnng/smoothing.py +++ b/gssnng/smoothing.py @@ -1,6 +1,11 @@ +#from gssnng.score_cells import _proc_data +import gssnng +from gssnng.util import error_checking +from typing import Union import numpy as np from scipy import sparse import logging +import anndata NN_DISTANCE_KEY = 'distances' # scanpy names in .obsp @@ -10,6 +15,58 @@ # multiplying should leave a "one-vector" still sum to one +# returns a list of adatas, each with a nearest neighbor smoothed expression matrix +def smooth_adata( + adata: anndata.AnnData, + groupby: Union[str, list, dict], + smooth_mode: str, + recompute_neighbors: int, + cores: int + ) -> anndata.AnnData: + + """ + returns a list of adatas, each with a nearest neighbor smoothed expression matrix + + :param adata + anndata.AnnData containing the cells to be scored + :param groupby + either a column label in adata.obs, and all categories taken, or a dict specifies one group. + :param smooth_mode + `adjacency` or `connectivity`, which representation of the neighborhood graph to use. + `adjacency` weights all neighbors equally, `connectivity` weights close neighbors more + :param recompute_neighbors + should neighbors be recomputed within each group, 0 for no, >0 for yes and specifies N + :param method_params + specific params for each method. + :param cores + number of parallel processes to work through groupby groups + + :returns: a list of adatas with smoothed data + """ + + return_data = 1 + noise_trials = 0 ### not used currently + samp_neighbors = None ### also not used + just_smoothing=1 + + # no params for now + method_params = dict() + + error_checking(adata, samp_neighbors, recompute_neighbors, + None, None, None, method_params, just_smoothing) + + + # score each cell with the list of gene sets + data_list = gssnng.score_cells._proc_data(adata, None, groupby, smooth_mode, recompute_neighbors, + None, method_params, samp_neighbors, + noise_trials, None, cores, return_data) + + print("**done**") + return(data_list) + + + + def get_smoothing_matrix(adata, mode, add_diag): """ using the nearest neighbor graph in adata.obsp, calculate the smoothing diff --git a/gssnng/test/example_smoothing_counts.py b/gssnng/test/example_smoothing_counts.py index 4f4a529..3ebc929 100644 --- a/gssnng/test/example_smoothing_counts.py +++ b/gssnng/test/example_smoothing_counts.py @@ -1,28 +1,30 @@ -if __name__ == '__main__': +from gssnng import smoothing +import scanpy as sc +import time - from gssnng import nnsmooth - import scanpy as sc - import time +if __name__ == '__main__': print("reading data") q = sc.datasets.pbmc3k_processed() t0 = time.time() - print('start time: ' + str(t0)) + print('starting the smOOthing') - print("scoring cells") - q_list = smooth_anndatas.smooth_anndata( + q_list = smoothing.smooth_adata( adata=q, groupby='louvain', smooth_mode='connectivity', - recompute_neighbors=0, + recompute_neighbors=11, cores=8 ) t1 = time.time() print("Adata List with SMooTHed counts.") + print("Each is a tuple with groupby category and adata as elements.") print(len(q_list)) + for qi in q_list: + print(qi[1] + " X size: " + str(qi[0].X.shape)) print('end time: ' + str(t1)) print('TOTAL TIME: ' + str(t1-t0)) From 4e0a3f6ddc7578ee5e2d4358f27814d4bcdb43d1 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 16:41:47 -0800 Subject: [PATCH 08/13] example scripts now run as pytests --- docs/index.rst | 7 ++-- gssnng/test/example_decoupler_omnipath_api.py | 22 +++++----- gssnng/test/example_gmt_input.py | 13 +++--- gssnng/test/example_smoothing_counts.py | 5 ++- gssnng/test/geneset_overlap_test.py | 38 ------------------ gssnng/test/score_cells_test.py | 40 ------------------- gssnng/test/ssgsea_test.py | 36 ----------------- gssnng/test/test_example_scripts.py | 28 +++++++++++++ gssnng/test/test_return_smoothed.py | 29 -------------- setup.py | 2 +- 10 files changed, 57 insertions(+), 163 deletions(-) delete mode 100644 gssnng/test/geneset_overlap_test.py delete mode 100644 gssnng/test/score_cells_test.py delete mode 100644 gssnng/test/ssgsea_test.py create mode 100644 gssnng/test/test_example_scripts.py delete mode 100644 gssnng/test/test_return_smoothed.py diff --git a/docs/index.rst b/docs/index.rst index 54747c2..4671f9e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,7 +3,6 @@ gssnng Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). - Notebook using gmt gene set files ===>>> `Open In Colab `_ Notebook using the Decoupler/Omnipath style API ===>>> `Open In Colab `_ @@ -14,9 +13,9 @@ Contents -------- .. toctree:: - with gmt gene set files - with a decoupler/omnipath style - + gssnng with gene set gmt files + gssnng with a decoupler/omnipath style + gssnng to smooth count matrices The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. diff --git a/gssnng/test/example_decoupler_omnipath_api.py b/gssnng/test/example_decoupler_omnipath_api.py index 5a47654..faa4a81 100644 --- a/gssnng/test/example_decoupler_omnipath_api.py +++ b/gssnng/test/example_decoupler_omnipath_api.py @@ -1,13 +1,14 @@ -if __name__ == '__main__': - import time - from gssnng import score_cells, gene_sets - import decoupler as dc - import omnipath - import scanpy as sc - #from scipy import sparse +import time +from gssnng import score_cells, gene_sets +import decoupler as dc +import omnipath +import scanpy as sc +#from scipy import sparse +def decoupler_test(): + # load the data adata = sc.datasets.pbmc3k_processed() #adata.X = sparse.csr_matrix(adata.X) # @@ -17,7 +18,7 @@ #Note: These genesets contain genes negatively associated with the signature # (i.e. low expression of a gene indicates the presence of a signature). # We filter those out here as gssnng doesn't take into account the (negative) weight. - print("getting model") + print("getting omnipath model") model = dc.get_progeny().query('weight>0') print(model.head()) @@ -47,4 +48,7 @@ t1 = time.time() print('end time: ' + str(t1)) print('TOTAL TIME: ' + str(t1 - t0)) - print("done") \ No newline at end of file + print("done with decoupler api") + +#if __name__ == '__main__': +decoupler_test() \ No newline at end of file diff --git a/gssnng/test/example_gmt_input.py b/gssnng/test/example_gmt_input.py index 7bf6a52..5e8ef38 100644 --- a/gssnng/test/example_gmt_input.py +++ b/gssnng/test/example_gmt_input.py @@ -1,8 +1,8 @@ -if __name__ == '__main__': +from gssnng import score_cells +import scanpy as sc +import time - from gssnng import score_cells - import scanpy as sc - import time +def run_gmt_input(): print("reading data") q = sc.datasets.pbmc3k_processed() @@ -30,4 +30,7 @@ print('end time: ' + str(t1)) print('TOTAL TIME: ' + str(t1-t0)) - print("done") \ No newline at end of file + print("done") + +#if __name__ == '__main__': +run_gmt_input() \ No newline at end of file diff --git a/gssnng/test/example_smoothing_counts.py b/gssnng/test/example_smoothing_counts.py index 3ebc929..457dd2b 100644 --- a/gssnng/test/example_smoothing_counts.py +++ b/gssnng/test/example_smoothing_counts.py @@ -2,7 +2,7 @@ import scanpy as sc import time -if __name__ == '__main__': +def run_smoothing_example(): print("reading data") q = sc.datasets.pbmc3k_processed() @@ -29,3 +29,6 @@ print('end time: ' + str(t1)) print('TOTAL TIME: ' + str(t1-t0)) print("done") + +#if __name__ == '__main__': +run_smoothing_example() diff --git a/gssnng/test/geneset_overlap_test.py b/gssnng/test/geneset_overlap_test.py deleted file mode 100644 index d7a97fe..0000000 --- a/gssnng/test/geneset_overlap_test.py +++ /dev/null @@ -1,38 +0,0 @@ -if __name__ == '__main__': - - import scanpy as sc - from gssnng.score_cells import with_gene_sets - import time - - def test_score_all_sets_fun(adata, genesets): - res0 = with_gene_sets(adata=adata, - gene_set_file=genesets, - groupby='louvain', - smooth_mode='adjacency', - recompute_neighbors=32, - score_method='geneset_overlap', #'rank_biased_overlap', #, - method_params={'threshold':0}, #{'rbo_depth':50}, #{'normalization':'theoretical'}, - ranked=False, - cores=6) - return(res0) - - - def test_score_all_sets(): - q = sc.datasets.pbmc3k_processed() - gs = 'data/cibersort_lm22.gmt' # 'data/gene_set_test.gmt' #'data/cibersort_lm22.gmt' # - print("computing knn...") - sc.pp.neighbors(q, n_neighbors=32) - print('scoring...') - t0 = time.time() - print('start time: ' + str(t0)) - score_list = test_score_all_sets_fun(q, gs) - print('******DONE*******') - t1 = time.time() - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print(q.obs.head()) - print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) - #q.write_h5ad('data/pbmc3k_lm22_scores.h5ad') - - test_score_all_sets() - print('test score_all_sets done') diff --git a/gssnng/test/score_cells_test.py b/gssnng/test/score_cells_test.py deleted file mode 100644 index e70a961..0000000 --- a/gssnng/test/score_cells_test.py +++ /dev/null @@ -1,40 +0,0 @@ -if __name__ == '__main__': - - import scanpy as sc - from gssnng import score_cells - import time - - def test_score_all_sets_fun(adata, gene_set_file): - q=score_cells.with_gene_sets( - adata=adata, - gene_set_file=gene_set_file, - groupby="louvain", - smooth_mode='connectivity', - recompute_neighbors=0, - score_method="median_score", - method_params={}, - ranked=True, - cores=6 - ) - return(q) - - - def test_score_all_sets(): - q = sc.datasets.pbmc3k_processed() - gs = 'data/cibersort_lm22.gmt' # 'data/gene_set_test.gmt' #'data/cibersort_lm22.gmt' # - print("computing knn...") - sc.pp.neighbors(q, n_neighbors=32) - print('scoring...') - t0 = time.time() - print('start time: ' + str(t0)) - score_list = test_score_all_sets_fun(q, gs) - print('******DONE*******') - t1 = time.time() - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print(q.obs.head()) - print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) - #q.write_h5ad('data/pbmc3k_lm22_scores.h5ad') - - test_score_all_sets() - print('test score_all_sets done') diff --git a/gssnng/test/ssgsea_test.py b/gssnng/test/ssgsea_test.py deleted file mode 100644 index aada94a..0000000 --- a/gssnng/test/ssgsea_test.py +++ /dev/null @@ -1,36 +0,0 @@ -if __name__ == '__main__': - - from gssnng import score_cells - import scanpy as sc - import time - - print("reading data") - q = sc.datasets.pbmc3k_processed() - - print("computing neighborhood") - sc.pp.neighbors(q, n_neighbors=64) - - t0 = time.time() - print('start time: ' + str(t0)) - - print("scoring cells") - score_cells.with_gene_sets( - adata=q, - gene_set_file="data/cibersort_lm22.gmt", - groupby='louvain', - smooth_mode='connectivity', - recompute_neighbors=32, - score_method="ssgsea", - method_params={'omega':0.25}, - ranked=True, - cores=8 - ) - - t1 = time.time() - - print("MEAN SCORES for the T.cells.CD8.up signature") - print(q.obs.groupby(['louvain'])['T.cells.CD8.up'].mean().reset_index()) - - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print("done") diff --git a/gssnng/test/test_example_scripts.py b/gssnng/test/test_example_scripts.py new file mode 100644 index 0000000..d9163b8 --- /dev/null +++ b/gssnng/test/test_example_scripts.py @@ -0,0 +1,28 @@ + +import time +import pathlib +import runpy + +def run_examples(): + t0 = time.time() + print('start time: ' + str(t0)) + # grabbing the example scripts + scriptPath = pathlib.Path('.') + scripts = list(scriptPath.glob('example_*.py')) + + print("RUNNING SCRIPTS: ") + for si in scripts: + print(si) + runpy.run_path(si) + print("") + + print('******DONE*******') + t1 = time.time() + print('end time: ' + str(t1)) + print('TOTAL TIME: ' + str(t1-t0)) + print('test done') + return(True) + +def test_that_examples_run(): + assert run_examples() == True ## The up and dn sets should be combined into one. + diff --git a/gssnng/test/test_return_smoothed.py b/gssnng/test/test_return_smoothed.py deleted file mode 100644 index 79ebd23..0000000 --- a/gssnng/test/test_return_smoothed.py +++ /dev/null @@ -1,29 +0,0 @@ -if __name__ == '__main__': - - import scanpy as sc - from gssnng.nnsmooth import smooth_anndata - import time - - def test_return_smoothed(adata): - res0 = smooth_anndata(adata=adata, - groupby='louvain', - smooth_mode='adjacency', - recompute_neighbors=32, - method_params={}, - cores=4) - return(res0) - - - def test_score_all_sets(): - q = sc.datasets.pbmc3k_processed() - t0 = time.time() - print('start time: ' + str(t0)) - data_list = test_return_smoothed(q) - print('******DONE*******') - t1 = time.time() - print('end time: ' + str(t1)) - print('TOTAL TIME: ' + str(t1-t0)) - print(len(data_list)) - - test_score_all_sets() - print('test done') diff --git a/setup.py b/setup.py index eed9719..16a7f56 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ def readme(): return f.read() setup(name='gssnng', - version='0.4.2', + version='0.5.0', description='Gene Set Scoring on the Nearest Neighbor Graph (gssnng)', url='http://github.com/gibbsdavidl/gssnng', author='David Gibbs,Michael Strasser', From 97617f42cbd2d51c51cc5eea66abec5cbe199b9f Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 17:03:57 -0800 Subject: [PATCH 09/13] update to readthedocs --- README.md | 6 ++---- docs/decoupler_api_doc.rst | 9 +-------- docs/gmt_files_doc.rst | 11 +++-------- docs/index.rst | 4 ++-- ...{smoothing_adatas.rst => smoothing_adatas_doc.rst} | 9 ++------- 5 files changed, 10 insertions(+), 29 deletions(-) rename docs/{smoothing_adatas.rst => smoothing_adatas_doc.rst} (97%) diff --git a/README.md b/README.md index 81d6079..8d2de52 100644 --- a/README.md +++ b/README.md @@ -4,18 +4,16 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq This package works with Scanpy AnnData objects stored as h5ad files. + * **[Read the Docs!](https://gssnng.readthedocs.io/en/latest/)** * **Notebook using gmt files ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/gssnng_quick_start.ipynb) * **Notebook using Decoupler/Omnipath style API ===>>>** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IlyaLab/gssnng/blob/main/notebooks/Scoring_PBMC_data_with_the_GSSNNG_decoupleR_API.ipynb) - * **Notebook for smoothing counts, breaks AnnData into groups. + * **Notebook for smoothing counts. COMING SOON! For now, see the example script in test. * **See the paper ===>>>** [gssnng](https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad150/7321111?login=false) - * and finally, [Read the Docs!](https://gssnng.readthedocs.io/en/latest/) - - The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. diff --git a/docs/decoupler_api_doc.rst b/docs/decoupler_api_doc.rst index 3c993c7..92fbf36 100644 --- a/docs/decoupler_api_doc.rst +++ b/docs/decoupler_api_doc.rst @@ -13,14 +13,7 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :caption: Table of Contents :maxdepth: 2 - Installation - Example script - Usage - Scoring Functions - Parameters - Groupby - Gene sets - References + ___Usage `**Notebook using gmt files** `_ diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index 9b950f1..8fd3030 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -13,14 +13,9 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :caption: Table of Contents :maxdepth: 2 - Installation - Example script - Usage - Scoring Functions - Parameters - Groupby - Gene sets - References + ___Usage + ___Scoring Functions + ___Parameters `**Notebook using gmt files** `_ diff --git a/docs/index.rst b/docs/index.rst index 4671f9e..a71ad5c 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -14,8 +14,8 @@ Contents .. toctree:: gssnng with gene set gmt files - gssnng with a decoupler/omnipath style - gssnng to smooth count matrices + gssnng with a decoupler/omnipath style + gssnng to smooth count matrices The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. diff --git a/docs/smoothing_adatas.rst b/docs/smoothing_adatas_doc.rst similarity index 97% rename from docs/smoothing_adatas.rst rename to docs/smoothing_adatas_doc.rst index c2f239d..dfc730c 100644 --- a/docs/smoothing_adatas.rst +++ b/docs/smoothing_adatas_doc.rst @@ -13,13 +13,8 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :caption: Table of Contents :maxdepth: 2 - Installation - Example script - Usage - Parameters - Groupby - Gene sets - References + __Usage + `**Notebook using gmt files** `_ From cdf720c666bd5989874e3f9a161a8f66fd5470db Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 17:24:23 -0800 Subject: [PATCH 10/13] update to readthedocs --- docs/decoupler_api_doc.rst | 4 +--- docs/gmt_files_doc.rst | 6 +----- docs/smoothing_adatas_doc.rst | 5 +---- 3 files changed, 3 insertions(+), 12 deletions(-) diff --git a/docs/decoupler_api_doc.rst b/docs/decoupler_api_doc.rst index 92fbf36..2fb5bf0 100644 --- a/docs/decoupler_api_doc.rst +++ b/docs/decoupler_api_doc.rst @@ -11,9 +11,7 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq .. .. toctree:: :caption: Table of Contents - :maxdepth: 2 - - ___Usage + :maxdepth: 0 `**Notebook using gmt files** `_ diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index 8fd3030..302ef67 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -11,11 +11,7 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq .. .. toctree:: :caption: Table of Contents - :maxdepth: 2 - - ___Usage - ___Scoring Functions - ___Parameters + :maxdepth: 0 `**Notebook using gmt files** `_ diff --git a/docs/smoothing_adatas_doc.rst b/docs/smoothing_adatas_doc.rst index dfc730c..9e08930 100644 --- a/docs/smoothing_adatas_doc.rst +++ b/docs/smoothing_adatas_doc.rst @@ -11,10 +11,7 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq .. .. toctree:: :caption: Table of Contents - :maxdepth: 2 - - __Usage - + :maxdepth: 0 `**Notebook using gmt files** `_ From d2b1addc1ae9ebf81d89daa1f8175567835203a0 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 17:32:23 -0800 Subject: [PATCH 11/13] update to readthedocs --- docs/decoupler_api_doc.rst | 23 ++++++++++++----------- docs/gmt_files_doc.rst | 23 ++++++++++++----------- docs/smoothing_adatas_doc.rst | 14 +++++++------- 3 files changed, 31 insertions(+), 29 deletions(-) diff --git a/docs/decoupler_api_doc.rst b/docs/decoupler_api_doc.rst index 2fb5bf0..baaa7fd 100644 --- a/docs/decoupler_api_doc.rst +++ b/docs/decoupler_api_doc.rst @@ -4,14 +4,14 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. gssnng using the decoupler-omnipath api -================================== +======================================= Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). .. .. toctree:: :caption: Table of Contents - :maxdepth: 0 + :maxdepth: 1 `**Notebook using gmt files** `_ @@ -34,7 +34,7 @@ Some methods have additional parameters, see below! Installation -============ +------------ Install the package using the following commands:: @@ -46,7 +46,7 @@ Install the package using the following commands:: Example script -============== +-------------- Copy the script out from the cloned repo and run, check the paths if you get an error. @@ -57,7 +57,7 @@ Copy the script out from the cloned repo and run, check the paths if you get an python3.10 example_decoupler_omnipath_api.py Usage -====== +----- See gssnng/notebooks for examples on all methods. @@ -98,7 +98,7 @@ See gssnng/notebooks for examples on all methods. Scoring Functions -================= +----------------- The list of scoring functions:: @@ -126,7 +126,7 @@ The list of scoring functions:: Parameters -========== +---------- These parameters are used with the "scores_cells.with_gene_sets" function.:: @@ -166,7 +166,7 @@ These parameters are used with the "scores_cells.with_gene_sets" function.:: Groupby -======= +------- The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the @@ -180,12 +180,13 @@ neighborhood leads to more controlled smoothing of the count matrix and is more Gene sets -========= +--------- We are following the Omnipath wrapper APIs supplied by Decoupler. See: https://saezlab.github.io/decoupleR/reference/index.html#omnipath-wrappers for available gene sets. -## Method options +Method parameters +----------------- Some methods have some additional options. They are passed as a dictionary, method_params={param_name, param_value}.:: @@ -210,7 +211,7 @@ The ssGSEA method uses this parameter as a exponent to the ranks. It has been st summed_up References -========== +---------- rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index 302ef67..3939ba7 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -4,14 +4,14 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. gssnng using gmt files -================================== +====================== Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). .. .. toctree:: :caption: Table of Contents - :maxdepth: 0 + :maxdepth: 1 `**Notebook using gmt files** `_ @@ -35,7 +35,7 @@ Some methods have additional parameters, see below! Installation -============ +------------ Install the package using the following commands:: @@ -47,7 +47,7 @@ Install the package using the following commands:: Example script -============== +-------------- Copy the script out from the cloned repo and run, check the paths if you get an error. @@ -59,7 +59,7 @@ Copy the script out from the cloned repo and run, check the paths if you get an Usage -====== +----- See gssnng/notebooks for examples on all methods. @@ -88,7 +88,7 @@ See gssnng/notebooks for examples on all methods. sc.pl.umap(q, color=['louvain','T.cells.CD8.up'], wspace=0.35) Scoring Functions -================= +----------------- The list of scoring functions: @@ -112,7 +112,7 @@ The list of scoring functions: Parameters -========== +---------- These parameters are used with the "scores_cells.with_gene_sets" function.:: @@ -147,7 +147,7 @@ These parameters are used with the "scores_cells.with_gene_sets" function.:: Groupby -======= +------- The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the @@ -161,7 +161,7 @@ neighborhood leads to more controlled smoothing of the count matrix and is more Gene sets -========= +--------- We are following the MSigDB nomenclature, where gene sets default to up, but can have direction specified with the suffix "_UP" (example: CD8_signature_UP or CD8.signature.up). If the gene set name has suffix "_DN" (example: CD8_signature_DN or @@ -169,7 +169,8 @@ CD8.signature.dn), then low expressed genes will have large ranks and produce po In the use of singscore or Z scores, the undirected case is based on absolute values, so either direction, in the extreme, will result in a large score. -## Method options +Method parameters +----------------- Some methods have some additional options. They are passed as a dictionary, method_params={param_name, param_value}.:: @@ -194,7 +195,7 @@ The ssGSEA method uses this parameter as a exponent to the ranks. It has been st summed_up References -========== +---------- rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf diff --git a/docs/smoothing_adatas_doc.rst b/docs/smoothing_adatas_doc.rst index 9e08930..e483b8b 100644 --- a/docs/smoothing_adatas_doc.rst +++ b/docs/smoothing_adatas_doc.rst @@ -11,7 +11,7 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq .. .. toctree:: :caption: Table of Contents - :maxdepth: 0 + :maxdepth: 1 `**Notebook using gmt files** `_ @@ -28,7 +28,7 @@ For creating groups, up to four categorical variables can be used, which are fou Installation -============ +------------ Install the package using the following commands:: @@ -40,7 +40,7 @@ Install the package using the following commands:: Example script -============== +-------------- Copy the script out from the cloned repo and run, check the paths if you get an error. @@ -52,7 +52,7 @@ Copy the script out from the cloned repo and run, check the paths if you get an Usage -====== +----- See gssnng/notebooks for examples on all methods. @@ -77,7 +77,7 @@ See gssnng/notebooks for examples on all methods. Parameters -========== +---------- These parameters are used with the "scores_cells.with_gene_sets" function.:: @@ -100,7 +100,7 @@ These parameters are used with the "scores_cells.with_gene_sets" function.:: Groupby -======= +------- The specific neighborhood for each cell can be controlled by using the groupby parameter. In the example above, by setting groupby='louvain', only cells within a louvain cluster will be considered as being part of the @@ -114,7 +114,7 @@ neighborhood leads to more controlled smoothing of the count matrix and is more References -========== +---------- rank biased overlap: https://arxiv.org/pdf/1408.3587.pdf From 9b5f5f5ef0535e8a795656f61e9851ca02c56605 Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 17:55:16 -0800 Subject: [PATCH 12/13] update to readthedocs formatting --- docs/decoupler_api_doc.rst | 27 +++++++++++---------------- docs/gmt_files_doc.rst | 29 ++++++++++++----------------- docs/index.rst | 10 ++++------ docs/smoothing_adatas_doc.rst | 12 ++---------- 4 files changed, 29 insertions(+), 49 deletions(-) diff --git a/docs/decoupler_api_doc.rst b/docs/decoupler_api_doc.rst index baaa7fd..d690e1b 100644 --- a/docs/decoupler_api_doc.rst +++ b/docs/decoupler_api_doc.rst @@ -3,8 +3,8 @@ sphinx-quickstart on Wed Apr 27 09:20:15 2022. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -gssnng using the decoupler-omnipath api -======================================= +The decoupler-omnipath api +========================== Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). @@ -13,13 +13,8 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :caption: Table of Contents :maxdepth: 1 - -`**Notebook using gmt files** `_ - `**Notebook using Decoupler/Omnipath style API** `_ -`**Notebook for creating smoothed count matrices**`_ - `**See the paper** `_ This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. @@ -102,23 +97,23 @@ Scoring Functions The list of scoring functions:: - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + **geneset_overlap**: For each geneset, number (or fraction) of genes expressed past a given threshold. - singscore: Normalised mean (median centered) ranks (requires ranked data) + **singscore**: Normalised mean (median centered) ranks (requires ranked data) - ssGSEA: Single sample GSEA based on ranked data. + **ssGSEA**: Single sample GSEA based on ranked data. - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + **rank_biased_overlap**: RBO, Weighted average of agreement between sorted ranks and gene set. - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + **robust_std**: Med(x-med / mad), median of robust standardized values (recommend unranked). - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + **mean_z**: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - average_score: Mean ranks or counts + **average_score**: Mean ranks or counts - median_score: Median of counts or ranks + **median_score**: Median of counts or ranks - summed_up: Sum up the ranks or counts. + **summed_up**: Sum up the ranks or counts. diff --git a/docs/gmt_files_doc.rst b/docs/gmt_files_doc.rst index 3939ba7..1234177 100644 --- a/docs/gmt_files_doc.rst +++ b/docs/gmt_files_doc.rst @@ -3,7 +3,7 @@ sphinx-quickstart on Wed Apr 27 09:20:15 2022. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -gssnng using gmt files +Genesets as .gmt files ====================== Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). @@ -16,10 +16,6 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq `**Notebook using gmt files** `_ -`**Notebook using Decoupler/Omnipath style API** `_ - -`**Notebook for creating smoothed count matrices**`_ - `**See the paper** `_ @@ -57,7 +53,6 @@ Copy the script out from the cloned repo and run, check the paths if you get an python3.10 example_gmt_input.py - Usage ----- @@ -90,25 +85,25 @@ See gssnng/notebooks for examples on all methods. Scoring Functions ----------------- -The list of scoring functions: +The list of scoring functions:: - geneset_overlap: For each geneset, number (or fraction) of genes expressed past a given threshold. + **geneset_overlap**: For each geneset, number (or fraction) of genes expressed past a given threshold. - singscore: Normalised mean (median centered) ranks (requires ranked data) + **singscore**: Normalised mean (median centered) ranks (requires ranked data) - ssGSEA: Single sample GSEA based on ranked data. + **ssGSEA**: Single sample GSEA based on ranked data. - rank_biased_overlap: RBO, Weighted average of agreement between sorted ranks and gene set. + **rank_biased_overlap**: RBO, Weighted average of agreement between sorted ranks and gene set. - robust_std: Med(x-med / mad), median of robust standardized values (recommend unranked). + **robust_std**: Med(x-med / mad), median of robust standardized values (recommend unranked). - mean_z: Mean( (x - mean)/stddv ), average z score. (recommend unranked). + **mean_z**: Mean( (x - mean)/stddv ), average z score. (recommend unranked). - average_score: Mean ranks or counts + **average_score**: Mean ranks or counts - median_score: Median of counts or ranks + **median_score**: Median of counts or ranks - summed_up: Sum up the ranks or counts. + **summed_up**: Sum up the ranks or counts. Parameters @@ -176,7 +171,7 @@ Some methods have some additional options. They are passed as a dictionary, meth singscore: {'normalization', 'theoretical'}, {'normalization', 'standard'} -The singscore manuscript describes the theoretical method of standarization which involves determining the theoretical max and minimum ranks for the given gene set.:: +The singscore manuscript describes the theoretical method of standardization which involves determining the theoretical max and minimum ranks for the given gene set.:: rank_biased_overlap: {'rbo_depth', n} (n: int) diff --git a/docs/index.rst b/docs/index.rst index a71ad5c..276216d 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,9 +13,9 @@ Contents -------- .. toctree:: - gssnng with gene set gmt files - gssnng with a decoupler/omnipath style - gssnng to smooth count matrices + Using gene set gmt files + The decoupler/omnipath api + Smooth count matrices The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. The GSSNNG method is based on using the nearest neighbor graph of cells for data smoothing. This essentially creates mini-pseudobulk expression profiles for each cell, which can be scored by using single sample gene set scoring methods often associated with bulk RNA-seq. @@ -23,9 +23,7 @@ Nearest neighbor graphs (NNG) are constructed based on user defined groups (see This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. Gene sets can be provided by using .gmt files or through the OmniPath API (see below). -Scoring functions work with ranked or unranked data ("your mileage may vary") - .. note:: - This project is under active development. Please consider using a named release. + This project is under active development. Please consider using a named release if you're concerned about reproducibility. diff --git a/docs/smoothing_adatas_doc.rst b/docs/smoothing_adatas_doc.rst index e483b8b..05cd34b 100644 --- a/docs/smoothing_adatas_doc.rst +++ b/docs/smoothing_adatas_doc.rst @@ -3,8 +3,8 @@ sphinx-quickstart on Wed Apr 27 09:20:15 2022. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -gssnng to make smoothed count matrices -====================================== +Smoothing count matrices +======================== Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq (scRNA-seq). @@ -13,16 +13,8 @@ Gene Set Scoring on the Nearest Neighbor Graph (gssnng) for Single Cell RNA-seq :caption: Table of Contents :maxdepth: 1 - -`**Notebook using gmt files** `_ - -`**Notebook using Decoupler/Omnipath style API** `_ - -`**Notebook for creating smoothed count matrices**`_ - `**See the paper** `_ - This package works with AnnData objects stored as h5ad files. Expression values are taken from adata.X. For creating groups, up to four categorical variables can be used, which are found in the adata.obs table. From 71f19e4ce4927d95c58654b3e96a22ac1e9e2a6f Mon Sep 17 00:00:00 2001 From: David L Gibbs Date: Fri, 9 Feb 2024 18:02:38 -0800 Subject: [PATCH 13/13] update to readthedocs formatting --- docs/index.rst | 2 +- docs/smoothing_adatas_doc.rst | 8 ++++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index 276216d..5f2caf2 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -14,7 +14,7 @@ Contents .. toctree:: Using gene set gmt files - The decoupler/omnipath api + Decoupler/OmniPath API Smooth count matrices The problem: The sparsity of scRNA-seq data creates a poor overlap with many gene sets, which in turn makes gene set scoring difficult. diff --git a/docs/smoothing_adatas_doc.rst b/docs/smoothing_adatas_doc.rst index 05cd34b..569b67d 100644 --- a/docs/smoothing_adatas_doc.rst +++ b/docs/smoothing_adatas_doc.rst @@ -52,7 +52,9 @@ See gssnng/notebooks for examples on all methods. 2. Get gene sets formatted as a .gmt file. (default is UP, also uses _UP, _DN, and split gene sets _UP+_DN), see below for more details. -3. Score cells, each gene set will show up as a column in adata.obs. +3. Smooth, each category defined by the groupby will be an AnnData. + +4. Returned list is a tuple of (AnnData, Groupby-Category) :: @@ -66,7 +68,9 @@ See gssnng/notebooks for examples on all methods. recompute_neighbors=11, # Rebuild nearest neighbor graph with groups, 0 turns off function cores=4) # Smoothed in parallel. - + # q_list is a list of tuples + for qi in q_list: + print(qi[1]) # the groupby-category names Parameters ----------