scNOVA : Single-Cell Nucleosome Occupancy and genetic Variation Analysis - summarized in a single Snakemake pipeline.
PART0. Prerequisite step: preparation of single-cell genetic information using scTRIP analysis (Sanders et al. 2020)
Mosaicatcher: https://github.com/friendsofstrandseq/mosaicatcher-pipeline
PART1. Read preprocessing
PART2. Read counting to generate single-cell genebody NO
PART3. Infer expressed genes of subclones
PART4. DE analysis of subclones (default and alternative mode)
PART5. (Optional) Infer single-cell TF motif accessibility using chromVAR (20210108 updated). By default, Roadmap Epigenomics DHS (enhancers) are used to define CREs
PART6. (Optional) Infer haplotype-resolved genebody NO (20210108 updated)
Main output
- Single-cell level NO table :
result/{SAMPLE}_sort_geneid.txt
- Inferred expression probability for each clone :
result_CNN/DNN_train80_output_ypred_clone1_annot.txt, result_CNN/DNN_train80_output_ypred_clone2_annot.txt
- Inferred differential expression table :
result/Result_scNOVA_infer_expression_table.txt, result/result_PLSDA_{SAMPLE}.txt
- (20220121 updated) Heatmap and UMAP visualization of significant hits :
result_plots/Result_scNOVA_plots_{SAMPLE}.pdf, result_plots/Result_scNOVA_plots_{SAMPLE}_alternative_PLSDA.pdf
- (Optional) Single-cell level TF motif deviation z-score :
result/motif_dev_zscore_chromVAR_DHS_2kb_Enh_{SAMPLE}.txt
- (Optional) Haplotype-resolved NO of genebody and CREs :
result_haplo/Deeptool_Genebody_H1H2_sort.txt, result_haplo/Deeptool_DHS_2kb_H1H2_sort.txt
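The `{SAMPLE}` placeholder in the paths above stands for the project name. As a minimal sketch, the expected output paths for one project can be expanded as follows. The function name `expected_outputs` is hypothetical and not part of scNOVA, and the list assumes exactly two clones, as in the file names shown above.

```python
# Path templates copied from the "Main output" list; {SAMPLE} is the project name.
# Assumption: two clones (clone1, clone2), as in the file names listed above.
MAIN_OUTPUTS = [
    "result/{SAMPLE}_sort_geneid.txt",
    "result_CNN/DNN_train80_output_ypred_clone1_annot.txt",
    "result_CNN/DNN_train80_output_ypred_clone2_annot.txt",
    "result/Result_scNOVA_infer_expression_table.txt",
    "result/result_PLSDA_{SAMPLE}.txt",
    "result_plots/Result_scNOVA_plots_{SAMPLE}.pdf",
    "result_plots/Result_scNOVA_plots_{SAMPLE}_alternative_PLSDA.pdf",
]

# Outputs that are only listed for the optional parts (PART5 and PART6).
OPTIONAL_OUTPUTS = [
    "result/motif_dev_zscore_chromVAR_DHS_2kb_Enh_{SAMPLE}.txt",
    "result_haplo/Deeptool_Genebody_H1H2_sort.txt",
    "result_haplo/Deeptool_DHS_2kb_H1H2_sort.txt",
]


def expected_outputs(sample, include_optional=False):
    """Return the output paths (relative to the pipeline directory) for one project."""
    templates = MAIN_OUTPUTS + (OPTIONAL_OUTPUTS if include_optional else [])
    return [template.replace("{SAMPLE}", sample) for template in templates]
```

For example, `expected_outputs("my_project")` starts with `result/my_project_sort_geneid.txt`, where `my_project` is a placeholder name.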
This workflow is meant to be run on a Unix-based operating system.
As of now, Unix-based tools and Python packages are expected to be available via modules. R packages are automatically downloaded and used via conda.
- unix based tools : SAMtools/1.3.1-foss-2016b, biobambam2/2.0.76-foss-2016b, deeptools/2.5.1-foss-2016b-Python-2.7.12
- python packages : cuDNN, CUDA, TensorFlow, scikit-learn, matplotlib
- R packages: handled via conda install. You can choose not to use conda, in which case you will need the following packages: DESeq2, matrixStats, pheatmap, gplots, umap, Rtsne, factoextra, pracma, chromVAR, nabor, motifmatchr
1. Download this pipeline
- git lfs install
- git clone https://github.com/jeongdo801/scNOVA.git
- install dependencies (see further below)
2. Preparation of input files
- Add your single-cell bam and index files (input_bam/*.bam)
- Add key result files from mosaicatcher output in the input_user folder
- input_user/simpleCalls_llr4_poppriorsTRUE_haplotagsFALSE_gtcutoff0.05_regfactor6_filterTRUE.txt
- input_user/strandphaser_output.txt
- Add the subclonality information (input_user/input_subclonality.txt)
- (Optional) Add the list of genes within copy-number-changed regions, which will be masked in the inferred differential expression result (input_user/input_SV_affected_genes.txt). If this file is not provided, no genes are masked.
3. Change the project name in the Snakefile
4. Set up dependencies (--use-conda is recommended, as follows)
snakemake -j 4 --use-conda --conda-create-envs-only
5. Launch the run_pipeline.sh script
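Before launching, it can help to confirm that the input files described under "Preparation of input files" are in place. The following is a minimal sketch, not part of scNOVA: the function name `check_inputs` is hypothetical, and it assumes each BAM index is named either `<cell>.bam.bai` or `<cell>.bai`.

```python
from pathlib import Path

# Required files as listed under "Preparation of input files".
REQUIRED_INPUTS = [
    "input_user/simpleCalls_llr4_poppriorsTRUE_haplotagsFALSE_gtcutoff0.05_regfactor6_filterTRUE.txt",
    "input_user/strandphaser_output.txt",
    "input_user/input_subclonality.txt",
]


def check_inputs(pipeline_dir):
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    root = Path(pipeline_dir)
    problems = []

    bams = sorted((root / "input_bam").glob("*.bam"))
    if not bams:
        problems.append("no BAM files found in input_bam/")
    for bam in bams:
        # Assumption: the index sits next to the BAM as cell.bam.bai or cell.bai.
        has_index = Path(str(bam) + ".bai").is_file() or bam.with_suffix(".bai").is_file()
        if not has_index:
            problems.append(f"missing index for {bam.name}")

    for rel in REQUIRED_INPUTS:
        if not (root / rel).is_file():
            problems.append(f"missing required file {rel}")
    return problems
```

Calling `check_inputs(".")` from the pipeline directory returns an empty list when everything is present. The file input_user/input_SV_affected_genes.txt is not checked because it may be omitted.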
This tool reduces the burden of installing R dependencies and their correct versions by using conda. After following the installation and setup (steps 1-3), use snakemake -j 4 --use-conda --conda-create-envs-only to prepare the conda environment.
Note 1: If you do not have a lot of space in your HOME directory (i.e. ~), make sure conda installs packages into a directory where you have more space. To do so, open ~/.condarc (which might already have content if you have used conda before) and add the following lines to it:
pkgs_dirs:
- <path/to/directory_with_more_space>/envs/pkgs/
envs_dirs:
- <path/to/directory_with_more_space>/envs/
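To decide whether Note 1 applies, one option is to look at the free space under the home directory and then generate the lines to add to ~/.condarc. This is only a sketch: the helper names `free_gib` and `condarc_snippet` are hypothetical, and the directory passed in is whatever location you choose.

```python
import shutil


def free_gib(path):
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def condarc_snippet(base_dir):
    """Build the pkgs_dirs/envs_dirs lines from Note 1 for a chosen base directory."""
    base = str(base_dir).rstrip("/")
    return (
        "pkgs_dirs:\n"
        f"  - {base}/envs/pkgs/\n"
        "envs_dirs:\n"
        f"  - {base}/envs/\n"
    )
```

For instance, `print(free_gib(Path.home()))` (with `from pathlib import Path`) shows the free space in the home directory, and `print(condarc_snippet("<path/to/directory_with_more_space>"))` prints the block to paste into ~/.condarc. The sketch does not modify ~/.condarc itself, so existing content is left untouched.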
Note 2: If you want to install all dependencies manually and not use conda, simply remove --use-conda from run_pipeline.sh, which will make snakemake ignore all conda statements.
A note on running hundreds of jobs at the same time (cluster environment):
If running many concurrent jobs, a rare race condition can occur in which two R environments are activated at the same time. During activation, a file is deleted for a short amount of time, which causes an error if a separate activation takes place at the same moment. The issue has been reported, and it can be worked around by applying a proposed fix, which unfortunately has not yet been merged into any branch.
The error that gets thrown looks like this: /path/to/pipeline/.snakemake/conda/<hash>/lib/R/bin/R: line 248: /path/to/pipeline/.snakemake/conda/<hash>/lib/R/etc/ldpaths: No such file or directory
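To tell this race condition apart from other failures on a cluster, job logs can be scanned for the error signature shown above. The sketch below only detects the message; it is not the fix referred to in the text, and the function name `hit_ldpaths_race` is hypothetical.

```python
import re

# Pattern derived from the example error message; the <hash> directory name
# differs per conda environment, so it is matched as any single path component.
LDPATHS_ERROR = re.compile(
    r"\.snakemake/conda/[^/\s]+/lib/R/etc/ldpaths: No such file or directory"
)


def hit_ldpaths_race(log_text):
    """Return True if the log text contains the ldpaths activation error."""
    return LDPATHS_ERROR.search(log_text) is not None
```

Since the error is described as a rare timing issue, jobs flagged this way are plausible candidates for simply being rerun, although that is an assumption rather than a documented recommendation.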
By default, to generate the CNN features, scNOVA performs library size normalization followed by local copy number normalization.
However, for samples with more dramatic karyotypic changes (e.g. changes in ploidy status), we provide an option to apply copy number normalization before library size normalization. To do so, change the following two lines in Snake.config.json:
{
...
"count_norm" : "utils/count_norm_ploidy.R",
...
"combine_features" : "utils/combine_features_ploidy.R",
...
}
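The same change can be made programmatically. The following is a minimal sketch, assuming Snake.config.json is a plain JSON object that contains the two keys shown above; the function name `enable_ploidy_normalization` is hypothetical and not part of scNOVA.

```python
import json
from pathlib import Path

# The two settings from the snippet above.
PLOIDY_SETTINGS = {
    "count_norm": "utils/count_norm_ploidy.R",
    "combine_features": "utils/combine_features_ploidy.R",
}


def enable_ploidy_normalization(config_path="Snake.config.json"):
    """Point both settings at the ploidy-aware scripts; return the previous values."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Remember the old values so the default mode can be restored by hand.
    previous = {key: config.get(key) for key in PLOIDY_SETTINGS}
    config.update(PLOIDY_SETTINGS)
    # Note: this rewrites the file with 4-space indentation; all other keys are kept.
    path.write_text(json.dumps(config, indent=4) + "\n")
    return previous
```

The returned dictionary holds the earlier values of the two keys, which is useful for switching back to the default normalization order later.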
For detailed information on scNOVA see
Jeong and Grimes et al., 2022 (https://www.nature.com/articles/s41587-022-01551-4)
For information on scTRIP see
Sanders et al., 2020 (doi: https://doi.org/10.1038/s41587-019-0366-x)