CELL-type Expression-specific integration for Complex Traits (CELLECT) is a computational toolkit for identifying likely etiologic cell-types underlying complex traits. CELLECT leverages existing genetic prioritization models to integrate single-cell transcriptomic and human genetic data when identifying likely etiologic cell-types.
CELLECT quantifies the association between common polygenic GWAS signal (heritability) and cell-type expression specificity (ES) of genes using established genetic prioritization models such as S-LDSC (Finucane et al., 2015) and MAGMA covariate analysis (Skene et al., 2018). The output of CELLECT is a list of prioritized etiologic cell-types for a given human complex disease or trait.
CELLECT takes as input GWAS data and cell-type expression specificity estimates. To compute robust estimates of ES, we developed a computational method called CELLEX (CELL-type EXpression-specificity). CELLEX is built on the observation that different ES metrics provide complementary cell-type expression-specific profiles. Our method incorporates a ‘wisdom of the crowd’ approach by integrating multiple ES metrics to obtain improved robustness and a more expressive ES measure that captures multiple aspects of expression specificity.
Figure legend: conceptual illustration of CELLECT and CELLEX. The bottom layer shows a disease or trait with multiple genetic components (G1-G4). CELLECT integrates disease heritability estimates with cell-type expression specificity to identify the etiologic cell-types (T1 and T4) underlying the genetic components (G1 and G4). CELLEX estimates expression specificity from single-cell transcriptomic data.
See the official CELLECT release history and the CHANGELOG for details.
To update to the latest version of CELLECT:
git pull # get the latest version from github
git submodule update --init --recursive # get the latest version of the ldsc submodule*
* Updating submodules can be problematic and depends on your git version. If you have issues, please refer to Stack Overflow, and contact us if your problem persists.
Step 1: Install git lfs
We use git lfs to store the CELLECT data files on GitHub. To download the files, you need to have git lfs set up before you clone the repository.
On OSX: brew install git-lfs; git lfs install
On Ubuntu: sudo apt-get install git-lfs; git lfs install
For other operating systems, follow this guide.
Step 2: Clone CELLECT repository
Before you clone: check that you've installed git lfs by running git lfs env. If you get a message that says 'lfs' is not a git command, git lfs is not installed properly. If git lfs env does not produce the output you expect, consult troubleshooting git lfs.
Clone the repository:
git clone --recurse-submodules https://github.com/perslab/CELLECT.git
The --recurse-submodules flag is needed to clone the git submodule 'ldsc' (pascaltimshel/ldsc), which is a modified version of the original ldsc repository.
(Cloning the repo might take a few minutes, as the CELLECT data files (> 1-3 GB) will be downloaded. To skip downloading the data files, use GIT_LFS_SKIP_SMUDGE=1 git clone --recurse-submodules https://github.com/perslab/CELLECT.git instead.)
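The check described in Step 2 can be scripted. The following is a minimal sketch, assuming a POSIX shell; check_tool is a hypothetical helper name introduced here for illustration and is not part of CELLECT.

```shell
#!/bin/sh
# check_tool (hypothetical helper): run a command quietly and report
# only whether it succeeded.
check_tool() {
    "$@" >/dev/null 2>&1
}

# `git lfs env` fails when git does not know the lfs subcommand.
if check_tool git lfs env; then
    echo "git lfs is available; you can clone CELLECT"
else
    echo "git lfs is not available; install it before cloning" >&2
fi
```

If you cloned with GIT_LFS_SKIP_SMUDGE=1, the data files can be fetched later by running git lfs pull inside the repository.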
Step 3: Install Snakemake via conda
CELLECT uses the workflow management software Snakemake. To make things easier for you, the CELLECT Snakemake workflow uses conda environments to avoid issues with software dependencies and versioning, and it installs all necessary dependencies automatically. All you need to do is install anaconda or miniconda (if conda is not already present on your system) and then install snakemake:
conda install -c bioconda -c conda-forge snakemake">=5.27.4"
(Note the version requirement for snakemake; it ensures that snakemake runs as fast as possible.) If you have trouble installing snakemake with the above command, try:
conda install -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
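To confirm that the installed snakemake meets the version requirement above, a comparison along the following lines may help. This is a sketch, not part of CELLECT: version_ge is a hypothetical helper, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
#!/bin/sh
# version_ge (hypothetical helper): succeed when version $1 >= version $2.
# Relies on `sort -V`, which orders 5.9.0 before 5.27.4.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="5.27.4"
if command -v snakemake >/dev/null 2>&1; then
    installed=$(snakemake --version)
    if version_ge "$installed" "$required"; then
        echo "snakemake $installed satisfies >= $required"
    else
        echo "snakemake $installed is older than $required" >&2
    fi
else
    echo "snakemake not found on PATH; activate the conda environment first" >&2
fi
```

If snakemake was installed into a separate environment with the mamba command above, activate it first (conda activate snakemake) so that the check sees the right installation.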
A configuration file is provided and includes paths to example data that require additional downloads and pre-processing. In order to run the example, please follow the CELLECT LDSC Tutorial or CELLECT MAGMA Tutorial.
- Modify the config.yml file: specify the input GWAS summary stats and CELLEX cell-type expression specificity. These must be in the correct format; see the aforementioned tutorials for examples.
- Run the workflow:
CELLECT-LDSC:
snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile config.yml
or CELLECT-MAGMA:
snakemake --use-conda -j -s cellect-magma.snakefile --configfile config.yml
We recommend running with -j as it will use all available cores. Specifying -j 4 will use up to 4 cores.
- Inspect the output: <BASE_OUTPUT_DIR>/<CELLECT-{LDSC,MAGMA}>/results/prioritization.csv gives you the cell-type prioritization results. You can plot the .csv file to visualize the prioritized cell-types.
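For a quick look at the results before plotting, the rows can be ranked on the command line. The sketch below makes several assumptions that should be checked against your own output: that prioritization.csv is comma-separated with a header row and no quoted commas, and that the p-value column is named pvalue (pass a different name as the third argument if not). top_cell_types is a hypothetical helper, and sort -g (general numeric sort, needed for values such as 1e-6) is not available in every sort implementation.

```shell
#!/bin/sh
# top_cell_types (hypothetical helper): print the header plus the N rows
# with the smallest value in the chosen column.
# Usage: top_cell_types FILE [N] [COLUMN_NAME]
top_cell_types() {
    file=$1
    n=${2:-10}
    col=${3:-pvalue}   # assumed column name; adjust to your file

    # Find the 1-based position of the column in the header row.
    idx=$(head -n 1 "$file" | tr ',' '\n' | grep -n -x "$col" | cut -d: -f1)
    if [ -z "$idx" ]; then
        echo "column '$col' not found in $file" >&2
        return 1
    fi

    head -n 1 "$file"
    # Sort data rows ascending on that column and keep the first N.
    tail -n +2 "$file" | sort -t, -k"$idx","$idx"g | head -n "$n"
}
```

For example, top_cell_types <BASE_OUTPUT_DIR>/CELLECT-LDSC/results/prioritization.csv 5 would list the five rows with the smallest p-values, under the assumptions above.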
See our Github wiki for the CELLECT-LDSC tutorial.
See our Github wiki for the CELLECT-MAGMA tutorial.
Please see our Github wiki for full documentation of CELLECT. The Appendix in Timshel (eLife, 2020): Genetic mapping of etiologic brain cell types for obesity also contains relevant information on the methodology.
We gratefully acknowledge the developers of the genetic prioritization tools used in CELLECT: LDSC and MAGMA. In particular, we thank Christiaan de Leeuw and Steven Gazal for their generous support.
- Pascal Nordgren Timshel (University of Copenhagen) @ptimshel
- Tobi Alegbe (University of Cambridge) @tobionformatics
- Ben Nielsen (University of Copenhagen)
- Liubov Pashkova (University of Copenhagen) @incorrigiblema3
- Jon Thompson
Please create an issue on the github repo if you encounter any problems using CELLECT. Alternatively, you may write an email to timshel(at)sund.ku.dk
If you find CELLECT useful for your research, please consider citing the paper:
Timshel (eLife, 2020): Genetic mapping of etiologic brain cell types for obesity