Thomas A. Sasani, David G. Ashbrook, Annabel C. Beichman, Lu Lu, Abraham A. Palmer, Robert W. Williams, Jonathan K. Pritchard, Kelley Harris
🐭 --> 🧬 --> 📊
We've made a Dash app that enables interactive exploration of some results from the manuscript. Check it out!
The code in this repository uses Snakemake to reproduce the entire manuscript from "top to bottom." This includes everything from downloading a reference genome to generating supplementary figures. However, it's also possible to simply generate the figures in the manuscript (step #2 below), since step #1 requires very large input files and quite a bit of time to execute.
This code is archived on Zenodo.
The basic outline is as follows:
1. Identify high-quality singleton mutations from the BXD VCF using `identify_singletons.smk`.
2. Annotate singleton mutations and generate figures using `generate_figures.smk`.
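As a minimal overview of the two steps, assuming the dependencies and `conda` environments described later in this README are already set up (the full commands, including environment creation, appear in the usage sections below):

```bash
# step 1: identify high-quality singletons from the BXD VCF
# (best run on an HPC system; see the important note below)
snakemake -j 20 -s identify_singletons.smk

# step 2: annotate singletons and generate all manuscript figures
snakemake -j 1 -s generate_figures.smk
```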
If you just want to generate all of the manuscript figures, skip to the section entitled Usage for generating manuscript figures using precomputed data.
IMPORTANT NOTE:
I highly recommend that the `identify_singletons.smk` pipeline be run on a high-performance computing system or cluster. The `identify_singletons.smk` pipeline assumes you've downloaded the BXD VCF (~50 Gb), and the many individual steps in the pipeline include downloading the mm10 reference genome (~1 Gb) and many Gbs of phastCons scores.
Depending on the particular steps of the pipeline you want to run, some of these steps/downloads can be avoided by editing the `identify_singletons.smk` pipeline directly. For example, if you don't want to get singleton data from the wild mouse genomes, simply comment out those output files in `rule all` (see the sketch below). Or, if you already have a copy of the mm10 reference, just add it to the `data/ref` directory.
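The sketch below is purely illustrative: the real `rule all` in `identify_singletons.smk` will list different target file names, but the idea of skipping outputs by commenting them out is the same.

```python
# hypothetical sketch only; the actual target paths in identify_singletons.smk differ
rule all:
    input:
        "csv/annotated_singletons.csv",
        # comment out any targets you don't need, e.g. the wild mouse singleton data:
        # "csv/wild_mouse_singletons.csv",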
- Dependencies
- Directory structure
- Usage for generating manuscript figures using precomputed data
- Usage for generating all raw data
- How to generate raw data on an HPC
- Running tests
Make sure that these are installed and/or in your system `$PATH`! Versions in parentheses are the ones I used at the time of manuscript posting. I haven't experimented with other versions, so YMMV.
- conda (v4.9.2)
  - these days, I'd recommend using `mamba` for building the `python` environments described below
  - assuming you have `conda` installed, use `conda install mamba -n base -c conda-forge` to install `mamba`, and then you can use `mamba` as a 1-to-1 replacement for `conda` in the instructions for building `python` environments provided later in the README
- bedtools (v2.29.2)
- bedops (v2.4.38)
  - technically, you'll only need `wig2bed`
- tabix (v1.10.2-125-g4162046)
- mutyper (v0.5.0)
- bcftools (v1.12)
- bgzip (v1.12)
- perl (v5.32.1)
- pal2nal (v14)
- SigProfilerExtractor (v1.1.3)
- codeml
  - `codeml` is used by `ete3` for the PAML selection scans. Rather than installing `codeml` directly, I recommend using `ete3 upgrade-external-tools` to install it, followed by copying the `codeml` binary from `/Users/your_username/.etetoolkit/ext_apps-latest/bin/codeml` (the default installation directory for `ete3 upgrade-external-tools`) into `/Users/your_username/opt/anaconda3/envs/figure_generation/bin/` (i.e., the `bin` directory of the `conda` installation directory created after initializing an environment below). A command-line sketch of these steps follows this list.
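For convenience, here is a command-line sketch of the `codeml` setup described above, assuming the `figure_generation` environment has already been created; the `anaconda3` path is an assumption and will differ depending on where `conda` lives on your machine.

```bash
# install ete3's bundled external tools, which include codeml
ete3 upgrade-external-tools

# copy the codeml binary into the figure_generation environment's bin directory
# (adjust the .etetoolkit and anaconda3/miniconda3 paths to match your setup)
cp /Users/your_username/.etetoolkit/ext_apps-latest/bin/codeml \
   /Users/your_username/opt/anaconda3/envs/figure_generation/bin/
```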
All other `python` and `R` dependencies will be handled by `conda` before executing a pipeline.
.
|__rules # individual `snakemake` "rule files" that are imported by the main pipelines
|__py_scripts # all of the `python` scripts called by `snakemake` rules
|__R_scripts # all of the `R` scripts called by `snakemake` rules
|__Rqtl_data # data used by R/qtl2 for QTL mapping
|__data # raw data files output by the `identify_singletons.smk` pipeline, plus third-party datasets
|__tests # `pytest` tests for various utility functions
|__figure_generation.yaml # YAML file containing all of the dependencies required to generate figures
|__singleton_calling.yaml # YAML file containing all of the dependencies required to call singletons
|__generate_figures.smk # main `snakemake` pipeline that generates main and supplementary figures
|__identify_singletons.smk # main `snakemake` pipeline that identifies singletons using the BXD VCF
|__singleton_calling_hpc_config.json # JSON config file template for running pipelines on a cluster like SGE
If you'd rather not spend the time (and compute) needed to generate all of the singleton data from scratch, I've included the output of `identify_singletons.smk` in the `data/` directory. The command-line instructions below will reproduce all but one or two figures in the manuscript (like Supp. Fig. 1), which were created by hand in Illustrator.
- Annotate singletons and fixed variants with various metadata.
- Construct "tidy" dataframes containing summary information about mutation rates and spectra in each BXD.
- Perform QTL scans for phenotypes related to the mutation rate.
- Make plots.
In a number of analyses, we compare mutation spectra between the BXD singletons and previously published datasets. I've included the files below in the `data/` directory, but here are instructions for downloading if necessary:
- singleton data from Dumont 2019
  - download ZIP file from Supplementary data associated with the manuscript
  - uncompress file and move `SuppTables_concat.xlsx` into the `data` directory of this repository
- de novo germline mutation data from Ohno et al. 2014
  - download Supplementary Data 1
  - move `41598_2014_BFsrep04689_MOESM2_ESM.xls` into the `data` directory of this repository
- mutation signatures from COSMIC
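As a quick sketch of where the first two downloads should end up, run something like the following from the top level of this repository after cloning it; the `~/Downloads` paths are assumptions about where your browser saved the files.

```bash
# move the Dumont 2019 and Ohno et al. 2014 supplementary files into data/
# (adjust the source paths to wherever the files were downloaded)
mv ~/Downloads/SuppTables_concat.xlsx data/
mv ~/Downloads/41598_2014_BFsrep04689_MOESM2_ESM.xls data/
```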
# enter a directory of your choice
cd $WORKDIR
# clone the BXD analysis GitHub repository
git clone https://github.com/harrispopgen/bxd_mutator_manuscript.git
# enter the directory
cd bxd_mutator_manuscript
# create a new conda environment with all of the dependencies
# required for running the pipeline
conda env create --name figure_generation --file figure_generation.yaml
# activate the new environment
conda activate figure_generation
# run the pipeline
# (-j = number of simultaneous jobs to run;
#  -s = name of the snakemake pipeline)
snakemake -j 1 -s generate_figures.smk
This will produce plots for every main and supplementary figure in the manuscript, which will be added to a new `plots/` directory.
- Download mm10 reference assembly.
- Download phastCons 60-way placental mammal WIG files for each chromosome, and convert to BED format.
- Identify singletons in each BXD.
- Identify "fixed" variants in D2 that we'll use for conservation score comparisons.
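The WIG-to-BED conversion in the second step uses `wig2bed` from `bedops`. A rough sketch of that conversion for a single chromosome is shown below; the file names are assumptions, and the actual commands live in the `identify_singletons.smk` rules.

```bash
# convert a downloaded phastCons WIG track to BED using bedops' wig2bed
# (file names are placeholders; use the files the pipeline actually downloads)
gunzip -c chr19.phastCons60way.wigFix.gz | wig2bed > chr19.phastCons60way.bed
```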
To generate all of the raw data in the manuscript, it's highly recommended that you run the following on a machine with the ability to run many simultaneous processes, and with at least 20 Gb of free space in the working directory.
The pipeline below also assumes that you've downloaded the BXD VCF (about 50 Gb). The path to this VCF must be edited in the Snakemake file.
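Concretely, the variables at the top of `identify_singletons.smk` need to point at your working directory and your local copy of the BXD VCF before you run anything. A hypothetical sketch of that edit is below; the `WORKDIR` variable is mentioned in the pipeline comments, but the VCF variable name and example paths here are assumptions, so check the actual Snakefile.

```python
# top of identify_singletons.smk (example values only)
WORKDIR = "/path/to/your/working/directory"      # absolute path to the directory you cloned into
BXD_VCF = WORKDIR + "/data/vcf/bxd.snps.vcf.gz"  # path to the ~50 Gb BXD VCF you downloaded
```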
# enter a directory of your choice (make sure
# you have at least 20 Gb of free space here)
cd $WORKDIR
# clone the BXD analysis GitHub repository
git clone https://github.com/harrispopgen/bxd_mutator_manuscript.git
# enter the directory
cd bxd_mutator_manuscript
# create a new conda environment with all of the dependencies
# required for running the pipeline
conda env create --name singleton_calling --file singleton_calling.yaml
# activate the new environment
conda activate singleton_calling
# install mutyper
pip install mutyper==0.5.0
# ---
# before running the pipeline, use your text editor of choice
# and make sure that the WORKDIR variable at the top of
# `identify_singletons.smk` is the absolute path to the directory
# that you're currently in, and that all of
# the other paths at the top of the pipeline point to the right
# binaries and directories on your machine.
# ---
# run the pipeline
# (-j = number of simultaneous jobs to run, change if you want;
#  -s = name of the snakemake pipeline)
snakemake -j 20 -s identify_singletons.smk
The Department of Genome Sciences at UW uses the Sun Grid Engine (SGE) for job management on the cluster. As an example, I've included a config file that can be used in conjunction with the following command to run the `identify_singletons.smk` pipeline on a system with SGE. Just edit the various parameters in the file accordingly. If you use a different job management system (e.g., SLURM), there is documentation on how to submit jobs using `snakemake` here.
# run the pipeline using SGE
snakemake \
-j 20 \
-s identify_singletons.smk \
--cluster-config singleton_calling_hpc_config.json \
--cluster "qsub -l centos={cluster.os} \
-l mfree={cluster.memory} \
-l h_rt={cluster.time} \
-l s_rt={cluster.time} \
-o {cluster.erroutdir} \
-e {cluster.erroutdir}" \
--latency-wait 30 \
--rerun-incomplete
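The `qsub` string above pulls its resource requests from `singleton_calling_hpc_config.json` via the `{cluster.os}`, `{cluster.memory}`, `{cluster.time}`, and `{cluster.erroutdir}` placeholders. A hedged sketch of what such a cluster config can look like is below; the values are placeholders, and the actual template in the repository may use different defaults or per-rule entries.

```json
{
    "__default__": {
        "os": "7",
        "memory": "8G",
        "time": "24:0:0",
        "erroutdir": "logs/"
    }
}
```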
After generating raw data, you can run the figure generation pipeline as described above.
There are a handful of `pytest` tests for various utility functions that get used by the singleton calling and figure generation pipelines.
To run these tests, you'll first need to install the repository as a package using `pip`. Make sure you're in the top level of this directory, then run:
pip install -e .
Then, you can simply run `pytest .`, and the tests will run.