QIIME2 workflow for 16S analysis, created with Snakemake.
- Ann-Kathrin Brüggemann (@AKBrueggemann)
- Thomas Battenfeld (@thomasbtf)
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).
If you want to add your own changes to the workflow, create a GitHub repository of your own, then clone this one.
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
If you just want to use this workflow locally, simply clone it or download it as a ZIP file.
Once the folder structure is on your local machine, please add a "data" folder manually.
Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and metadata.txt to specify your sample setup.
Some important parameters in config.yaml that you should check and set according to your own fastq files are the primers for the forward and reverse reads, the datatype that QIIME2 should use, and the min-seq-length. Depending on the sequencing, the length of the reads can vary.
The default parameters for filtering and truncation were validated with the help of a MOCK community and fitted to retrieve all bacteria from that community.
In addition, you need to fit the metadata parameters to your data. Please change the names of the metadata columns used in the config to match your own metadata.
If your metadata does not contain numeric values, please use the "reduced-analysis" option in the config file to run the workflow, as the full analysis currently cannot run on categorical metadata alone. We are going to fix that in the future.
The workflow can perform clustering and denoising either with vsearch, leading to OTU creation, or with DADA2, creating ASVs. You can decide which mode to use by setting the variable "DADA2" to True (DADA2 usage) or False (vsearch).
Please make sure that the names of your fastq files are correctly formatted. They should look like this:
samplename_SNumber_Lane_R1/R2_001.fastq.gz
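The naming scheme above can be checked before running the workflow. The following is a minimal sketch, not part of the workflow itself. It assumes the usual Illumina convention in which SNumber is "S" followed by digits and Lane is "L" followed by digits (e.g. S1, L001); the pattern and the function name check_fastq_name are illustrative assumptions.

```python
import re

# Assumed layout: <samplename>_S<digits>_L<digits>_R1|R2_001.fastq.gz
# (the S/L prefixes follow the common Illumina convention and are an assumption).
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<snumber>\d+)_L(?P<lane>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def check_fastq_name(filename):
    """Return the parsed parts of a fastq file name, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    return match.groupdict() if match else None
```

Calling check_fastq_name on every file in your "data" folder and looking for None results should reveal names that need to be fixed.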
Create a snakemake environment using mamba via:
mamba create -c conda-forge -c bioconda -n snakemake snakemake
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
conda activate snakemake
Fill in metadata.txt with the information for your samples.
Please be careful not to include spaces before or after the commas. If you have no information for a column, leave it empty and simply go on with the next column.
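Stray spaces next to commas are easy to miss by eye. The sketch below is one possible way to find them, assuming metadata.txt is plain comma-separated text; the function name and the column names in the example are hypothetical.

```python
import re

# Matches whitespace directly before or directly after a comma.
SPACE_NEAR_COMMA = re.compile(r"\s,|,\s")


def lines_with_spaces_near_commas(text):
    """Return the 1-based numbers of lines where a comma has whitespace next to it."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SPACE_NEAR_COMMA.search(line)
    ]


# Hypothetical metadata content; an empty column (",,") is fine, a space is not.
example = "sample-id,group,age\ns1,control,34\ns2, treated,41\ns3,,29"
print(lines_with_spaces_near_commas(example))  # [3]
```

To check your own file, pass it the contents of config/metadata.txt, for example via pathlib.Path("config/metadata.txt").read_text().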
Test your configuration by performing a dry-run via
snakemake --use-conda -n
Executing the workflow takes two steps, each using $N cores:
- Data preparation: snakemake --cores $N --use-conda data_prep
- Workflow execution: snakemake --cores $N --use-conda
After successful execution, the workflow provides you with a compressed folder holding all relevant results, ready to decompress or to download to your local machine. The compressed file 16S-report.tar.gz holds several QIIME2 artifacts that can be inspected via qiime-view. The zipped folder report.zip contains the Snakemake HTML report, which holds graphics, the DAG of the executed jobs, and HTML files that lead you directly to the QIIME2 results without the need to use qiime-view.
This report can, e.g., be forwarded to your collaborators.
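If you want to see what the compressed results contain before unpacking them, the archive can be listed with Python's standard library. This is a small sketch; the function name is illustrative, and the path you pass depends on where the workflow wrote 16S-report.tar.gz on your system.

```python
import tarfile


def list_archive(path):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(archive.getnames())
```

For example, list_archive("16S-report.tar.gz") returns the file names inside the archive without extracting anything.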
Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:
git commit -a
git push
Whenever you want to synchronize your workflow copy with new developments from upstream, do the following.
- Once, register the upstream repository in your local copy:
git remote add -f upstream git@github.com:snakemake-workflows/16S.git
or, if you have not set up SSH keys:
git remote add -f upstream https://github.com/snakemake-workflows/16S.git
- Update the upstream version:
git fetch upstream
- Create a diff with the current version:
git diff HEAD upstream/master workflow > upstream-changes.diff
- Investigate the changes:
vim upstream-changes.diff
- Apply the modified diff via:
git apply upstream-changes.diff
- Carefully check whether you need to update the config files:
git diff HEAD upstream/master config
If so, do it manually, and only where necessary, since you would otherwise likely overwrite your settings and samples.
In case you have also changed or added steps, please consider contributing them back to the original repository:
- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g.,
cp -r workflow path/to/fork
Make sure not to accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
Test cases are in the subfolder .test. They are automatically executed via continuous integration with GitHub Actions.
A list of the tools used in this pipeline:
Tool | Link |
---|---|
QIIME2 | www.doi.org/10.1038/s41587-019-0209-9 |
Snakemake | www.doi.org/10.12688/f1000research.29032.1 |
FastQC | www.bioinformatics.babraham.ac.uk/projects/fastqc |
MultiQC | www.doi.org/10.1093/bioinformatics/btw354 |
pandas | pandas.pydata.org |
kraken2 | www.doi.org/10.1186/s13059-019-1891-0 |
vsearch | www.github.com/torognes/vsearch |
DADA2 | www.doi.org/10.1038/nmeth.3869 |
songbird | www.doi.org/10.1038/s41467-019-10656-5 |
bowtie2 | www.doi.org/10.1038/nmeth.1923 |
Ancom | www.doi.org/10.3402/mehd.v26.27663 |