ww_minimal is a bioinformatics analysis pipeline used to perform the initial quality control and variant analysis on wastewater sequencing samples. This pipeline supports Illumina short reads prepared using the Nimagen primer scheme on various platforms (NovaSeq, NextSeq, MiSeq).
- Merge sequencing FASTQ files (pigz)
- Adapter trimming (fastp)
- Variant calling
  - Read alignment (bwa mem)
  - Sort and index alignments (Samtools)
  - Primer sequence removal (BAMClipper)
  - Genome-wide and amplicon coverage (mosdepth, Samtools ampliconstats)
  - Variant calling (freyja variants/demix; samples may fail at this step due to low coverage, in which case they are omitted from further analysis in the pipeline but are not excluded overall)
  - Extract WHO and Pango lineages (collate_results.py, collate_lineages.py)
  - Aggregate all sample outputs (xsv)
This pipeline uses conda for environment and package management (miniconda is recommended).
With [mini]conda installed:
git clone https://github.com/LooseLab/ww_nf_minimal
cd ww_nf_minimal
conda env create -f environment.yml
conda activate ww_minimal
nextflow run main.nf -profile test
After successfully running the test subset you can attempt to run on other samples. Read the input section for how to set up the FASTQ directory and sample sheet. Once these are done the pipeline can be run like so:
nextflow run /path/to/main.nf --readsdir <FASTQ INPUT DIRECTORY> --sample_sheet <SAMPLE SHEET CSV> -with-report report.html
If nextflow crashes while running, you can add the flag -resume to the previous command to check for cached jobs so the entire pipeline does not need to be re-run.
There are two required user-supplied inputs: the sample sheet and the FASTQ reads directory.
These can be supplied either by editing the nextflow.config file and adding the sample_sheet and readsdir attributes to the params scope, or on the command line using --sample_sheet and --readsdir.
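As a rough illustration of the command-line route, the sketch below assembles the invocation as an argument list that could be passed to `subprocess.run`. The helper name `build_nextflow_command` and the example paths are hypothetical and not part of the pipeline; only the flags (--readsdir, --sample_sheet, -with-report, -resume) come from this README.

```python
def build_nextflow_command(main_nf, readsdir, sample_sheet,
                           report="report.html", resume=False):
    """Assemble the nextflow invocation as a list of arguments.

    Hypothetical helper for illustration only; it does not run anything.
    """
    cmd = [
        "nextflow", "run", str(main_nf),
        "--readsdir", str(readsdir),
        "--sample_sheet", str(sample_sheet),
        "-with-report", report,
    ]
    if resume:
        # -resume asks nextflow to check for cached jobs from a previous run
        cmd.append("-resume")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute your own input directory and sample sheet
    print(" ".join(build_nextflow_command("main.nf", "input", "samples.csv")))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.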
In addition there are three static inputs that are provided with the workflow (these may change in the future as the primer scheme changes).
These are the reference genome, paired-end primer file, and amplicon primer file.
This pipeline expects FASTQ files to be structured inside an input directory with subfolders for each sequencing lab and then further subfolders for each run ID. For most labs the share directory can be used directly; however, samples from Exeter require symlinking. An example input directory structure can be seen below:
input
├── <LAB1>
│   ├── <RUN1>
│   │   ├── SAMPLE_R1_L002_001.fastq.gz
│   │   └── SAMPLE_R2_L002_001.fastq.gz
│   └── <RUN2>
│       ├── SAMPLE_R1_L002_001.fastq.gz
│       └── SAMPLE_R2_L002_001.fastq.gz
└── <LAB2>
    └── <RUN1>
        ├── SAMPLE_L001_R1_001.fastq.gz
        ├── SAMPLE_L001_R2_001.fastq.gz
        ├── ...
        ├── SAMPLE_L004_R1_001.fastq.gz
        └── SAMPLE_L004_R2_001.fastq.gz
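To check that an input directory follows the lab/run layout before launching the pipeline, something like the following sketch may help. The function name `list_fastqs` is hypothetical, and the sketch assumes every FASTQ file ends in `.fastq.gz` and sits exactly two levels below the input directory; it does not reproduce how the pipeline itself locates files.

```python
from pathlib import Path


def list_fastqs(readsdir):
    """Map (lab, run) to the FASTQ file names found in <readsdir>/<LAB>/<RUN>/.

    Hypothetical helper; assumes files are named *.fastq.gz.
    """
    found = {}
    for fastq in sorted(Path(readsdir).glob("*/*/*.fastq.gz")):
        run_dir = fastq.parent
        key = (run_dir.parent.name, run_dir.name)
        found.setdefault(key, []).append(fastq.name)
    return found


if __name__ == "__main__":
    # "input" is a placeholder for your own reads directory
    for (lab, run), files in list_fastqs("input").items():
        print(lab, run, len(files))
```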
The CSV sample sheet is required as this informs the pipeline which samples should be analysed. It currently requires 6 fields:
sample_id
sample_site_code
timestamp_sample_collected
sequencing_lab_code
sequencing_sample_id
sequencing_run_id
These are used to find the input FASTQ files in the readsdir.
All fields are passed through to the aggregation steps at the end of the pipeline.
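A sample sheet can be checked for the six required fields before a run. The sketch below is a minimal example that assumes the CSV has a header row using exactly the field names listed above; the function name `missing_fields` is hypothetical and this check is not part of the pipeline.

```python
import csv

# The six fields listed in this README
REQUIRED_FIELDS = [
    "sample_id",
    "sample_site_code",
    "timestamp_sample_collected",
    "sequencing_lab_code",
    "sequencing_sample_id",
    "sequencing_run_id",
]


def missing_fields(sample_sheet_path):
    """Return the required columns that are absent from the CSV header.

    Assumption: the first row of the sample sheet is a header row.
    """
    with open(sample_sheet_path, newline="") as handle:
        header = csv.DictReader(handle).fieldnames or []
    return [name for name in REQUIRED_FIELDS if name not in header]
```

An empty list means all required columns are present; extra columns are ignored by this check.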
Output files are written, by default, to a results directory in the location the pipeline is called from.
This folder is organised by sequencing lab and run, with one subdirectory for each step that emits files, like so:
results
├── aggregated.csv
├── all_lineages.csv
├── <LAB1>
│   ├── <RUN1>
│   │   ├── alignments
│   │   ├── ampliconstats
│   │   ├── bamclipper
│   │   ├── freyja
│   │   ├── mosdepth
│   │   ├── stats_csv
│   │   └── trimmed
│   └── <RUN2>
│       ├── alignments
│       ├── ampliconstats
│       ├── bamclipper
│       ├── freyja
│       ├── mosdepth
│       ├── stats_csv
│       └── trimmed
└── <LAB2>
    └── <RUN1>
        ├── alignments
        ├── ampliconstats
        ├── bamclipper
        ├── freyja
        ├── mosdepth
        ├── stats_csv
        └── trimmed
Outputs that are organised in directories under a <RUN ID> are the raw outputs from the steps in the pipeline summary.
The aggregated outputs are placed at the top level as these combine data from all of the sequencing labs and runs.
The aggregated.csv file aggregates the WHO lineages, their frequencies, and sequencing depths for all the samples that are able to complete analysis. As multiple lineages may be present, multiple rows can be returned for a single sample.
Column | Description |
---|---|
amplicon_mean | Mean coverage over all amplicons including zeros |
non_zero_amplicon_mean | Mean coverage over amplicons excluding zeros |
amplicon_median | Median coverage over all amplicons including zeros |
non_zero_amplicon_median | Median coverage over amplicons excluding zeros |
count_gte_20 | Count of amplicons with at least (≥) 20× coverage |
count_lt_20 | Count of amplicons with less than (<) 20× coverage |
stdev | Standard deviation of coverage over all amplicons |
non_zero_stdev | Standard deviation of coverage over all amplicons excluding zeros |
lineage | WHO lineage assigned by Freyja |
abundance | Abundance of this WHO lineage |
mean_genome_coverage | Mean coverage over whole genome from mosdepth |
sample_id | Sample ID used in the pipeline |
sample_site_code | Sample site location code |
timestamp_sample_collected | Timestamp sample collected |
sequencing_lab_code | Sequencing lab |
original_sample_id | Original metadata sample id |
sequencing_sample_id | Sample ID used in the pipeline |
sequencing_run_id | Run ID for the sample |
amplicon_001_mean_depth | Coverage over this individual amplicon |
... | ... |
amplicon_154_mean_depth | repeated for all amplicons |
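To make the coverage columns concrete, the sketch below derives them from a list of per-amplicon mean depths. It is an illustration of the column definitions in the table, not the pipeline's own code. The function name `amplicon_summary` is hypothetical, and the use of the population standard deviation is an assumption (the pipeline may use the sample form).

```python
import statistics


def amplicon_summary(depths):
    """Summarise per-amplicon depths using the column names from the table.

    Hypothetical helper; expects a non-empty list of numeric depths.
    """
    non_zero = [d for d in depths if d > 0]
    return {
        "amplicon_mean": statistics.mean(depths),
        "non_zero_amplicon_mean": statistics.mean(non_zero) if non_zero else 0.0,
        "amplicon_median": statistics.median(depths),
        "non_zero_amplicon_median": statistics.median(non_zero) if non_zero else 0.0,
        "count_gte_20": sum(1 for d in depths if d >= 20),
        "count_lt_20": sum(1 for d in depths if d < 20),
        # Assumption: population standard deviation (pstdev), not sample stdev
        "stdev": statistics.pstdev(depths),
        "non_zero_stdev": statistics.pstdev(non_zero) if non_zero else 0.0,
    }
```

For example, depths of 0, 10, 20, and 30 give an amplicon_mean of 15 but a non_zero_amplicon_mean of 20, which shows how dropped amplicons pull the "including zeros" figures down.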
The all_lineages.csv file aggregates the Pango lineages that are assigned by Freyja. It is a more fine-grained breakdown of the sample composition than the WHO lineages.
Column | Description |
---|---|
lineage | Pango lineage assigned by Freyja |
abundance | Abundance of this lineage |
sample_id | Sample ID used in the pipeline |
sample_site_code | Sample site location code |
timestamp_sample_collected | Timestamp sample collected |
sequencing_lab_code | Sequencing lab |
original_sample_id | Original metadata sample id |
sequencing_sample_id | Sample ID used in the pipeline |
sequencing_run_id | Run ID for the sample |
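Because each sample can span several rows, downstream analysis usually starts by grouping rows per sample. The following sketch does that for rows read with `csv.DictReader`, using the `sample_id`, `lineage`, and `abundance` columns from the table. The function name `lineages_by_sample` is hypothetical, and it assumes the abundance values parse as floats.

```python
import csv
from collections import defaultdict


def lineages_by_sample(rows):
    """Group (lineage, abundance) pairs by sample_id, most abundant first.

    Hypothetical helper; `rows` is any iterable of dicts with the
    sample_id, lineage, and abundance keys.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sample_id"]].append((row["lineage"], float(row["abundance"])))
    return {
        sample: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for sample, pairs in grouped.items()
    }


if __name__ == "__main__":
    # Path assumes the default results directory described above
    with open("results/all_lineages.csv", newline="") as handle:
        for sample, pairs in lineages_by_sample(csv.DictReader(handle)).items():
            print(sample, pairs[0])
```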