A minimal NXF workflow for analysing consortium wastewater data

LooseLab/ww_nf_minimal

Wastewater analysis

Introduction

ww_minimal is a bioinformatics analysis pipeline that performs initial quality control and variant analysis on wastewater sequencing samples. It supports Illumina short reads prepared using the Nimagen primer scheme and sequenced on various platforms (NovaSeq, NextSeq, MiSeq).

Pipeline summary

  1. Merge sequencing FASTQ files (pigz)
  2. Adapter trimming (fastp)
  3. Variant calling
    1. Read alignment (bwa mem)
    2. Sort and index alignments (Samtools)
    3. Primer sequence removal (BAMClipper)
    4. Genome-wide and amplicon coverage (mosdepth, Samtools ampliconstats)
    5. Variant calling (freyja variants/demix; samples that fail this step due to low coverage are omitted from further analysis in the pipeline but are not excluded overall)
    6. Extract WHO and pango lineages (collate_results.py, collate_lineages.py)
    7. Aggregate all sample outputs (xsv)

Quickstart

This pipeline uses conda for environment and package management (miniconda is recommended).

Initialise environment

With [mini]conda installed:

git clone https://github.com/LooseLab/ww_nf_minimal
cd ww_nf_minimal
conda env create -f environment.yml

Run test profile

conda activate ww_minimal
nextflow run main.nf -profile test

Running on real samples

After successfully running the test subset you can run the pipeline on other samples. Read the Input section for how to set up the FASTQ directory and sample sheet. Once these are in place the pipeline can be run like so:

nextflow run /path/to/main.nf --readsdir <FASTQ INPUT DIRECTORY> --sample_sheet <SAMPLE SHEET CSV> -with-report report.html

If Nextflow crashes while running, add the -resume flag to the previous command so that cached jobs are reused and the entire pipeline does not need to be re-run.

Input and Output

Input

There are two required user-supplied inputs: the sample sheet and the FASTQ reads directory. These can be supplied either by adding the sample_sheet and readsdir attributes to params in the nextflow.config file, or on the command line using --sample_sheet and --readsdir. In addition, three static inputs are provided with the workflow (these may change in the future as the primer scheme changes): the reference genome, the paired-end primer file, and the amplicon primer file.

FASTQ

This pipeline expects FASTQ files to be organised inside an input directory, with a subfolder for each sequencing lab and, within that, a subfolder for each run ID. For most labs the share directory can be used directly; however, samples from Exeter require symlinking. An example input directory structure is shown below:

input
├── <LAB1>
│  ├── <RUN1>
│  │  ├── SAMPLE_R1_L002_001.fastq.gz
│  │  └── SAMPLE_R2_L002_001.fastq.gz
│  └── <RUN2>
│     ├── SAMPLE_R1_L002_001.fastq.gz
│     └── SAMPLE_R2_L002_001.fastq.gz
└── <LAB2>
   └── <RUN1>
      ├── SAMPLE_L001_R1_001.fastq.gz
      ├── SAMPLE_L001_R2_001.fastq.gz
      ├── ...
      ├── SAMPLE_L004_R1_001.fastq.gz
      └── SAMPLE_L004_R2_001.fastq.gz
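
As a rough illustration of the layout above, the following Python sketch collects the R1 and R2 files for one sample. It is not the pipeline's own file discovery: the function name find_fastqs is hypothetical, and it assumes that file names begin with the sample name followed by an underscore and that the read token (R1 or R2) may appear either before or after the lane token, as in the two labs shown.

```python
from pathlib import Path


def find_fastqs(readsdir, lab, run, sample):
    """Return sorted lists of R1 and R2 FASTQ paths for one sample.

    Assumption: files live in readsdir/<LAB>/<RUN>/ and are named
    <sample>_*.fastq.gz; the real pipeline may match files differently.
    """
    run_dir = Path(readsdir) / lab / run
    files = sorted(run_dir.glob(f"{sample}_*.fastq.gz"))
    # The read token can sit before or after the lane token
    # (SAMPLE_R1_L002_001 vs SAMPLE_L001_R1_001), so check every
    # underscore-separated part of the file name.
    r1 = [f for f in files if "R1" in f.name.split("_")]
    r2 = [f for f in files if "R2" in f.name.split("_")]
    return r1, r2
```

Multi-lane runs such as the LAB2 example return several files per read direction, which corresponds to the FASTQ merge step in the pipeline summary. Note that a plain prefix match like this would also pick up a second sample whose name extends the first, so a stricter pattern may be needed for real data.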

Sample sheet CSV

The CSV sample sheet is required because it tells the pipeline which samples should be analysed. It currently requires six fields:

  1. sample_id
  2. sample_site_code
  3. timestamp_sample_collected
  4. sequencing_lab_code
  5. sequencing_sample_id
  6. sequencing_run_id

These are used to find the input FASTQ files in the readsdir. All fields are passed through to the aggregation steps at the end of the pipeline.
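
Before launching a run it can be useful to check that a sample sheet carries all six fields. The sketch below is one possible pre-flight check, not part of the pipeline; the function name read_sample_sheet is hypothetical, and it assumes the CSV has a header row whose column names match the field names listed above.

```python
import csv

# The six fields listed in this README.
REQUIRED_FIELDS = [
    "sample_id",
    "sample_site_code",
    "timestamp_sample_collected",
    "sequencing_lab_code",
    "sequencing_sample_id",
    "sequencing_run_id",
]


def read_sample_sheet(path):
    """Load the sample sheet, raising ValueError if a field is missing.

    Assumption: the first row of the CSV is a header row.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [f for f in REQUIRED_FIELDS if f not in header]
        if missing:
            raise ValueError(
                "sample sheet is missing fields: " + ", ".join(missing)
            )
        return list(reader)
```

Each returned row is a dictionary keyed by field name, so the lab, run, and sequencing sample IDs needed to locate FASTQ files in the readsdir can be read directly from it.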

Output

By default, output files are written to a results directory in the location the pipeline is called from. This folder is organised by lab and run, with a subfolder for each step that emits files, like so:

results
├── aggregated.csv
├── all_lineages.csv
├── <LAB1>
│  ├── <RUN1>
│  │  ├── alignments
│  │  ├── ampliconstats
│  │  ├── bamclipper
│  │  ├── freyja
│  │  ├── mosdepth
│  │  ├── stats_csv
│  │  └── trimmed
│  └── <RUN2>
│     ├── alignments
│     ├── ampliconstats
│     ├── bamclipper
│     ├── freyja
│     ├── mosdepth
│     ├── stats_csv
│     └── trimmed
└── <LAB2>
   └── <RUN1>
      ├── alignments
      ├── ampliconstats
      ├── bamclipper
      ├── freyja
      ├── mosdepth
      ├── stats_csv
      └── trimmed

Outputs organised in directories under a <RUN ID> are the raw outputs from the steps listed in the pipeline summary. The aggregated outputs are placed at the top level because they combine data from all of the sequencing labs and runs.

aggregated.csv

This CSV file aggregates the WHO lineages, their frequencies, and sequencing depths for all samples that complete the analysis. As multiple lineages may be present, multiple rows can be returned for a single sample.

| Column | Description |
| --- | --- |
| amplicon_mean | Mean coverage over all amplicons including zeros |
| non_zero_amplicon_mean | Mean coverage over amplicons excluding zeros |
| amplicon_median | Median coverage over all amplicons including zeros |
| non_zero_amplicon_median | Median coverage over amplicons excluding zeros |
| count_gte_20 | Count of amplicons with at least (≥) 20× coverage |
| count_lt_20 | Count of amplicons with less than (<) 20× coverage |
| stdev | Standard deviation of coverage over all amplicons |
| non_zero_stdev | Standard deviation of coverage over all amplicons excluding zeros |
| lineage | WHO lineage assigned by Freyja |
| abundance | Abundance of this WHO lineage |
| mean_genome_coverage | Mean coverage over whole genome from mosdepth |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp sample collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample id |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
| amplicon_001_mean_depth | Coverage over this individual amplicon |
| ... | ... |
| amplicon_154_mean_depth | Repeated for all amplicons |
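
To make the difference between the "including zeros" and "excluding zeros" columns concrete, here is a small Python sketch that derives the same kind of summary from a list of per-amplicon mean depths. It is illustrative only and is not the pipeline's code: the function name summarise_amplicons is hypothetical, and the use of the population standard deviation is an assumption, since this README does not say which form the pipeline uses.

```python
import statistics


def summarise_amplicons(depths, threshold=20):
    """Summarise per-amplicon depths in the style of aggregated.csv.

    Assumption: population standard deviation (pstdev); the pipeline
    may use the sample standard deviation instead.
    """
    # Amplicons with no coverage are dropped for the non_zero_* columns.
    non_zero = [d for d in depths if d > 0]
    return {
        "amplicon_mean": statistics.mean(depths),
        "non_zero_amplicon_mean": statistics.mean(non_zero) if non_zero else 0.0,
        "amplicon_median": statistics.median(depths),
        "non_zero_amplicon_median": statistics.median(non_zero) if non_zero else 0.0,
        "count_gte_20": sum(1 for d in depths if d >= threshold),
        "count_lt_20": sum(1 for d in depths if d < threshold),
        "stdev": statistics.pstdev(depths),
        "non_zero_stdev": statistics.pstdev(non_zero) if non_zero else 0.0,
    }
```

For example, with depths of 0, 10, 20 and 30 the mean including zeros is 15, while the mean excluding zeros is 20. Two amplicons reach the 20× threshold and two fall below it.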

all_lineages.csv

This CSV file aggregates Pango lineages that are assigned by Freyja. It is a more fine-grained breakdown of the sample composition than the WHO lineages.

| Column | Description |
| --- | --- |
| lineage | Pango lineage assigned by Freyja |
| abundance | Abundance of this lineage |
| sample_id | Sample ID used in the pipeline |
| sample_site_code | Sample site location code |
| timestamp_sample_collected | Timestamp sample collected |
| sequencing_lab_code | Sequencing lab |
| original_sample_id | Original metadata sample id |
| sequencing_sample_id | Sample ID used in the pipeline |
| sequencing_run_id | Run ID for the sample |
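
Because a sample can appear on several rows, one per lineage, downstream analysis usually starts by grouping rows by sample. The sketch below shows one way to do that with the Python standard library. The function name lineages_by_sample is hypothetical, and it assumes the file has a header row with the column names above and that abundance values parse as floating-point numbers.

```python
import csv
from collections import defaultdict


def lineages_by_sample(path):
    """Map each sample_id to its (lineage, abundance) pairs.

    Pairs are sorted from most to least abundant.
    Assumption: abundance is stored as a plain decimal number.
    """
    grouped = defaultdict(list)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            pair = (row["lineage"], float(row["abundance"]))
            grouped[row["sample_id"]].append(pair)
    for pairs in grouped.values():
        pairs.sort(key=lambda p: p[1], reverse=True)
    return dict(grouped)
```

The same function should work on aggregated.csv, which shares the lineage, abundance, and sample_id columns, in which case the lineages returned are WHO lineages rather than Pango lineages.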
