This document describes the output produced by the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- Merge and Filter Reads - Merge and filter input fastq files
- Subsample Reads - Subsample merged and filtered reads
- Align Reads - Align all reads to the reference genome
- Extract UMI - Extract UMI sequences from all reads
- Cluster UMI - Cluster UMI sequences per amplicon
- Filter UMI Clusters - Filter UMI sequences per cluster
- Polish Clusters - Polish clusters and extract consensus reads per cluster
- Align Consensus Reads - Align all consensus reads to the reference genome
- Extract UMIs - Extract UMI sequences of consensus reads
- Cluster Consensus Reads - Cluster consensus reads by their UMIs
- Parse Consensus Clusters - Parse consensus reads of the clusters
- Align Final Consensus Reads - Align all consensus reads to the reference genome
- Call Variants - Perform variant calling on the consensus reads (optional)
- Pipeline Info - Reports from Nextflow about the pipeline run
Input FASTQ files are merged and, if specified by the user, filtered using catfishq.
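For illustration, a minimal Python sketch of the merge-and-filter idea (the pipeline itself uses catfishq; the file names and the length threshold below are assumptions):

```python
# Minimal sketch of merging per-barcode FASTQ files and filtering by read
# length. Illustrative only; the pipeline uses catfishq for this step.
from pathlib import Path

MIN_LENGTH = 500                          # assumed filter threshold
INPUT_DIR = Path("fastq_pass/barcode01")  # assumed input layout

with open("merged_filtered.fastq", "w") as out:
    for fq in sorted(INPUT_DIR.glob("*.fastq")):
        with open(fq) as fh:
            while True:
                record = [fh.readline() for _ in range(4)]  # FASTQ = 4 lines/record
                if not record[0]:
                    break
                if len(record[1].rstrip("\n")) >= MIN_LENGTH:
                    out.writelines(record)
```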
If specified by the user, the merged and filtered reads are subsampled using seqtk sample.
Output directory: <output>/<barcodeXX>/subsampling/
Information about the subsampling is saved in a TSV file.
Output directory: <output>/<barcodeXX>/stats/
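A hedged sketch of the subsampling call, assuming a fixed seed and read count (the pipeline passes its own values):

```python
# Subsample reads with seqtk sample; seed and read count are assumptions.
# seqtk writes to stdout, so the output is redirected to a file.
import subprocess

with open("subsampled.fastq", "w") as out:
    subprocess.run(
        ["seqtk", "sample", "-s", "42", "merged_filtered.fastq", "50000"],
        stdout=out,
        check=True,
    )
```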
The merged (optionally filtered and subsampled) reads are aligned to the provided reference genome using minimap2.
The aligned reads are filtered with a Python script to retain primary alignments with more than 90% overlap with the reference sequence, and the discarded reads are split into separate files by filter reason.
Output directory: <output>/<barcodeXX>/align/<type>/
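A minimal pysam sketch of the filtering step, assuming BAM input and defining overlap as the aligned reference span divided by the reference length (the pipeline script's exact definition may differ):

```python
# Hedged sketch of the alignment filter: keep primary alignments covering
# >90% of the reference. File names are placeholders.
import pysam

MIN_OVERLAP = 0.9

with pysam.AlignmentFile("aligned.bam", "rb") as bam, \
        pysam.AlignmentFile("filtered.bam", "wb", template=bam) as out:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue  # primary alignments only
        ref_len = bam.get_reference_length(read.reference_name)
        if read.reference_length / ref_len > MIN_OVERLAP:
            out.write(read)
```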
The UMI sequences are extracted from the filtered, aligned reads using a Python script. The original read is written into the header of the FASTA/FASTQ record, and the extracted UMI is stored as the read sequence.
Output directory: <output>/<barcodeXX>/<fasta/fastq>_umi/<type>/
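An illustrative sketch of the extraction, assuming fixed-length UMIs at both read ends; the real script's UMI pattern and header format may differ:

```python
# Illustrative UMI extraction: take a fixed-length UMI from each read end,
# emit it as the record's sequence, and keep the original read in the header.
# UMI length and header layout are assumptions.
UMI_LEN = 16  # assumed UMI length per end

def extract_umi(name, seq, qual, out):
    umi = seq[:UMI_LEN] + seq[-UMI_LEN:]    # 5' + 3' UMI halves
    qumi = qual[:UMI_LEN] + qual[-UMI_LEN:]
    out.write(f"@{name};seq={seq}\n{umi}\n+\n{qumi}\n")

with open("filtered.fastq") as fh, open("umis.fastq", "w") as out:
    while True:
        header = fh.readline().rstrip("\n")
        if not header:
            break
        seq = fh.readline().rstrip("\n")
        fh.readline()  # '+' separator line
        qual = fh.readline().rstrip("\n")
        extract_umi(header[1:], seq, qual, out)
```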
The reads are clustered by their UMI sequences using vsearch.
Output directory: <output>/<barcodeXX>/clustering/<type>/
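One plausible vsearch invocation; the identity threshold and output options below are assumptions, not the pipeline's exact parameters:

```python
# Cluster extracted UMI records with vsearch; thresholds are assumptions.
import os
import subprocess

os.makedirs("clusters", exist_ok=True)
subprocess.run(
    [
        "vsearch",
        "--cluster_fast", "umis.fasta",     # UMI records from the previous step
        "--id", "0.85",                     # assumed identity threshold
        "--clusters", "clusters/cluster_",  # one output file per cluster
        "--uc", "clusters.uc",              # tabular cluster membership
        "--strand", "both",
    ],
    check=True,
)
```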
The clusters are filtered to contain only reads whose UMIs are within a defined Hamming distance (default: 3).
Output directory: <output>/<barcodeXX>/clustering/<type>/
Clusters with fewer than a minimum number of reads are excluded.
Clusters with more than a maximum number of reads are downsampled, keeping only the highest-quality reads.
Clusters are parsed and written into individual files (one per cluster) for polishing.
Output directory: <output>/<barcodeXX>/clustering/<type>/clusters
Statistics for every cluster are documented.
Output directory: <output>/<barcodeXX>/stats/<type>/
All of these steps are performed by a Python script.
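A sketch of the filtering logic; apart from the default Hamming distance of 3 mentioned above, the thresholds are assumptions:

```python
# Hedged sketch of cluster filtering: drop reads whose UMI is too far from
# the cluster's representative UMI, discard small clusters, and downsample
# large ones by quality. MIN_READS/MAX_READS are assumed values.
MAX_HAMMING = 3   # default from the text
MIN_READS = 10    # assumed minimum cluster size
MAX_READS = 60    # assumed maximum cluster size

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def filter_cluster(reads, centroid_umi):
    """reads: list of (umi, sequence, mean_quality) tuples."""
    kept = [r for r in reads if hamming(r[0], centroid_umi) <= MAX_HAMMING]
    if len(kept) < MIN_READS:
        return None                  # cluster is excluded
    kept.sort(key=lambda r: r[2], reverse=True)
    return kept[:MAX_READS]          # keep the highest-quality reads
```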
The clusters are polished using medaka, and consensus sequences are created.
Output directory: <output>/<barcodeXX>/clustering/<type>/clusters
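One plausible per-cluster polishing call; how the pipeline actually invokes medaka, and the draft sequence it supplies, are assumptions:

```python
# Polish one cluster with medaka_consensus; file names and the use of a
# per-cluster draft are assumptions, not the pipeline's exact command.
import subprocess

subprocess.run(
    [
        "medaka_consensus",
        "-i", "clusters/cluster_0.fastq",  # reads of one cluster
        "-d", "cluster_0_draft.fasta",     # assumed draft sequence
        "-o", "polished/cluster_0",
        "-t", "4",
    ],
    check=True,
)
```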
Consensus sequences are aligned to the reference again.
Output directory: <output>/<barcodeXX>/align/<type>/
The UMIs of the consensus reads are extracted.
Output directory: <output>/<barcodeXX>/<fasta/fastq>_umi/<type>/
The aligned consensus reads are clustered by their UMIs.
Output directory: <output>/<barcodeXX>/clustering/<type>/
The final consensus sequences of the consensus clusters are parsed using a Python script.
Output directory: <output>/<barcodeXX>/align/<type>/
Consensus sequences are aligned to the reference again.
Output directory: <output>/<barcodeXX>/align/<type>/
Variants of the final consensus sequences are called (optional).
Output directory: <output>/<barcodeXX>/<freebayes/mutserve/lofreq>/
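A minimal sketch of an optional call with freebayes, one of the three supported callers (file names are placeholders):

```python
# Call variants on the aligned final consensus reads with freebayes;
# freebayes writes VCF to stdout, so the output is redirected to a file.
import subprocess

with open("variants.vcf", "w") as vcf:
    subprocess.run(
        ["freebayes", "-f", "reference.fasta", "final_consensus.bam"],
        stdout=vcf,
        check=True,
    )
```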
Nextflow has several built-in reporting tools that give information about the pipeline run.
Output directory: <output>/nextflow_stats/
dag.svg
- A diagram of the workflow's directed acyclic graph (DAG); if written as an MMD file, it can be visualized using Mermaid.
report.html
- Nextflow report describing parameters, computational resource usage and task bash commands used.
timeline.html
- A waterfall timeline plot showing the running times of the workflow tasks.
trace.txt
- A text file with machine-readable statistics about every task executed in the pipeline.
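Since trace.txt is tab-separated, it can be inspected with a few lines of Python; the column names below follow Nextflow's default trace fields and may differ if the pipeline customizes them:

```python
# Print name, status, and wall-clock time for every executed task.
import csv

with open("trace.txt") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        print(row["name"], row["status"], row["realtime"])
```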