- a proper installation of miniconda3 (
~/miniconda3
) - Dorado should be in the PATH.
- Docker should run without sudo. To do so:
sudo groupadd docker sudo usermod -aG docker username su - username docker run hello-world
-
git clone this repository
git clone https://github.com/ziphra/promline
-
Create a
conda env
frompromline.yml
cd promline conda env create -f promline.yml conda activate promline mv dorado_models ${CONDA_PREFIX}/bin/ mv clair3_models ${CONDA_PREFIX}/bin/
The folders dorado_models and clair3_models contain some available basecalling models for Dorado and variant calling models for Clair3. The most up-to-date models can be downloaded from the Dorado GitHub page for Dorado models or from the Rerio GitHub page for Clair3 models.
-
Make the script executable:
chmod +x pipeline.sh
USAGE: pipeline2.sh [flags] args
flags:
-w,--base: working directory (default: '.')
-s,--sample: sample name (default: 'JohnDoe')
-R,--ref: reference genome (default: '')
-f,--fast5: directory containing raw fast5 (default: '')
-p,--pod5: directory for fast5 to pod5 conversion output or already containing pod5 (default: '')
-q,--fastqs: basecalling directory from guppy or dorado, in case of real time basecalling during sequencing. This pipeline will only use pass reads. The sequencing summary should be in this directory. If one fastq file is provided, instead of a guppy basecalling directory, a copy of the sequencing summary should be placed in the base directory (-w)
(default: '')
-S,--summary: path to sequencing_summary.txt generated during real time sequencing. Provide a sequencing_summary.txt if you start the pipeline with a bam (default: '')
-k,--bam: path to aligned BAM. File has to be indexed. (default: '')
-c,--snp_caller: snp caller: either pmdv, clair3 or all (default: 'clair3')
-B,--[no]basecalling: basecalling (default: false)
-M,--[no]modified: modified bases calling (default: false)
-A,--[no]alignment: alignment (default: true)
-r,--flowcell: flowcell and basecalling model: either r9 or r10 (default: 'r10')
-b,--bps: bases called per second (default: '400')
-a,--acc: basecalling accuracy (default: 'sup')
-d,--[no]duplex: duplex basecalling (default: false)
-t,--threads: max number of threads to use (default: '')
-v,--[no]version: tools versions (default: false)
-h,--help: show this help (default: false)
Run as:
pipeline2.sh -w ./workingdirectory \
-s samplename \
-R ref.fa \
-f fast5 \
-p pod5 \
-M \
--bps 260 \
--acc fast
-B \
-A \
-c clair3 \
-r r10 \
-t 32 2>&1 | tee log.txt
2>&1 | tee log.txt
allow to store the pipeline log to log.txt
.
Some tools will use GPUs when available.
A folder structured as follow:
- dorado
- unaligned bam file
- mmi
- aligned bam file and its index
- vc
- clair3
- merge_output.vcf.gz
- phased_output.bam
- ...
- sniffles
- sniffles_BND.vcf
- noQC_snifles.vcf
- cnvpytor
- sample500.xlsx
- sample1000.xlsx
- clair3
- QC.html
- sequencing_summary.txt
The pod5 file format is used for storing nanopore sequencing data. It requires less space than fast5 and offers improved read/write performance. The first step of this pipeline is to convert fast5 files to pod5 format. This conversion is performed if the pod5 folder doesn't exist yet or if it is empty. To initiate the conversion, use the -f and -p options. By providing -f
and -p
, the first step of this pipeline will be to convert fast5 to pod5, if the pod5 folder doesn't exist yet or if it is empty.
2. basecalling with dorado
During the basecalling process, you can choose the basecalling speed, basecalling accuracy, and whether you want modified basecalling. The pipeline now supports stereo duplex basecalling as well. However, please note that stereo duplex and modified basecalling cannot be used simultaneously.
minimap2
The minimap2 tool is used for the alignment step. If fastqs are provided, they can be used with the -q option. For example, if basecalling was done directly on the sequencer.
pycoQC
For quality control, the pycoQC tool is utilized. If fastqs are provided, the pipeline will search for a sequencing summary in the fastqs directory to perform the quality check. You can also provide a sequencing summary using the -S option.
Sniffles
is the structural variant caller recommended by nanopore.
Sniffles will be run twice, with and without the quality filtering of variants.
The recommended structural variant caller is Sniffles, which will be run twice, once with quality filtering of variants and once without. Additionally, a tandem-repeats file is provided to aid Sniffles in improving variant calling in repetitive regions. The tandem-repeats file can be found here and is also available in the current directory. After SV calling, BND variants will be duplicated to create lines in the VCF with their BND mate's coordinates, allowing both breakpoints to be represented and equally annotated in subsequent steps.
Small variants calling can be done using either Clair3, the caller recommended by nanopore, or PEPPER-Margin-DeepVariant, or both.
CNV calling is performed using cnvpytor with txo different bin size: 500 and 1000.