NGSeasy_logo

NGSeasy (beta)

** This is the latest dev project **
Note: NGSeasy is undergoing a major redevelopment and many links are broken. Stay tuned, email us if you need help, and give us a few weeks.

Funded by Biomedical Research Centre: http://core.brc.iop.kcl.ac.uk

Publication: pending

Authors: Stephen J Newhouse, Amos Folarin , Maximilian Kerz
Release Version: 1.0.0


[A Dockerized NGS pipeline and tool-box]



With NGSeasy you can now have a full suite of NGS tools up and running on any high-end workstation in an afternoon.

Note: NGSeasy is under heavy development and the code and docs evolve quickly.

  • NGSeasy-v1.0 Full Production release will be available Early 2015

  • NGSeasy-v1.0.0b Full Production release contains most of the core functionality to go from raw fastq to raw vcf calls

  • NGSeasy updates every 12 months

  • GUI in development

  • Contact us for Reference Genomes and resource files


NGSeasy is completely open source and we encourage interested folks to jump in and get involved in the dev with us.


NGSeasy (Easy Analysis of Next Generation Sequencing)

We present NGSeasy (Easy Analysis of Next Generation Sequencing), a flexible and easy-to-use NGS pipeline for automated alignment, quality control, variant calling and annotation. The pipeline allows users with minimal computational/bioinformatic skills to set up and run an NGS analysis on their own samples, in less than an afternoon, on any operating system (Windows, OS X or Linux) or infrastructure (workstation, cluster or cloud).

NGS pipelines typically utilize a large and varied range of software components and incur a substantial configuration burden during deployment which limits their portability to different computational environments. NGSeasy simplifies this by providing the pipeline components encapsulated in Docker™ containers and bundles in a wide choice of tools for each module. Each module of the pipeline represents one functional grouping of tools (e.g. sequence alignment, variant calling etc.).

Deploying the pipeline is as simple as pulling the container images from the public repository into any host running Docker. NGSeasy can be deployed on any medium to high-end workstation, high performance computer cluster or compute cloud (public/private cloud computing). This gives instant access to elastic scalability without investment overheads for additional compute hardware, and makes open and reproducible research straightforward for the greater scientific community.

Advantages

  • Easy to use for non-informaticians.
  • All run from a single config file that can be made in Excel.
  • User can select from multiple aligners, variant callers and variant annotators
  • No scary python, .yaml or .json files...just one simple Excel workbook saved as a text file.
  • Just follow our simple set of instructions and NGS away!
  • Allows reproducible research
  • Version controlled for auditing
  • Customisable
  • Easy to add new tools
  • If it's broke...we will fix it..
  • Enforced naming convention and directory structures
  • Allows users to run "Bake Offs" between tools with ease

We have adapted the current best practices from the Genome Analysis Toolkit (GATK, http://www.broadinstitute.org/gatk/guide/best-practices) for processing raw alignments in SAM/BAM format and variant calling. The current workflow has been optimised for Illumina platforms, but can be adapted for other sequencing platforms with minimal effort.

As the containers themselves can be run as executables with pre-specified cpu and RAM resources, the orchestration of the pipeline can be placed under the control of conventional load balancers if this mode is required.


Author Contact Details

Please contact us for help/guidance on using the beta release.

View Amos's profile on LinkedIn View Steve's profile on LinkedIn

Let us know if you want other tools added to NGSeasy.

Institution: NIHR Maudsley Biomedical Research Centre For Mental Health and Dementia Unit (Denmark Hill), at The Institute of Psychiatry, Psychology & Neuroscience (IoPPN), King's College London


Overview of the NGSeasy Pipeline Components

The basic pipeline contains all the basic tools needed for manipulation and quality control of raw fastq files (ILLUMINA focused), SAM/BAM manipulation, alignment, cleaning (based on GATK best practices [http://www.broadinstitute.org/gatk/guide/best-practices]) and first pass variant discovery. Separate containers are provided for in-depth variant annotation, structural variant calling, basic reporting and visualisations.

ngsEASY


The Full NGSeasy pipeline

The NGSeasy pipelines implement the following :-

For academic users and/or commercial/clinical groups who have paid for GATK licensing, the next steps are to perform

For the non-GATK version

Note: Some of the later functions, i.e. variant annotation and QC reporting, are still in development.


We highly recommend read trimming prior to alignment. We have noticed considerable speed-ups in alignment time and increased quality of SNP/INDEL calls using trimmed vs raw fastq.

Base quality score recalibration is also recommended.
As an alternative to GATK, we have added functionality for use of BamUtil:recab for base quality score recalibration.

Non-GATK users

  • are encouraged to use aligners such as stampy and novoalign that perform base quality score recalibration on the fly.
  • are encouraged to use variant callers that perform local re-alignment around candidate sites to mitigate the need for the indel realignment stages.

Coming Soon


A Special note on the base image.

We include the following - what we think of as - NGS Powertools in the compbio/ngseasy-base image. These are all tools that allow the user to slice and dice BED/SAM/BAM/VCF files in multiple ways.

  1. samtools
  2. bcftools
  3. vcftools
  4. vcflib
  5. bamUtil
  6. bedtools2
  7. ogap
  8. samblaster
  9. sambamba
  10. bamleftalign
  11. seqtk
  12. parallel

This image is used as the base of all our compbio/ngseasy-* tools.

Why not a separate container per application? The more docker-esque approach would be to have separate containers for each NGS tool. However, this belies the fact that many of these tools interact in a deep way, allowing pipes and streamlined system calls for manipulating the output of NGS pipelines (BED/SAM/BAM/VCF files). Therefore, we built these into a single development environment for ngseasy.


Dockerised NGSeasy

docker

The following section describes getting the Dockerised NGSeasy Pipeline(s) and Resources, project set up and running NGSeasy.

Getting all resources and building required tools will take a few hours depending on network connections and any random "ghosts in the machine" - half a day in reality. But once you're set up, that's it - you are good to go.


1. Install Docker

Follow the simple instructions in the links provided below.

A full set of instructions for multiple operating systems is available on the Docker website.

2. Get NGSeasy

We provide a simple Makefile to pull all of the public ngseasy components and scripts, and to set up the correct project directory structure on your local machine.


git clone https://github.com/KHP-Informatics/ngseasy.git

cd ngseasy

make all

Setting up the initial project can take up to a day, depending on your local network connections and speeds.

3. Set up NGSeasy Project configuration file

In Excel, make the config file and save it as a [TAB] delimited file with a .tsv extension.
See Example provided and GoogleDoc. Remove the header from this file before running the pipeline. This sets up information related to: Project Name, Sample Name, Library Type, Pipeline to call, NCPU.

The [config.file.tsv] should contain the following columns for each sample to be run through a pipeline:-

| Variable | Type | Description | Options/Examples |
|---|---|---|---|
| PROJECT_ID | string | Project ID | Cancer |
| SAMPLE_ID | string | Sample ID | T100 |
| FASTQ1 | string | Raw fastq file name read 1 | foo_1.fq.gz |
| FASTQ2 | string | Raw fastq file name read 2 | foo_2.fq.gz |
| PROJECT_DIR | string | Project Directory | /media/ngs_projects |
| DNA_PREP_LIBRARY_ID | string | DNA Library Prep ID | Custom_Cancer |
| NGS_PLATFORM | string | Platform Name | ILLUMINA |
| NGS_TYPE | string | Experiment type | WGS, WEX, TGS |
| BAIT | string | user supplied bed file | |
| CAPTURE | string | user supplied bed file | |
| FASTQC | string | run FastQc | skip, qc-fastq |
| TRIM | string | run Trimmomatic | skip, qc-trimm, qc-adaptor |
| BSQR | string | Base Quality Score Recalibration | skip, bam-recab, gatk-recab |
| REALN | string | Bam Realignment around indels | skip, bam-realn, gatk-realn |
| ALIGNER | string | Aligner | skip, bwa, bowtie2, stampy, snap, novoalign |
| VARCALLER | string | Variant Caller | ensemble, ensemble-fast, freebayes, platypus, UnifiedGenotyper, HaplotypeCaller |
| CNV | string | CNV Caller | skip, lump, delly, exomedepth |
| ANNOTATOR | string | Choose annotator | skip |
| CLEANUP | string | Clean Up Files (TRUE/FALSE) | TRUE/FALSE |
| NCPU | number | Number of cores to call | 1..n |
| VERSION | number | NGSeasy Version | 1.0 |
| NGSUSER | string | user email address | [email protected] |
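Since the header is removed before the pipeline runs, a row with a missing or extra field is easy to overlook. A minimal sketch of a pre-flight check follows; `check_config_columns` is a hypothetical helper, not part of NGSeasy, and the expected column count is passed in by the caller rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper (not part of NGSeasy): verify that every non-empty line
# of a tab-delimited config file has exactly the expected number of fields.
check_config_columns() {
    config_tsv=$1
    expected=$2
    awk -F '\t' -v n="$expected" '
        NF == 0 { next }    # ignore blank lines
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END     { exit bad + 0 }
    ' "$config_tsv"
}

# Example: a three-column scratch file whose second line is missing a field.
tmp=$(mktemp)
printf 'Cancer\tT100\tfoo_1.fq.gz\n' >  "$tmp"
printf 'Cancer\tT101\n'              >> "$tmp"
check_config_columns "$tmp" 3 || echo "config file needs fixing"
rm -f "$tmp"
```

For a real run you would call it with your own file and the number of columns your NGSeasy version expects.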

4. Run NGSeasy


All NGSeasy Docker images can be pulled down from compbio Docker Hub or using the Makefile.
We provide an Amazon EBS data volume with indexed genomes: XXXXXX


Dockerised NGS Tools

The following open source tools are all provided.

| Tool | Build |
|---|---|
| ngseasy-base | automated build |
| fastqc | automated build |
| trimmomatic | automated build |
| bwa | automated build |
| bowtie | automated build |
| picardtools | automated build |
| samtools | automated build |
| freebayes | automated build |
| bedtools | automated build |
| bcbiovar | automated build |
| delly | automated build |
| lumpy | automated build |
| cnmops | automated build |
| mhmm | automated build |
| exomedepth | automated build |
| bamutil | automated build |

samtools includes bcftools and htslib

It's as easy as:-

docker pull compbio/ngseasy-${TOOL}
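To fetch several images in one go, the pull command can be wrapped in a loop. This is a sketch only: the tool names come from the table above, and it is an assumption that each one is published as compbio/ngseasy-${TOOL}, so check the Docker Hub listing first. `pull_ngseasy_images` and `DRY_RUN` are hypothetical names; by default the function only prints the commands.

```shell
#!/bin/sh
# Sketch: pull several NGSeasy images in one go.
# Image names are ASSUMED to follow compbio/ngseasy-${TOOL}; confirm the exact
# repository names and tags on Docker Hub before relying on this list.
TOOLS="fastqc trimmomatic bwa picardtools samtools freebayes"

pull_ngseasy_images() {
    for TOOL in $TOOLS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print what would be run.
            echo "docker pull compbio/ngseasy-${TOOL}"
        else
            docker pull "compbio/ngseasy-${TOOL}" || return 1
        fi
    done
}

pull_ngseasy_images                 # prints the commands
# DRY_RUN=0 pull_ngseasy_images     # actually pulls the images
```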

The NGSeasy project directory

The user needs to make the relevant directory structures on their local machine before starting an NGS run.

On our system we typically set up a top-level directory called ngs_projects within which we store output from all our individual NGS projects.

Here we are working from a local top-level directory called media/, but this can really be any folder on your local system, e.g. your home directory (${HOME}).

Within this directory media we make the following folders:-

ngs_projects  
|  
|__fastq_raw  
|__config_files  
|__reference_genomes_b37  
|__gatk_resources  
|__ngseasy

Note: The following directories are obtained in step 4. Download NGSeasy Resources.

  • reference_genomes_b37
  • gatk_resources

Move to media

# Move to media/
cd media

make toplevel ngs_projects folder

# make toplevel NGS folder
mkdir ngs_projects 

make fastq_raw folder

# fastq staging area
mkdir ngs_projects/fastq_raw 

make config_files folder

# config files
mkdir ngs_projects/config_files 

make ngseasy folder

# NGSeasy scripts
mkdir ngs_projects/ngseasy 
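The same folders can be created in one step with `mkdir -p`, which also makes the command safe to re-run. A minimal sketch, where `make_ngs_layout` is a hypothetical helper and the parent directory is whatever you use in place of media/:

```shell
#!/bin/sh
# Sketch: create the top-level NGSeasy layout under a chosen parent directory.
# make_ngs_layout is a hypothetical helper; pass /media or any folder you own.
make_ngs_layout() {
    parent=$1
    for d in fastq_raw config_files ngseasy; do
        # -p creates ngs_projects as needed and is a no-op if the folder exists
        mkdir -p "${parent}/ngs_projects/${d}" || return 1
    done
}

# Example run in a throwaway location
demo=$(mktemp -d)
make_ngs_layout "$demo"
ls "$demo/ngs_projects"
rm -rf "$demo"
```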

4. Download NGSeasy Resources

Download the indexed reference genomes and example data for use with NGSeasy.

NGSeasy Resources:-

  • reference_genomes_b37.tgz b37 reference genomes indexed for use with all provided aligners (BWA, Bowtie2, Stampy, Novoalign) and annotation bed files for use with pipeline scripts
  • gatk_resources.tar.gz gatk resources bundle
  • fastq_example.tgz Example 75bp PE Illumina Whole Exome Sequence fastq data for NA12878
  • Annotation Databases Coming in the next update

Download the data to the top level directory

FTP Details

  • ftp: 159.92.120.21
  • user: compbio-public
  • pwd: compbio-public
  • port: 21

Move to top level directory

cd ngs_projects

FTP NGSeasy Resources

ftp 159.92.120.21

mget NGSeasy Resources

ftp> cd /Public/NGSeasy_Public_Resources
ftp> prompt off
ftp> mget *gz
ftp> exit

We recommend using a separate program like FileZilla, which makes it much easier to set up and manage your file transfers.

Extract NGSeasy Resources

# Extract resources
cd /media/ngs_projects
tar xvf gatk_resources.tgz
# decompress the individual resource files
cd /media/ngs_projects/gatk_resources
gunzip *

Extract NGSeasy Reference Genomes

# Extract Reference Genomes
cd /media/ngs_projects
tar xvf reference_genomes_b37.tgz
# decompress the individual reference files
cd /media/ngs_projects/reference_genomes_b37
gunzip *
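Large downloads are sometimes truncated, so it can help to test each archive before unpacking it. The sketch below loops over every .tgz and .tar.gz file in a directory, lists it with `tar -tzf` as an integrity check, and only then extracts it. `extract_archives` is a hypothetical helper, not an NGSeasy script.

```shell
#!/bin/sh
# Sketch: unpack every .tgz / .tar.gz archive found in a directory.
# extract_archives is a hypothetical helper, not an NGSeasy script.
extract_archives() {
    dir=$1
    for archive in "$dir"/*.tgz "$dir"/*.tar.gz; do
        [ -e "$archive" ] || continue    # the pattern matched nothing
        # Listing the archive first catches truncated or corrupt downloads
        if tar -tzf "$archive" > /dev/null 2>&1; then
            tar -xzf "$archive" -C "$dir" || return 1
        else
            echo "ERROR: $archive looks incomplete or corrupt" >&2
            return 1
        fi
    done
}

# Usage:
# extract_archives /media/ngs_projects
```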

GATK Resources

Downloading

location: ftp.broadinstitute.org
username: gsapubftp-anonymous
password: <blank>

b37 Resources: the Standard Data Set

  • Reference sequence (standard 1000 Genomes fasta) along with fai and dict files
  • dbSNP in VCF. This includes two files:
    • The most recent dbSNP release
    • This file subsetted to only sites discovered in or before dbSNPBuildID 129, which excludes the impact of the 1000 Genomes project and is useful for evaluation of dbSNP rate and Ti/Tv values at novel sites.
  • HapMap genotypes and sites VCFs
  • OMNI 2.5 genotypes for 1000 Genomes samples, as well as sites, VCF
  • The current best set of known indels to be used for local realignment (note that we don't use dbSNP for this anymore); use both files:
    • 1000G_phase1.indels.b37.vcf (currently from the 1000 Genomes Phase I indel calls)
    • Mills_and_1000G_gold_standard.indels.b37.sites.vcf
  • A large-scale standard single sample BAM file for testing:
    • NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.bam containing ~64x reads of NA12878 on chromosome 20
    • The results of the latest UnifiedGenotyper with default arguments run on this data set (NA12878.HiSeq.WGS.bwa.cleaned.recal.hg19.20.vcf)

5. Get NGSeasy Scripts

We then need to get the latest NGSeasy scripts from GitHub. The user is required to download the scripts to the ngseasy directory.

move to the ngseasy directory

cd /media/ngs_projects/ngseasy

clone the ngs repository

git clone https://github.com/KHP-Informatics/ngs.git

add ngseasy/ngs/bin to your system PATH

export PATH=$PATH:/media/ngs_projects/ngseasy/ngs/bin

or add it to your ~/.bashrc

echo 'export PATH=$PATH:/media/ngs_projects/ngseasy/ngs/bin' >> ~/.bashrc
source ~/.bashrc
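Appending to a start-up file every time the set-up is repeated leaves duplicate lines behind. A small sketch that adds the line only if it is not already present; `add_path_line` is a hypothetical helper, and the example writes to a scratch file rather than the real ~/.bashrc.

```shell
#!/bin/sh
# Sketch: append the PATH line to a shell start-up file only once.
# add_path_line is a hypothetical helper, not part of NGSeasy.
add_path_line() {
    rcfile=$1
    bindir=$2
    line="export PATH=\$PATH:${bindir}"
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}

# Example against a scratch file
rc=$(mktemp)
add_path_line "$rc" /media/ngs_projects/ngseasy/ngs/bin
add_path_line "$rc" /media/ngs_projects/ngseasy/ngs/bin   # second call adds nothing
cat "$rc"
rm -f "$rc"
```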

alternatively download the scripts from our GitHub Release


6. Manually Build required NGSeasy Container Images

Currently we are not able to automatically build some of the tools in pre-built docker containers due to licensing restrictions.

Some of the software has restrictions on use, particularly for commercial purposes. Therefore, if you wish to use this for commercial purposes, you are legally required to approach the owners of the various components yourself!

Software composing the pipeline requiring registration:-

These tools require manual download and registration with the provider. Non-academic/commercial groups will need to pay for some of these tools.

Dockerised and Manual Builds

| Tool | Build |
|---|---|
| novoalign | manual build |
| annovar | manual build |
| stampy | manual build |
| platypus | manual build |
| gatk | manual build |

Once you have paid/registered and downloaded the tool, we provide scripts and guidance for building these tools on your system.

It's as easy as:-

docker build -t compbio/ngseasy-${TOOL} .

6.1 Building Stampy

Register at http://www.well.ox.ac.uk/project-stampy

Download stampy to local directory and check version number. If this differs from the Dockerfile build file, then edit the Dockerfile if needed. You will be emailed a URL to download stampy. Insert this into the Dockerfile

# on our local system we cd to media
cd /media

# then move to the ngs_projects top-level directory
cd ngs_projects

# and then the ngseasy folder with all our ngs scripts
# (git clone https://github.com/KHP-Informatics/ngs.git here first if you haven't already)
cd ngseasy

# move to ngseasy_stampy folder
cd ngs/ngs_docker_debian/ngseasy_stampy

# build
docker build -t compbio/ngseasy-stampy:v1.0 .

6.2 Building Platypus

Register at http://www.well.ox.ac.uk/platypus

Download platypus to local directory and check version number. If this differs from the Dockerfile build file, then edit the Dockerfile if needed. You will be emailed a URL to download platypus. Insert this into the Dockerfile

# on our local system we cd to media
cd /media

# then move to the ngs_projects top-level directory
cd ngs_projects

# and then the ngseasy folder with all our ngs scripts
# (git clone https://github.com/KHP-Informatics/ngs.git here first if you haven't already)
cd ngseasy

# move to ngseasy_platypus folder
cd ngs/ngs_docker_debian/ngseasy_platypus

# build
docker build -t compbio/ngseasy-platypus:v1.0 .

6.3 Building NOVOALIGN

Download Novoalign from http://www.novocraft.com/ into the local build directory ngs/ngs_docker_debian/ngseasy_novoalign. Edit the Dockerfile to reflect the correct version of novoalign.

To use all novoalign functionality, you will need to pay for a license.

Once you have obtained your novoalign.lic, place it in the build directory ngs/ngs_docker_debian/ngseasy_novoalign, which should now contain your updated Dockerfile.

# on our local system we cd to media
cd /media

# then move to the ngs_projects top-level directory
cd ngs_projects

# and then the ngseasy folder with all our ngs scripts
# (git clone https://github.com/KHP-Informatics/ngs.git here first if you haven't already)
cd ngseasy

# move to ngseasy_novoalign folder
cd ngs/ngs_docker_debian/ngseasy_novoalign
ls 

the directory should contain the following:-

Dockerfile
novoalign.lic
README.md
novosortV1.03.01.Linux3.0.tar.gz
novocraftV3.02.08.Linux3.0.tar.gz

build novoalign

# build
docker build -t compbio/ngseasy-novoalign:v1.0 .
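A manual build fails late if a licence file or tarball is missing from the build directory, so a quick check before calling docker build can save time. A sketch under the assumption that you know which files your Dockerfile needs; `check_build_dir` is a hypothetical helper, not part of NGSeasy.

```shell
#!/bin/sh
# Sketch: confirm a build directory holds the files a manual build needs
# before calling docker build. check_build_dir is a hypothetical helper.
check_build_dir() {
    build_dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -e "${build_dir}/${f}" ]; then
            echo "missing: ${build_dir}/${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, run from the ngseasy folder:
# check_build_dir ngs/ngs_docker_debian/ngseasy_novoalign Dockerfile novoalign.lic \
#   && docker build -t compbio/ngseasy-novoalign:v1.0 ngs/ngs_docker_debian/ngseasy_novoalign
```

The same check works for the other manual builds by passing a different directory and file list.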

6.4 Building GATK

You need to register and accept the GATK license agreement at https://www.broadinstitute.org/gatk/.

Once done, download GATK and place it in the GATK build directory ngs/ngs_docker_debian/ngseasy_gatk.

Edit the Dockerfile to reflect the correct version of GATK.

# on our local system we cd to media
cd /media

# then move to the ngs_projects top-level directory
cd ngs_projects

# and then the ngseasy folder with all our ngs scripts
# (git clone https://github.com/KHP-Informatics/ngs.git here first if you haven't already)
cd ngseasy

# move to ngseasy_gatk folder
cd ngs/ngs_docker_debian/ngseasy_gatk
ls 

the directory should contain the following:-

Dockerfile
README.md
GenomeAnalysisTK-3.3-0.tar.bz2

build gatk

# build
docker build -t compbio/ngseasy-gatk:v1.0 .

7. Manually Build NGSeasy Variant Annotation Container Images

The tools used for variant annotation use large databases and the docker images exceed 10GB. Therefore, the user should manually build these container images prior to running the NGS pipelines. Docker build files (Dockerfile) are available for Annovar, VEP and snpEff.

Note: Annovar requires user registration.

Once built on the user system, these container images can persist for as long as the user wants.

Large Variant Annotation Container Images

| Tool | Build |
|---|---|
| annovar | manual build |
| vep | manual build |
| snpeff | manual build |

It's as easy as:-

docker build -t compbio/ngseasy-${TOOL} .

7.1 Build VEP

cd /media/ngs_projects/ngseasy/ngs/containerized/ngs_docker_debian/ngseasy_vep

sudo docker build -t compbio/ngseasy-vep:${VERSION} .

7.2 Build Annovar

cd /media/ngs_projects/ngseasy/ngs/containerized/ngs_docker_debian/ngseasy_annovar

sudo docker build -t compbio/ngseasy-annovar:${VERSION} .

7.3 Build snpEff

cd /media/ngs_projects/ngseasy/ngs/containerized/ngs_docker_debian/ngseasy_snpeff

sudo docker build -t compbio/ngseasy-snpeff:${VERSION} .

8. Set up NGSeasy Project Working Directories

Running the script ngseasy_initiate_project ensures that all relevant directories are set up, and also enforces a clean structure to the NGS project.

Within this we make a fastq_raw folder, where we temporarily store all the raw fastq files for each project. This folder acts as an initial staging area for the raw fastq files. During the project set up, we copy/move project/sample related fastq files to their own specific directories. Fastq files must be gzipped and have the suffix _1.fq.gz or _2.fq.gz.
Future versions will allow any format.
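Because the suffix is enforced, it is worth checking file names in the staging area before starting a run. The sketch below assumes paired-end data, so every *_1.fq.gz is expected to have a *_2.fq.gz mate; `check_fastq_pairs` is a hypothetical helper, not part of NGSeasy.

```shell
#!/bin/sh
# Sketch: check that every read-1 file in a directory is named *_1.fq.gz and
# has a matching *_2.fq.gz mate (paired-end data is ASSUMED).
# check_fastq_pairs is a hypothetical helper.
check_fastq_pairs() {
    dir=$1
    status=0
    for r1 in "$dir"/*_1.fq.gz; do
        [ -e "$r1" ] || { echo "no *_1.fq.gz files in $dir" >&2; return 1; }
        r2="${r1%_1.fq.gz}_2.fq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
# check_fastq_pairs /media/ngs_projects/fastq_raw
```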

Running ngseasy_initiate_project with the relevant configuration file will set up the following directory structure for every project and sample within a project:-

NGS Project Directory

.
ngs_projects  
|  
|__fastq_raw  
|__config_files  
|__reference_genomes_b37  
|__gatk_resources  
|__ngseasy
|
|__ project_id  
	|  
	|__run_logs  
	|__config_files  
	|__project_vcfs  
	|__project_bams  
	|__project_reports  
	|
	|__sample_id_1  
	|	|  
	|	|__fastq  
	|	|__tmp  
	|	|__alignments  
	|	|__vcf  
	|	|__reports  
	|	|__config_files  
	|
	|
	|__sample_id_n  
		|  
		|__fastq  
		|__tmp  
		|__alignments  
		|__vcf  
		|__reports  
		|__config_files  

Running ngseasy_initiate_project

ngseasy_initiate_project -c config.file.tsv -d /media/ngs_projects
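To make the resulting layout concrete, here is an illustration of how such a structure could be derived from the config file. It is not the real ngseasy_initiate_project script: `make_sample_dirs` is a hypothetical helper, and it assumes a header-less TSV with the project ID in column 1 and the sample ID in column 2.

```shell
#!/bin/sh
# Illustration only: NOT the real ngseasy_initiate_project script.
# Assumes a header-less TSV with the project ID in column 1 and the
# sample ID in column 2.
make_sample_dirs() {
    config_tsv=$1
    top=$2
    while IFS="$(printf '\t')" read -r project sample _rest; do
        [ -n "$project" ] && [ -n "$sample" ] || continue
        # project-level folders
        for d in run_logs config_files project_vcfs project_bams project_reports; do
            mkdir -p "${top}/${project}/${d}" || return 1
        done
        # sample-level folders
        for d in fastq tmp alignments vcf reports config_files; do
            mkdir -p "${top}/${project}/${sample}/${d}" || return 1
        done
    done < "$config_tsv"
}

# Usage:
# make_sample_dirs config.file.tsv /media/ngs_projects
```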

9. NGSeasy Project configuration file

In Excel, make the config file and save it as a [TAB] delimited file with a .tsv extension.
See Example provided and GoogleDoc. Remove the header from this file before running the pipeline. This sets up information related to: Project Name, Sample Name, Library Type, Pipeline to call, NCPU.

The [config.file.tsv] should contain the following 16 columns for each sample to be run through a pipeline:-

| Variable | Type | Description | Options/Examples |
|---|---|---|---|
| PROJECT_ID | string | Project ID | Cancer |
| SAMPLE_ID | string | Sample ID | T100 |
| FASTQ1 | string | Raw fastq file name read 1 | foo_1.fq.gz |
| FASTQ2 | string | Raw fastq file name read 2 | foo_2.fq.gz |
| PROJECT_DIR | string | Project Directory | /media/ngs_projects |
| DNA_PREP_LIBRARY_ID | string | DNA Library Prep ID | Custom_Cancer |
| NGS_PLATFORM | string | Platform Name | ILLUMINA |
| NGS_TYPE | string | Experiment type | WGS/WEX/TGS |
| BED_ANNO | string | Annotation Bed File | exons_b37.bed |
| PIPELINE | string | NGSeasy Pipeline Script | ngs_full_gatk/ngs_full_no_gatk |
| ALIGNER | string | Aligner | bwa/bowtie/stampy/novoalign |
| VARCALLER | string | Variant Caller | ensemble/freebayes/platypus/UnifiedGenotyper/HaplotypeCaller |
| GTMODEGATK | string | GATK Variant Caller Mode | EMIT_ALL_CONFIDENT_SITES/EMIT_VARIANTS_ONLY |
| CLEANUP | string | Clean Up Files (TRUE/FALSE) | TRUE/FALSE |
| NCPU | number | Number of cores to call | 1..n |
| VERSION | number | NGSeasy Version | v0.9/v1.0 |

In the config file we set PIPELINE to call the pipeline [ngs_full_gatk] or [ngs_full_no_gatk].

Coming soon: options to add a user email and to specify non-GATK runs.


10. Copy Project Fastq files to relevant Project/Sample Directories

ngseasy_initiate_fastq -c config.file.tsv -d /media/ngs_projects

11. Start the NGSeasy Volume Container

In the Docker container the project directory is mounted in /home/pipeman/ngs_projects

ngseasy_volumes_container -d /media/ngs_projects

This is the command that ngseasy_volumes_container calls internally. Note the directory names and mounts.

# host_vol_dir = ngs_projects

  docker run \
  -d \
  -P \
  -v ${host_vol_dir}/fastq_raw:/home/pipeman/fastq_raw \
  -v ${host_vol_dir}/reference_genomes_b37:/home/pipeman/reference_genomes_b37 \
  -v ${host_vol_dir}/gatk_resources:/home/pipeman/gatk_resources \
  -v ${host_vol_dir}:/home/pipeman/ngs_projects \
  -v ${host_vol_dir}/ngseasy/ngs/bin:/home/pipeman/ngseasy_scripts \
  --name volumes_container \
  -t compbio/ngseasy-base:wheezy

12. Running an NGSeasy full pipeline : from raw fastq to vcf calls

run ngseasy

    ngseasy -c config.file.tsv -d /media/ngs_projects
    

The pipeline is defined in the config file as [ngs_full_gatk]


The NGSeasy Pipelines

| Pipeline | Short Description |
|---|---|
| ngs_full_gatk | fastq to recalibrated bam to vcf using GATK |
| ngs_full_no_gatk | fastq to recalibrated bam to vcf |

The GATK version includes indel realignment and base recalibration.

Non-academic/commercial groups need to pay for GATK.

Currently the ngs_full_gatk pipeline is the most developed module.

The ngs_full_no_gatk pipeline provides alternatives to processing with GATK. Here BamUtil:recab is used to recalibrate base quality scores and freebayes/platypus are the variant callers of choice.

ngs_full_gatk

Each pipeline is a bash wrapper that calls a number of functions/steps set out in The Full NGSeasy pipeline.

Here [ngs_full_gatk] is a wrapper/function for calling an NGS pipeline. The contents of this script are set out below:-

#!/bin/bash -x

#usage printing func
usage()
{
cat << EOF
  This script calls the NGSeasy pipeline ngs_full_gatk

  ARGUMENTS:
  -h      Flag: Show this help message
  -c      NGSeasy project and run configuration file
  -d      NGSeasy project directory

  EXAMPLE USAGE:
    
    ngseasy -c config.file.tsv -d project_directory

EOF
}

#get options for command line args
  while  getopts "hc:d:" opt
  do

      case ${opt} in
	  h)
	  usage #print help
	  exit 0
	  ;;
	  
	  c)
	  config_tsv=${OPTARG}
	  ;;

	  d)
	  project_directory=${OPTARG}
	  ;; 
      esac
  done

#check config file exists.
if [ ! -e "${config_tsv}" ] 
then
	    echo "ERROR :  ${config_tsv} does not exist "
	    usage;
	    exit 1;
fi

#check exists.
  if [ ! -d "${project_directory}" ] 
  then
	  echo " ERROR : ${project_directory} does not exist "
	  usage;
	  exit 1;
  fi

##################  
# start pipeline #
##################

# Each of these functions will call the required image/container(s) and run a part of the NGS pipeline. Each step is usually 
# dependent on the previous step(s) - in that they require certain data/input/output in the correct format 
# and with the correct naming conventions enforced by our pipeline to exist, before executing.

ngseasy_fastqc  -c ${config_tsv} -d ${project_directory}

ngseasy_trimmomatic -c ${config_tsv} -d ${project_directory}

ngseasy_alignment -c ${config_tsv} -d ${project_directory}

ngseasy_addreadgroup -c ${config_tsv} -d ${project_directory}

ngseasy_markduplicates -c ${config_tsv} -d ${project_directory}

ngseasy_indel_realn -c ${config_tsv} -d ${project_directory}

ngseasy_base_recal -c ${config_tsv} -d ${project_directory}

ngseasy_filter_recalbam -c ${config_tsv} -d ${project_directory}

ngseasy_alignment_qc -c ${config_tsv} -d ${project_directory}
 
ngseasy_variant_calling -c ${config_tsv} -d ${project_directory}

# coming soon...
# ngseasy_filter_bam -c ${config_tsv} -d ${project_directory}
# ngseasy_cnv_calling -c ${config_tsv} -d ${project_directory}
# ngseasy_variant_filtering -c ${config_tsv} -d ${project_directory}
# ngseasy_variant_annotation -c ${config_tsv} -d ${project_directory}
# ngseasy_report -c ${config_tsv} -d ${project_directory}
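As written, the wrapper moves on to the next step whether or not the previous one succeeded, even though each step depends on earlier output. One option is to stop at the first failure. The sketch below shows the idea with stand-in commands so it can be tried without Docker; `run_step` and `run_pipeline` are hypothetical names and are not part of the NGSeasy scripts.

```shell
#!/bin/sh
# Sketch: stop a chain of pipeline steps at the first failure.
# run_step is a hypothetical helper; it is not part of the NGSeasy scripts.
run_step() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] starting: $*"
    if ! "$@"; then
        echo "ERROR: step failed: $*" >&2
        return 1
    fi
}

run_pipeline() {
    # Stand-in commands; in a real wrapper these would be calls such as
    #   run_step ngseasy_fastqc -c "${config_tsv}" -d "${project_directory}"
    run_step true  || return 1
    run_step false || return 1      # fails, so the next line never runs
    run_step echo "not reached"
}

run_pipeline || echo "pipeline stopped early"
```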

Output suffixes

Alignment Output

*.raw.sam (WEX ~ 8GB)
*.raw.bam
*.raw.bai
*.sort.bam (WEX ~ 3GB)
*.sort.bai


Addreadgroup

*.addrg.bam
*.addrg.bai
*.addrg.bam.bai


Dupemark

*.dupemk.bam
*.dupemk.bai
*.dupemk.bam.bai


Indel realign

*.realn.bam
*.realn.bai
*.realn.bam.bai


Base recal

*.recal.bam (WEX ~ 4.4G)
*.recal.bai
*.recal.bam.bai
*.realn.bam.BaseRecalibrator.table
*.recal.bam.BaseRecalibrator.table
*.recal.bam.BaseRecalibrator.BQSR.csv


Thresholds for Variant calling etc

For Freebayes and Platypus tools:-

  • We set min coverage to 10
  • Min mapping quality to 20
  • Min base quality to 20

For GATK HaplotypeCaller (and UnifiedGenotyper)

-stand_call_conf 30 -stand_emit_conf 10 -dcov 250 -minPruning 10

Note: minPruning 10 was added as many runs of HaplotypeCaller failed when using BAMs that were aligned with non-bwa aligners and cleaned following GATK best practices. This fix sorted all problems out, and you really don't want dodgy variant calls...do you? The same goes for the thresholds hard coded for use with Freebayes and Platypus.
These settings all work well in our hands. Feel free to edit the scripts to suit your needs.




Gotchas

/bin/bash -c

  • Add /bin/bash -c ${COMMAND} when the software requires a > redirect to some output file.

Example below for samtools converting bwa SAM output to BAM:-

  sudo docker run \
  -P \
  --name sam2bam_${SAMPLE_ID} \
  --volumes-from volumes_container \
  -t compbio/ngseasy-samtools:v0.9 /bin/bash -c \
  "/usr/local/pipeline/samtools/samtools view -bhS ${SOUTDocker}/alignments/${BAM_PREFIX}.raw.bwa.sam > ${SOUTDocker}/alignments/${BAM_PREFIX}.raw.bwa.bam"

Running this without /bin/bash -c breaks, because the > redirect is then handled outside of the container.
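The reason is that the calling shell interprets an unquoted > before the command is ever handed to Docker. The Docker-free analogy below shows the same effect: `run_in_dir` is a hypothetical stand-in for docker run that executes a command inside another directory, which plays the role of the container.

```shell
#!/bin/sh
# Docker-free analogy: run_in_dir plays the role of "docker run", executing a
# command inside another directory (the "container"). Hypothetical helper.
run_in_dir() {
    target=$1
    shift
    ( cd "$target" && "$@" )
}

host=$(mktemp -d)
box=$(mktemp -d)
cd "$host" || exit 1

# The calling shell handles ">" first, so the file is created on the "host".
run_in_dir "$box" echo hello > unquoted.txt

# Quoted and passed to /bin/sh -c (or /bin/bash -c), ">" is handled inside.
run_in_dir "$box" /bin/sh -c 'echo hello > quoted.txt'

ls "$host"    # unquoted.txt
ls "$box"     # quoted.txt

cd / && rm -rf "$host" "$box"
```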

The Annoying thing about GATK!

If an index file is older than its input file, the first GATK call deletes and rebuilds it. This will break your runs if multiple calls try to access the file while the first call is deleting it!

WARN  11:05:27,577 RMDTrackBuilder - Index file /home/pipeman/gatk_resources/Mills_and_1000G_gold_standard.indels.b37.vcf.idx is out of date (index older than input file), deleting and updating the index file 
INFO  11:05:31,699 RMDTrackBuilder - Writing Tribble index to disk for file /home/pipeman/gatk_resources/Mills_and_1000G_gold_standard.indels.b37.vcf.idx 
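One possible workaround, offered here as an assumption rather than something the NGSeasy scripts do, is to refresh the timestamps of the existing index files once before launching any parallel jobs, so that they are no longer older than their VCFs. This is only sensible when the VCF contents have not changed since they were indexed, for example when timestamps were disturbed by copying or extracting the bundle. `refresh_idx_timestamps` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: make every existing *.vcf.idx at least as new as its VCF so the
# "index older than input file" check no longer triggers.
# Only sensible when the VCFs have not changed since they were indexed.
refresh_idx_timestamps() {
    dir=$1
    for idx in "$dir"/*.vcf.idx; do
        [ -e "$idx" ] || continue    # no index files present
        touch "$idx"
    done
}

# Usage, once, before starting any parallel GATK jobs:
# refresh_idx_timestamps /media/ngs_projects/gatk_resources
```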

CNV tools to think about

EXCAVATOR: detecting copy number variants from whole-exome sequencing data @ http://genomebiology.com/2013/14/10/R120

We developed a novel software tool, EXCAVATOR, for the detection of copy number variants (CNVs) from whole-exome sequencing data. EXCAVATOR combines a three-step normalization procedure with a novel heterogeneous hidden Markov model algorithm and a calling method that classifies genomic regions into five copy number states. We validate EXCAVATOR on three datasets and compare the results with three other methods. These analyses show that EXCAVATOR outperforms the other methods and is therefore a valuable tool for the investigation of CNVs in large-scale projects, as well as in clinical research and diagnostics. EXCAVATOR is freely available at http://sourceforge.net/projects/excavatortool/.


Useful Links