Recipe: DNA seq with Halvade on a local Hadoop cluster

Dries Decap edited this page May 26, 2015 · 14 revisions

DNA-seq variant calling with Halvade on a local Hadoop cluster

Step 1: Downloading the required data

Dataset

This example uses the NA12878 dataset, which consists of two gzipped FASTQ files with paired-end reads. Download them with:

$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194147/ERR194147_2.fastq.gz

Halvade

Every Halvade release is available on GitHub. Download and extract the release archive; for version v0.4, run:

$ wget https://github.com/ddcap/halvade/releases/download/v0.4/Halvade_v0.4.tar.gz
$ tar xvf Halvade_v0.4.tar.gz

Reference

The reference is a FASTA file containing the human genome, together with an index and a dictionary file. Download and unzip them with:

$ wget ftp://[email protected]/bundle/2.8/hg19/ucsc.hg19.fasta.gz
$ wget ftp://[email protected]/bundle/2.8/hg19/ucsc.hg19.fasta.fai.gz
$ wget ftp://[email protected]/bundle/2.8/hg19/ucsc.hg19.dict.gz
$ gunzip ucsc.hg19*.gz

For the BQSR step, GATK needs a dbSNP file which is available for download here:

$ wget ftp://[email protected]/bundle/2.8/hg19/dbsnp_138.hg19.vcf.gz
$ wget ftp://[email protected]/bundle/2.8/hg19/dbsnp_138.hg19.vcf.idx.gz
$ gunzip dbsnp_138.hg19*.gz

While the dbSNP database is downloading, you can create the BWA index. Untar the bin.tar.gz file to get the BWA binary, then use it to index the unzipped FASTA file you just downloaded:

$ tar xvf bin.tar.gz
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`pwd`/bin
$ ./bin/bwa index ucsc.hg19.fasta

This creates 5 index files, which are uploaded to HDFS in the last step so that Halvade can use them.
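Before uploading, it can help to confirm that the index is complete. The sketch below assumes that `bwa index` wrote its five output files next to the FASTA file with the suffixes .amb, .ann, .bwt, .pac and .sa, which is the BWA default when no prefix is given. The helper name `missing_bwa_index_files` is hypothetical and not part of Halvade.

```python
from pathlib import Path

# Suffixes that `bwa index` appends to the FASTA file name (assumed default prefix).
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta):
    """Return the expected BWA index files that do not exist yet."""
    fasta = Path(fasta)
    expected = [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    missing = missing_bwa_index_files("ucsc.hg19.fasta")
    if missing:
        print("Index incomplete, missing:", ", ".join(missing))
    else:
        print("All 5 BWA index files are present.")
```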

Step 2: Configuring Hadoop for Halvade

Halvade requires a Hadoop installation. This tutorial uses CDH 5, the Cloudera distribution of Hadoop. You can find a detailed description of how to install CDH 5 on your cluster here. Make sure each node of your cluster has Java 1.7 installed and uses it as the default Java version. On Ubuntu, use these commands:

$ sudo apt-get install openjdk-7-jre
$ sudo update-alternatives --config java

After CDH 5 is installed, it needs to be configured so that Halvade can use all available resources. In Halvade, each task processes a portion of the input data, and the execution time can vary from task to task. The task timeout therefore needs to be set high enough. In mapred-site.xml, set this property to 30 minutes (the value is given in milliseconds):

<property>
  <name>mapreduce.task.timeout</name>
  <value>1800000</value>
</property>
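The value 1800000 is simply 30 minutes expressed in milliseconds, the unit used by mapreduce.task.timeout. A minimal sketch of the conversion, in case a different timeout is wanted (the function name is hypothetical):

```python
def minutes_to_ms(minutes):
    """Convert minutes to the millisecond value used by mapreduce.task.timeout."""
    return minutes * 60 * 1000


if __name__ == "__main__":
    print(minutes_to_ms(30))  # prints 1800000
```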

Next, CDH 5 needs to know how many cores and how much memory are available on each node; this is set in yarn-site.xml. These values determine how many tasks will be started on the cluster. This example uses nodes with 128 GB of memory and a dual-socket CPU setup with 24 cores in total. Because many tools benefit from the hyperthreading capabilities of a CPU, the number of vcores is set to 48:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>131072</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>48</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>131072</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>48</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
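The memory and vcore values above follow directly from the hardware: 128 GB is 131072 MB, and 24 physical cores with two hardware threads each give 48 vcores. The sketch below derives the four hardware-dependent settings for other node types. It assumes, like the example, that all memory is handed to YARN and that each core provides two threads; the function names are hypothetical.

```python
def yarn_node_resources(memory_gb, physical_cores, threads_per_core=2):
    """Derive the per-node YARN limits from the hardware description."""
    memory_mb = memory_gb * 1024                # YARN expects megabytes
    vcores = physical_cores * threads_per_core  # count hyperthreads as vcores
    return {
        "yarn.nodemanager.resource.memory-mb": memory_mb,
        "yarn.nodemanager.resource.cpu-vcores": vcores,
        "yarn.scheduler.maximum-allocation-mb": memory_mb,
        "yarn.scheduler.maximum-allocation-vcores": vcores,
    }


def as_property_xml(name, value):
    """Format one setting as a Hadoop <property> element."""
    return (
        "<property>\n"
        f"  <name>{name}</name>\n"
        f"  <value>{value}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    for name, value in yarn_node_resources(128, 24).items():
        print(as_property_xml(name, value))
```

The minimum-allocation properties (512 MB and 1 vcore) do not depend on the hardware and are left out of the calculation.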

After this the configuration needs to be pushed to all nodes:

$ scp *-site.xml myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster/
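With many nodes, the scp command has to be repeated once per host. The sketch below only builds the commands and does not run them. It assumes the host naming pattern from the command above and that nodes are numbered from 1 to the node count; both are placeholders to adapt, and the function name is hypothetical.

```python
def scp_commands(config_files, node_count,
                 user="myuser",
                 host_pattern="myCDHnode-{n}.mycompany.com",
                 remote_dir="/etc/hadoop/conf.my_cluster/"):
    """Build one scp argument list per node (nodes assumed to be numbered 1..node_count)."""
    commands = []
    for n in range(1, node_count + 1):
        target = f"{user}@{host_pattern.format(n=n)}:{remote_dir}"
        commands.append(["scp", *config_files, target])
    return commands


if __name__ == "__main__":
    files = ["core-site.xml", "mapred-site.xml", "yarn-site.xml"]
    for command in scp_commands(files, node_count=3):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` to perform the copy.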

The MapReduce services then need to be restarted:

On the Resource Manager:

$ sudo service hadoop-yarn-resourcemanager restart

On each NodeManager:

$ sudo service hadoop-yarn-nodemanager restart

On the JobHistory server:

$ sudo service hadoop-mapreduce-historyserver restart

Intel’s Hadoop Adapter for Lustre

When Lustre is used as the filesystem instead of HDFS, Intel's adapter for Lustre will increase the performance of Halvade. To enable the adapter, you need to change some settings in your Hadoop installation. In core-site.xml, point to the location of Lustre and set the Lustre FileSystem class. If Lustre is mounted on /mnt/lustre/, add these properties to the file:

<property>
    <name>fs.defaultFS</name>
    <value>lustre:///</value>
</property>
<property>
    <name>fs.lustre.impl</name>
    <value>org.apache.hadoop.fs.LustreFileSystem</value>
</property>
<property>
    <name>fs.AbstractFileSystem.lustre.impl</name>
    <value>org.apache.hadoop.fs.LustreFileSystem$LustreFs</value>
</property>
<property>
    <name>fs.root.dir</name>
    <value>/mnt/lustre/hadoop</value>
</property>

Additionally, you need to set the Shuffle class in mapred-site.xml:

<property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.SharedFsPlugins$MapOutputBuffer</value>
</property>
<property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapred.SharedFsPlugins$Shuffle</value>
</property>

After adding these settings, push the configuration files to all nodes again and restart all services, as described above. In addition, the jar containing Intel's Adapter for Lustre must be available on all nodes and added to the Hadoop classpath. To do this, find the directories that are currently in your Hadoop classpath and add the jar to one of them on every node. To list the directories, run:

$ hadoop classpath
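The command prints a single colon-separated line, often with wildcard entries such as /some/dir/*, which is hard to read. The sketch below splits such a line into its individual entries. The sample string is made up for illustration and is not the output of a real cluster; the function name is hypothetical.

```python
def classpath_entries(classpath):
    """Split `hadoop classpath` output into unique entries, dropping trailing wildcards."""
    entries = []
    for entry in classpath.strip().split(":"):
        if not entry:
            continue
        if entry.endswith("*"):
            entry = entry[:-1]            # "/dir/*" means all jars in /dir
        entry = entry.rstrip("/") or "/"  # normalise the trailing slash
        if entry not in entries:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = "/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-yarn/lib/*\n"
    for entry in classpath_entries(sample):
        print(entry)
```

On a cluster node, the real output can be captured with `subprocess.run(["hadoop", "classpath"], capture_output=True, text=True).stdout`.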

Step 3: Running Halvade

Prepare the data

First, create the directories where the reference, dbSNP and input files will be stored (change the username accordingly):

$ hdfs dfs -mkdir -p /user/<username>/halvade/ref/dbsnp/
$ hdfs dfs -mkdir /user/<username>/halvade/in/

Next the reference, dbSNP and bin.tar.gz files need to be copied to HDFS:

$ hdfs dfs -put ucsc.hg19.* /user/<username>/halvade/ref/
$ hdfs dfs -put dbsnp_138.hg19.vcf* /user/<username>/halvade/ref/dbsnp/
$ hdfs dfs -put bin.tar.gz /user/<username>/halvade/

The input data needs to be preprocessed and stored on HDFS. This is done with the HalvadeUploaderWithLibs.jar file:

$ hadoop jar HalvadeUploaderWithLibs.jar -1 ERR194147_1.fastq.gz -2 ERR194147_2.fastq.gz -O /user/<username>/halvade/in/ -t 2

If a different distributed filesystem such as Lustre or GPFS is used instead of HDFS, you can prevent Halvade from downloading the reference to local scratch by adding a few empty files.

For the BWA and GATK references, create these 2 files, named after the fasta file, e.g. for ucsc.hg19.fasta:

$ touch ucsc.hg19.bwa_ref
$ touch ucsc.hg19.gatk_ref

For dbSNP, a similar file can be created in the directory of the database, e.g. /lustre/<username>/halvade/ref/dbsnp/:

$ touch /lustre/<username>/halvade/ref/dbsnp/.dbsnp

The same applies if you distributed the reference to each node before executing the job: make sure these files are present on every node so Halvade can find the local files.
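The three touch commands can be sketched as a small helper that derives the file names from the fasta path. It assumes the fasta file name ends in a single extension such as .fasta, which is replaced by .bwa_ref and .gatk_ref, and that both directories already exist. The function names are hypothetical.

```python
from pathlib import Path


def reference_marker_files(fasta, dbsnp_dir):
    """Return the three empty files described above, derived from the given paths."""
    fasta = Path(fasta)
    return [
        fasta.with_suffix(".bwa_ref"),   # ucsc.hg19.fasta -> ucsc.hg19.bwa_ref
        fasta.with_suffix(".gatk_ref"),  # ucsc.hg19.fasta -> ucsc.hg19.gatk_ref
        Path(dbsnp_dir) / ".dbsnp",
    ]


def create_marker_files(fasta, dbsnp_dir):
    """Create the files, like `touch`; existing files are left untouched."""
    created = []
    for marker in reference_marker_files(fasta, dbsnp_dir):
        marker.touch(exist_ok=True)
        created.append(marker)
    return created


if __name__ == "__main__":
    for marker in reference_marker_files("ucsc.hg19.fasta", "dbsnp"):
        print(marker)
```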

Configure Halvade

Now Halvade needs to be configured. In halvade.config, set these values, assuming a 16-node cluster where each node has 128 GB of memory and 24 cores (change accordingly):

nodes=16
mem=128
vcores=24
B="/user/<username>/halvade/bin.tar.gz"
D="/user/<username>/halvade/ref/dbsnp/dbsnp_138.hg19.vcf"
R="/user/<username>/halvade/ref/ucsc.hg19"

In the halvade_run.config file, set the input and output directories and enable the hyperthreading option by adding the following lines:

I="/user/<username>/halvade/in/"
O="/user/<username>/halvade/out/"
smt
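Since both files repeat the same /user/<username>/halvade prefix, the lines can be generated from a handful of values. The sketch below only reproduces the lines shown above with the username and cluster size filled in; it is not part of Halvade, and the function names are hypothetical.

```python
def halvade_config_lines(username, nodes, mem_gb, vcores):
    """Lines for halvade.config, using the HDFS layout from this recipe."""
    base = f"/user/{username}/halvade"
    return [
        f"nodes={nodes}",
        f"mem={mem_gb}",
        f"vcores={vcores}",
        f'B="{base}/bin.tar.gz"',
        f'D="{base}/ref/dbsnp/dbsnp_138.hg19.vcf"',
        f'R="{base}/ref/ucsc.hg19"',
    ]


def halvade_run_config_lines(username, smt=True):
    """Lines for halvade_run.config: input, output and the optional smt flag."""
    base = f"/user/{username}/halvade"
    lines = [f'I="{base}/in/"', f'O="{base}/out/"']
    if smt:
        lines.append("smt")
    return lines


if __name__ == "__main__":
    # "alice" is a placeholder username.
    print("\n".join(halvade_config_lines("alice", nodes=16, mem_gb=128, vcores=24)))
    print("\n".join(halvade_run_config_lines("alice")))
```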

If your reference is distributed to each node, or accessible by each node through a different distributed filesystem such as Lustre or GPFS, add this line to the halvade_run.config file:

refdir="/lustre/<username>/halvade/ref/"

Here, /lustre/<username>/halvade/ref/ contains the BWA reference, the fasta reference and the dbSNP file.

Using Amazon AWS

When using Amazon Web Services (AWS), you need to have the AWS Command Line Interface installed, see here. All references and Halvade files should be uploaded to your Amazon S3 bucket, either through AWS itself or with an S3 tool such as s3cmd. Additionally, add these EMR configurations to the halvade.config file. This example assumes that your bucket is called examplebucket and that the files are stored in the halvade directory:

emr_jar="s3://examplebucket/halvade/HalvadeWithLibs.jar"
emr_script="s3://examplebucket/halvade/halvade_bootstrap.sh"
emr_type="c3.8xlarge"
emr_ami_v="3.1.0"
tmp="/mnt/halvade/"

Note that all directories in the configuration files should be adapted to the locations on your S3 bucket, and the hardware settings to the node type you want to use. The folder for intermediate data is set to /mnt/halvade/, as this is created by the halvade_bootstrap.sh script.

Run Halvade

Everything is now set up, and Halvade can be run with this command:

$ ./runHalvade.py

When Halvade is finished, a VCF file called /user/<username>/halvade/out/merge/HalvadeCombined.vcf is created, which contains all called variants.
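As a quick sanity check, the number of called variants can be counted once the file has been copied from HDFS to local disk, for example with hdfs dfs -get. The sketch below relies only on the VCF convention that header lines start with # and every other non-empty line is one record. The sample lines are made up for illustration, and the function name is hypothetical.

```python
def count_variants(vcf_lines):
    """Count VCF records: every non-empty line that is not a '#' header line."""
    return sum(1 for line in vcf_lines if line.strip() and not line.startswith("#"))


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    ]
    print(count_variants(sample))  # prints 1
```

For a real file, pass the open file handle instead of a list: `with open("HalvadeCombined.vcf") as handle: print(count_variants(handle))`.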