Skip to content

Scalable NGS pipeline implemented in Spark environment

Notifications You must be signed in to change notification settings

zhouyangyu/scalable-NGS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Scalable NGS pipeline on Dataproc

This porject aimed to solve the problem of too much data in NGS pipelines, especially WGS, WES. In summary, this project implements analysis tools that are built on Apache Spark to align, process, and analyze large NGS data. It inherit the benefits of Spark, such as scalable, elastic, and highly-available.

  • Implemented in Google Cloud Dataproc environment
  • Uses SparkBWA, GATK 4 beta, Cromwell Execution Engine
  • Speeds up end-to-end data processing by at least 3x (using a cluster of 5 worker nodes)

It also:

  • Imports and stores files on Google storage
  • Uses all common analysis tools with excellent documentation and support communities.
  • needs your improvements!

Installation

You need:

  • a Google Cloud account (it comes with $300 credit) or AWS
  • set up a Hadoop cluster using YARN, with Apache Spark 2.2 installed
  • initate all cluster nodes with cluster_init.sh (installs Maven, SparkBWA, GATK)

Todos

  • Implement workflow in WDL and Cromwell
  • ADAM
  • Better documentation

About

Scalable NGS pipeline implemented in Spark environment

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages