PacBio Human Assembly Pipeline

Workflow for running de novo assembly using human PacBio whole genome sequencing (WGS) data. Written using Workflow Description Language (WDL).

Docker images used by these workflows are defined in the wdl-dockerfiles repository.
Common tasks that may be reused within or between workflows are defined in the wdl-common submodule.

Workflow

Workflow entrypoint: workflows/main.wdl

The assembly workflow performs de novo assembly on samples and trios.

Setup

Clone a tagged version of the git repository. Use the --branch flag to pull the desired version, and the --recursive flag to pull code from any submodules.

# --depth 1 --branch v1.0.2: shallow clone of a tagged release, for reproducibility
# --recursive: also clone the submodules
git clone \
  --depth 1 --branch v1.0.2 \
  --recursive \
  https://github.com/PacificBiosciences/HiFi-human-assembly-WDL.git

Resource requirements

The workflow requires at minimum 48 cores and 288 GB of RAM. Ensure that the backend environment you're using has enough quota to run the workflow.
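
As a quick pre-flight check, the stated minimums can be compared against what a machine reports. The following is a minimal sketch, not part of the workflow: the function name `meets_minimum` is hypothetical, and the usage comments assume a Linux host with `nproc` and `/proc/meminfo`. On cloud or scheduler-based backends, the relevant figure is the quota or node size available to tasks rather than the machine you submit from.

```shell
# Hypothetical helper: succeed (exit status 0) if the given core count and
# memory in GB meet the minimums stated above (48 cores, 288 GB of RAM).
meets_minimum() {
  cores="$1"
  mem_gb="$2"
  [ "$cores" -ge 48 ] && [ "$mem_gb" -ge 288 ]
}

# Example usage on a Linux host (MemTotal is reported in KiB; converted to
# decimal GB here, which is an assumption about how "GB" is meant):
#   cores=$(nproc)
#   mem_gb=$(awk '/^MemTotal:/ { printf "%d", $2 * 1024 / 1000000000 }' /proc/meminfo)
#   meets_minimum "$cores" "$mem_gb" && echo "OK" || echo "Insufficient resources"
```

Because the kernel reserves some memory, `MemTotal` is usually slightly below the installed amount, so a result just under the threshold deserves a closer look before ruling a machine out.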

Reference datasets and associated workflow files

Reference datasets are hosted publicly for use in the pipeline. For data locations, see the backend-specific documentation and template inputs files for each backend with paths to publicly hosted reference files filled out.

Running the workflow

1. Select a backend environment
2. Configure a workflow execution engine in the chosen environment
3. Fill out the inputs JSON file for your cohort
4. Run the workflow

Selecting a backend

The workflow can be run on Azure, AWS, GCP, or HPC. Your choice of backend will largely be determined by the location of your data.

For backend-specific configuration, see the documentation for the relevant backend in the backends directory.

Configuring a workflow engine and container runtime

An execution engine is required to run workflows. Two popular engines for running WDL-based workflows are miniwdl and Cromwell.

Because workflow dependencies are containerized, a container runtime is required. This workflow has been tested with Docker and Singularity container runtimes.

See backend-specific documentation for details on setting up an engine.

Engine	Azure	AWS	GCP	HPC
miniwdl	Unsupported	Supported via the Amazon Genomics CLI	Unsupported	(SLURM only) Supported via the `miniwdl-slurm` plugin
Cromwell	Supported via Cromwell on Azure	Supported via the Amazon Genomics CLI	Supported via Google's Pipelines API	Supported - Configuration varies depending on HPC infrastructure

Filling out the inputs JSON

The input to a workflow run is defined in JSON format. Template input files with reference dataset information filled out are available for each backend at backends/${backend}/inputs.${backend}.json.

Using the appropriate inputs template file, fill in the cohort and sample information (see Workflow Inputs for more information on the input structure).

If using an HPC backend, you will need to download the reference bundle and replace the <local_path_prefix> in the input template file with the local path to the reference datasets on your HPC.
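
The placeholder substitution described above can be sketched with `sed`. This is an illustration under stated assumptions: the function name `fill_local_path_prefix` is hypothetical, the file names and the reference path in the usage comment are placeholders, and the prefix is assumed to contain no `|`, `&`, or backslash characters, which are special in this `sed` expression.

```shell
# Hypothetical helper: print TEMPLATE to stdout with every occurrence of
# <local_path_prefix> replaced by PREFIX.
# Assumes PREFIX contains no '|', '&', or backslash characters.
fill_local_path_prefix() {
  template="$1"
  prefix="$2"
  sed "s|<local_path_prefix>|${prefix}|g" "$template"
}

# Example usage (input and output file names and the path are placeholders):
#   fill_local_path_prefix inputs.template.json /path/to/reference > inputs.json
```

Writing to a new file rather than editing in place keeps the original template intact, so it can be reused for other cohorts.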

Running the workflow

Run the workflow using the engine and backend that you have configured (miniwdl, Cromwell).

Note that the calls to miniwdl and Cromwell assume you are accessing the engine directly on the machine on which it has been deployed. Depending on the backend you have configured, you may be able to submit workflows using different methods (e.g. using trigger files in Azure, or using the Amazon Genomics CLI in AWS).

Run directly using miniwdl

miniwdl run workflows/main.wdl -i <input_file_path.json>

Run directly using Cromwell

java -jar <cromwell_jar_path> run workflows/main.wdl -i <input_file_path.json>

If Cromwell is running in server mode, the workflow can be submitted using cURL. Fill in the values of CROMWELL_URL and INPUTS_JSON below, then from the root of the repository, run:

# The base URL (and port, if applicable) of your Cromwell server
CROMWELL_URL=
# The path to your inputs JSON file
INPUTS_JSON=

(cd workflows && zip -r dependencies.zip assembly_structs.wdl assemble_genome/ de_novo_assembly_sample/ de_novo_assembly_trio/ wdl-common/)
curl -X "POST" \
  "${CROMWELL_URL}/api/workflows/v1" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "workflowSource=@workflows/main.wdl" \
  -F "workflowInputs=@${INPUTS_JSON};type=application/json" \
  -F "workflowDependencies=@workflows/dependencies.zip;type=application/zip"

To specify workflow options, add the following to the request (assuming your options file is a file called options.json located in the pwd): -F "workflowOptions=@options.json;type=application/json".

Workflow inputs

This section describes the inputs required for a run of the workflow. Typically, only the de_novo_assembly.cohort and potentially run/backend-specific sections will be filled out by the user for each run of the workflow. Input templates with reference file locations filled out are provided for each backend.

Cohort

A cohort can include one or more samples. Samples need not be related.

Type	Name	Description	Notes
String	cohort_id	A unique name for the cohort; used to name outputs. Alphanumeric characters, underscore (`_`), and dash (`-`) are allowed.
Array[Sample]	samples	The set of samples for the cohort. At least one sample must be defined.
Boolean	run_de_novo_assembly_trio	Run trio binned de novo assembly.	Cohort must contain at least one valid trio (child and both parents present in the cohort)
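
The character rule for `cohort_id` (and likewise for `sample_id`, `father_id`, and `mother_id`) can be checked before submitting a run. The sketch below is an illustration only; the function name `is_valid_id` is hypothetical and the workflow itself may enforce the rule differently. It also rejects empty strings, which is an added assumption.

```shell
# Hypothetical helper: succeed if ID is non-empty and contains only
# alphanumeric characters, underscores, and dashes.
is_valid_id() {
  case "$1" in
    '' | *[!A-Za-z0-9_-]*) return 1 ;;  # empty, or contains a disallowed character
    *) return 0 ;;
  esac
}

# Example usage:
#   is_valid_id "my_cohort-01" && echo "valid" || echo "invalid"
```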

Sample

Sample information for each sample in the workflow run.

Type	Name	Description	Notes
String	sample_id	A unique name for the sample; used to name outputs. Alphanumeric characters, underscore (`_`), and dash (`-`) are allowed
Array[IndexData]	movie_bams	The set of unaligned movie BAMs associated with this sample
String?	father_id	Paternal `sample_id`. Alphanumeric characters, underscore (`_`), and dash (`-`) are allowed.
String?	mother_id	Maternal `sample_id`. Alphanumeric characters, underscore (`_`), and dash (`-`) are allowed.
Boolean	run_de_novo_assembly	If true, run single-sample de novo assembly for this sample	[true, false]
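
Whether a cohort contains a valid trio (a child whose `father_id` and `mother_id` both name samples in the cohort) can be checked ahead of time. The sketch below is an illustration under stated assumptions: the function name `find_trios` is hypothetical, and it reads a simple whitespace-separated listing rather than the inputs JSON, with one `sample_id father_id mother_id` triple per line and `-` standing in for an unset parent.

```shell
# Hypothetical helper: read lines of "sample_id father_id mother_id" on stdin
# (use "-" for an unset parent) and print each child whose father and mother
# are both present as samples. Exits non-zero if no complete trio is found.
# Assumes no sample is literally named "-".
find_trios() {
  awk '
    NF >= 3 { n++; order[n] = $1; ids[$1] = 1; father[$1] = $2; mother[$1] = $3 }
    END {
      for (i = 1; i <= n; i++) {
        child = order[i]
        if ((father[child] in ids) && (mother[child] in ids)) {
          print child
          found = 1
        }
      }
      exit (found ? 0 : 1)
    }
  '
}

# Example usage (sample names are placeholders):
#   printf '%s\n' 'child1 father1 mother1' 'father1 - -' 'mother1 - -' | find_trios
```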

ReferenceData

Array of references and their associated names and indices.

These files are hosted publicly in each of the cloud backends; see backends/${backend}/inputs.${backend}.json.

Type	Name	Description	Notes
String	name	Reference name; used to name outputs (e.g., "GRCh38")
IndexData	fasta	Reference genome and associated index

Other inputs

Type	Name	Description	Notes
String	backend	Backend where the workflow will be executed	["Azure", "AWS", "GCP", "HPC"]
String?	zones	Zones where compute will take place; required if backend is set to 'AWS' or 'GCP'.	Determining available zones in AWS; Determining available zones in GCP
String?	aws_spot_queue_arn	Queue ARN for the spot batch queue; required if backend is set to 'AWS' and `preemptible` is set to `true`	Determining the AWS queue ARN
String?	aws_on_demand_queue_arn	Queue ARN for the on demand batch queue; required if backend is set to 'AWS' and `preemptible` is set to `false`	Determining the AWS queue ARN
String?	container_registry	Container registry where workflow images are hosted. If left blank, PacBio's public Quay.io registry will be used.
Boolean	preemptible	If set to `true`, run tasks preemptibly where possible. On-demand VMs will be used only for tasks that run for >24 hours if the backend is set to GCP. If set to `false`, on-demand VMs will be used for every task. Ignored if backend is set to HPC.	[true, false]
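
The conditional requirements in the table above (which optional inputs become required for which backend) can be summarized as a small check. This is a sketch only: the function name `check_backend_inputs` and its positional argument order are hypothetical, and the workflow's own validation may differ.

```shell
# Hypothetical helper: check the conditional requirements listed in the table.
# Arguments: backend zones preemptible aws_spot_queue_arn aws_on_demand_queue_arn
# Pass an empty string for any optional value that is unset.
check_backend_inputs() {
  backend="$1"; zones="$2"; preemptible="$3"; spot_arn="$4"; on_demand_arn="$5"

  case "$backend" in
    Azure | AWS | GCP | HPC) ;;
    *) echo "backend must be one of Azure, AWS, GCP, HPC" >&2; return 1 ;;
  esac

  # zones is required for AWS and GCP
  case "$backend" in
    AWS | GCP)
      [ -n "$zones" ] || { echo "zones is required for $backend" >&2; return 1; }
      ;;
  esac

  # On AWS, the queue ARN that is required depends on preemptible
  if [ "$backend" = "AWS" ]; then
    if [ "$preemptible" = "true" ]; then
      [ -n "$spot_arn" ] || { echo "aws_spot_queue_arn is required" >&2; return 1; }
    else
      [ -n "$on_demand_arn" ] || { echo "aws_on_demand_queue_arn is required" >&2; return 1; }
    fi
  fi
  return 0
}

# Example usage (zone value is a placeholder):
#   check_backend_inputs GCP "us-central1-a" true "" ""
```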

Workflow outputs

De novo assembly - sample

These files will be output if cohort.samples[sample].run_de_novo_assembly is set to true for any sample.

Type	Name	Description
Array[Array[File]?]	zipped_assembly_fastas	De novo dual assembly generated by hifiasm
Array[Array[File]?]	assembly_noseq_gfas	Assembly graphs in GFA format.
Array[Array[File]?]	assembly_lowQ_beds	Coordinates of low quality regions in BED format.
Array[Array[File]?]	assembly_stats	Assembly size and NG50 stats generated by calN50.
Array[Array[IndexData?]]	asm_bam	minimap2 alignment of assembly to reference.
Array[Array[IndexData?]]	paftools_vcf	Variants called by `paftools` from the coordinate-sorted assembly-to-reference alignment. `paftools` calls variants from the cs tag and identifies confident/callable regions as those covered by exactly one contig.
Array[Array[File?]]	paftools_vcf_stats	`bcftools stats` summary statistics for `paftools` variant calls

De novo assembly - trio

These files will be output if cohort.run_de_novo_assembly_trio is set to true and there is at least one parent-parent-kid trio in the cohort.

Type	Name	Description
Array[Array[File]]?	trio_zipped_assembly_fastas	Haplotype-resolved de novo assembly of the trio kid generated by hifiasm with trio binning
Array[Array[File]]?	trio_assembly_noseq_gfas	Assembly graphs in GFA format.
Array[Array[File]]?	trio_assembly_lowQ_beds	Coordinates of low quality regions in BED format.
Array[Array[File]]?	trio_assembly_stats	Assembly size and NG50 stats generated by calN50.
Array[Array[IndexData]?]	trio_asm_bams	minimap2 alignment of assembly to reference.
Array[Map[String, String]]?	haplotype_key	Indication of which haplotype (`hap1`/`hap2`) corresponds to which parent.

Tool versions and Docker images

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio's quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task. Images can be referenced in the following table by looking for the name after the final / character and before the @sha256:.... For example, the image referred to here is "align_hifiasm":

~{runtime_attributes.container_registry}/align_hifiasm@sha256:3968cb<...>b01f80fe
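
The name-extraction rule described above (take the text after the final `/` and before `@sha256:`) can be sketched with POSIX parameter expansion. The function name `image_name` is hypothetical, and the registry and digest in the usage comment are placeholders rather than real values.

```shell
# Hypothetical helper: print the image name from a digest-pinned image reference.
image_name() {
  ref="$1"
  name="${ref##*/}"   # drop everything up to and including the final "/"
  name="${name%%@*}"  # drop "@sha256:..." and anything after it
  printf '%s\n' "$name"
}

# Example usage (registry and digest are placeholders):
#   image_name "registry.example.com/org/align_hifiasm@sha256:0123abcd"
```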

Image	Major tool versions	Links
align_hifiasm	minimap2 2.17; samtools 1.14	Dockerfile
bcftools	bcftools 1.14	Dockerfile
gfatools	gfatools 0.4; htslib 1.14; k8 0.2.5; caln50 01091f2	Dockerfile
hifiasm	hifiasm 0.20.0	Dockerfile
htslib	htslib 1.14	Dockerfile
paftools	paftools 2.26-r1182-dirty	Dockerfile
pyyaml	python 3.8.10; custom scripts	Dockerfile
samtools	samtools 1.14	Dockerfile
yak	yak 0.1	Dockerfile

DISCLAIMER

TO THE GREATEST EXTENT PERMITTED BY APPLICABLE LAW, THIS WEBSITE AND ITS CONTENT, INCLUDING ALL SOFTWARE, SOFTWARE CODE, SITE-RELATED SERVICES, AND DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. ALL WARRANTIES ARE REJECTED AND DISCLAIMED. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THE FOREGOING. PACBIO IS NOT OBLIGATED TO PROVIDE ANY SUPPORT FOR ANY OF THE FOREGOING, AND ANY SUPPORT PACBIO DOES PROVIDE IS SIMILARLY PROVIDED WITHOUT REPRESENTATION OR WARRANTY OF ANY KIND. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A REPRESENTATION OR WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACBIO.
