
Expanded tutorial

Wise, Aaron edited this page Feb 15, 2018 · 6 revisions

Let's look at the initial example in more detail.

  1. Make a proto-workflow file.

    A proto-workflow is a very simple JSON file that lists the pipeline stages you want, in the order you want them. Here's a simple proto-workflow that I use for some RNA-seq applications:

    {
        "stages": ["bcl2fastq", "rsem"]
    }
    

Yeah, that's really it.
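If you would rather generate the proto-workflow from a script, a minimal sketch could look like the following. The helper name write_proto_workflow is hypothetical and not part of ZIPPY; it only writes the same JSON structure shown above.

```python
import json


def write_proto_workflow(stages, path):
    """Write a proto-workflow file that lists the stages in order.

    Hypothetical helper, not part of ZIPPY: it only produces the
    {"stages": [...]} structure shown in the tutorial.
    """
    proto = {"stages": list(stages)}
    with open(path, "w") as handle:
        json.dump(proto, handle, indent=4)
    return proto
```

Calling write_proto_workflow(["bcl2fastq", "rsem"], "my_proto.json") would produce the RNA-seq proto-workflow above.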

  1. Compile the proto-workflow

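    Execute 'python -m zippy.make_params my_proto.json my_params.json'. Here my_proto.json is the proto-workflow from step 1, and my_params.json is the params file that make_params writes.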

  2. Fill in the blanks and/or connect the dots

    1. The output of make_params looks like this:
    {
    "stages": [
        {
            "identifier": "bcl2fastq",
            "output_dir": "",
            "sample_path": "",
            "stage": "bcl2fastq"
        },
        {
            "identifier": "rsem",
            "output_dir": "",
            "previous_stage": "bcl2fastq",
            "stage": "rsem"
        }
    ],
    "sample_sheet": "",
    "bcl2fastq_path": "/home/awise/sngs/dependencies/bin/bcl2fastq2-v2.18.0.6/bin/bcl2fastq",
    "rsem_path": "/home/awise/sngs/dependencies/bin/rsem-1.2.31/bin/rsem-calculate-expression",
    "rsem_annotation": "/home/awise/sngs/dependencies/static_files/rsem/GRCh38",
    "star_path": "/home/awise/sngs/dependencies/bin/STAR_2.5.1b/STAR",
    "genome": "/illumina/development/Isis/Genomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta",
    "scratch_path": ""
    }
    
    1. There are two major sections of a params.json file.

      • The stages section contains all the parameters that must be set per-stage
      • The rest contains all global parameters.
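To see which blanks are still left to fill in, something like the sketch below could list every parameter whose value is an empty string, split along the same two sections. The function find_blank_params is a hypothetical helper, not a ZIPPY API, and it assumes that an unset parameter always appears as "" as in the make_params output above.

```python
def find_blank_params(params):
    """Return (global_blanks, stage_blanks) for a parsed params dict.

    global_blanks: names of top-level parameters whose value is "".
    stage_blanks: (identifier, parameter name) pairs whose value is "".
    Assumption: an unset parameter is represented by an empty string.
    """
    global_blanks = [
        key for key, value in params.items()
        if key != "stages" and value == ""
    ]
    stage_blanks = [
        (stage.get("identifier"), key)
        for stage in params.get("stages", [])
        for key, value in stage.items()
        if value == ""
    ]
    return global_blanks, stage_blanks
```

Load my_params.json with json.load and pass the resulting dict to find_blank_params to get the list of fields to edit by hand.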
    2. Let's talk about some of these parameters in more detail.

      • identifier and stage. Each stage must have a unique name (identifier) and a specification of its type (stage). make_params will auto-generate identifiers for you, or you can use your own. stage must refer to the canonical stage name. See the list of supported stages on the wiki.
      • previous_stage. This parameter tells the stage which previous stage to ask for its input. The previous stage is specified by its identifier. ZIPPY will attempt to infer the proper previous stage, but it will often be wrong, so always check this parameter.
      • sample_sheet. The pipeline requires an Illumina-style sequencing sample sheet. It uses this sample sheet primarily for determining the sample ids and names, but it may also be used to determine read lengths.
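Because the inferred previous_stage values are often wrong, a quick sanity check before running can help. The sketch below is a hypothetical helper (check_wiring is not part of ZIPPY) that flags duplicate identifiers and previous_stage values that do not name any stage in the file. It assumes previous_stage is a single identifier string, as in the examples on this page.

```python
from collections import Counter


def check_wiring(params):
    """Return a list of human-readable problems in the stages section.

    Hypothetical check, not ZIPPY's own validation. Assumption:
    previous_stage, when present, is a single identifier string.
    """
    stages = params.get("stages", [])
    counts = Counter(stage.get("identifier") for stage in stages)
    problems = [
        "duplicate identifier: %s" % name
        for name, count in counts.items()
        if count > 1
    ]
    for stage in stages:
        previous = stage.get("previous_stage")
        if isinstance(previous, str) and previous not in counts:
            problems.append(
                "%s: previous_stage %s is not a known identifier"
                % (stage.get("identifier"), previous)
            )
    return problems
```

An empty list means every identifier is unique and every previous_stage points at an existing stage. It does not tell you whether the wiring is the one you intended.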
    3. Wildcards! Wildcards are essentially variables. Define a wildcard in the wildcards section; wherever its name then appears surrounded by curly braces, ZIPPY substitutes its value automatically. See the example below.
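To make the substitution concrete, here is a rough sketch of how such an expansion could work. It is an illustration only, not ZIPPY's actual implementation: expand_wildcards and expand_params are hypothetical names, and the sketch assumes plain textual replacement of {name} in every string value.

```python
def expand_wildcards(value, wildcards):
    """Recursively replace {name} with its wildcard value in strings."""
    if isinstance(value, str):
        for name, replacement in wildcards.items():
            value = value.replace("{" + name + "}", str(replacement))
        return value
    if isinstance(value, list):
        return [expand_wildcards(item, wildcards) for item in value]
    if isinstance(value, dict):
        return {key: expand_wildcards(item, wildcards) for key, item in value.items()}
    # Booleans, numbers, and None are returned unchanged.
    return value


def expand_params(params):
    """Return a copy of params with wildcards applied and the section removed."""
    wildcards = params.get("wildcards", {})
    return {
        key: expand_wildcards(value, wildcards)
        for key, value in params.items()
        if key != "wildcards"
    }
```

With a wildcards section of {"path": "/path/to/root/dir"}, an output_dir of "{path}/fastq" would expand to "/path/to/root/dir/fastq" under these assumptions.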

    4. And more! Check out the complete params syntax.

  3. Run ZIPPY

    To run ZIPPY, execute 'python -m zippy.zippy my_params.json'

That's it!
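If you want to drive both commands from a script, a sketch along these lines might do. The function names are hypothetical, and it assumes that the python on your PATH is the interpreter in which zippy is installed. Remember that the params file still needs to be edited by hand between the two commands.

```python
import subprocess


def zippy_commands(proto_path, params_path):
    """Build the two command lines from this tutorial as argument lists."""
    make_params = ["python", "-m", "zippy.make_params", proto_path, params_path]
    run_zippy = ["python", "-m", "zippy.zippy", params_path]
    return make_params, run_zippy


def run(command):
    """Run one command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(command, check=True)
```

A typical session would call run on the first command, stop so you can fill in my_params.json, and then call run on the second.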

Here is a more complicated example, to get a feel for what you can do with ZIPPY!

Here is a pipeline.proto that performs peak calling comparing two different markduplicates methods:

{
  "stages":  ["bcl2fastq", "bwa", "markduplicates", "markduplicates", "bwaalignstats", "macs", "macs"]
}

And here is its configured output. Note that, for pipelines of this complexity, the previous_stage inference methods will not be sufficient. You must ensure that the pipeline stages are properly wired up!

{
    "wildcards": {
        "path": "/path/to/root/dir"
    },
    "stages": [
        {
            "identifier": "bcl2fastq",
            "output_dir": "{path}/fastq",
            "stage": "bcl2fastq"
        },
        {
            "identifier": "bwa",
            "output_dir": "{path}/align",
            "previous_stage": "bcl2fastq",
            "stage": "bwa"
        },
        {
            "use_mate_cigar": true,
            "identifier": "markduplicates",
            "output_dir": "{path}/dedup",
            "previous_stage": "bwa",
            "stage": "markduplicates"
        },
        {
            "use_mate_cigar": false,
            "identifier": "markduplicates.1",
            "output_dir": "{path}/dedupv2",
            "previous_stage": "bwa",
            "stage": "markduplicates"
        },
        {
            "identifier": "bwaalignstats",
            "output_dir": "{path}/stats",
            "previous_stage": "bwa",
            "stage": "bwaalignstats"
        },
        {
            "identifier": "macs",
            "output_dir": "{path}/macs",
            "previous_stage": "markduplicates",
            "stage": "macs"
        },
        {
            "identifier": "macs.1",
            "output_dir": "{path}/macsv2",
            "previous_stage": "markduplicates.1",
            "stage": "macs"
        }
    ],
    "sample_sheet": "{path}/SampleSheet.csv",
    "bcl2fastq_path": "/home/awise/sngs/dependencies/bin/bcl2fastq2-v2.18.0.6/bin/bcl2fastq",
    "sample_path": "{path}",
    "bwa_path": "/illumina/thirdparty/bwa/bwa-0.7.12/bwa",
    "genome": "/illumina/development/Isis/Genomes/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta",
    "picard": "/home/awise/sngs/dependencies/bin/picard-tools-1.129/picard.jar",
    "scratch_path": "/path/to/scratch"
}
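
One way to double-check the wiring of a branching pipeline like this one is to print the stages as a tree built from previous_stage. The sketch below is a hypothetical helper, not part of ZIPPY. It assumes each previous_stage is a single identifier string and that the wiring contains no cycles.

```python
from collections import defaultdict


def stage_children(params):
    """Map each identifier to the stages that read from it.

    Stages without a previous_stage are stored under the key None.
    """
    children = defaultdict(list)
    for stage in params["stages"]:
        children[stage.get("previous_stage")].append(stage["identifier"])
    return dict(children)


def format_tree(children, root=None, depth=0):
    """Return the stage tree as indented lines, two spaces per level."""
    lines = []
    for identifier in children.get(root, []):
        lines.append("  " * depth + identifier)
        lines.extend(format_tree(children, identifier, depth + 1))
    return lines
```

For the params file above, the tree should show bwa feeding three branches: markduplicates followed by macs, markduplicates.1 followed by macs.1, and bwaalignstats.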