Feature/straglr #318

Merged
merged 7 commits into from
Mar 10, 2022
5 changes: 5 additions & 0 deletions docs/background/citations.md
@@ -23,6 +23,11 @@ Chen,X. et al. (2016) Manta: rapid detection of structural variants
and indels for germline and cancer sequencing applications.
Bioinformatics, 32, 1220--1222.

## Chiu-2021

Chiu,R. et al. (2021) Straglr: discovering and genotyping tandem repeat
expansions using whole genome long-read sequences. Genome Biol., 22, 224.

## Haas-2017

Haas,B et al. (2017) STAR-Fusion: Fast and Accurate Fusion
4 changes: 4 additions & 0 deletions docs/inputs/.pages
@@ -0,0 +1,4 @@
nav:
- reference.md
- standard.md
- ...
28 changes: 28 additions & 0 deletions docs/inputs/non_python_dependencies.md
@@ -0,0 +1,28 @@
# Non-python Dependencies

MAVIS integrates with
[SV callers](./sv_callers.md),
[job schedulers](#job-schedulers), and
[aligners](#aligners). While some of
these dependencies are optional, all currently supported options are
detailed below. The versions column in the tables below lists all the
versions which were tested for each tool. Each version listed is known
to be compatible with MAVIS.

## Job Schedulers

MAVIS v3 uses [snakemake](https://snakemake.readthedocs.io/en/stable/) to handle job scheduling.

## Aligners

Two aligners are supported: [bwa](../../glossary/#bwa) and
[blat](../../glossary/#blat) (default). Both are included in the docker image by default.

| Name | Version(s) | Environment Setting |
| ---------------------------------------------- | ----------------------- | ------------------------- |
| [blat](../../glossary/#blat) | `36x2` `36` | `MAVIS_ALIGNER=blat` |
| [bwa mem](../../glossary/#bwa) | `0.7.15-r1140` `0.7.12` | `MAVIS_ALIGNER='bwa mem'` |

!!! note
    When setting the aligner you will also need to set the
    [aligner_reference](../../configuration/settings/#aligner_reference) to match.
16 changes: 8 additions & 8 deletions docs/inputs/reference.md
@@ -10,16 +10,16 @@ To improve the install experience for the users, different
configurations of the MAVIS annotations file have been made available.
These files can be downloaded below, or if the required configuration is
not available,
(instructions on generating the annotations file)[/inputs/reference/#generating-the-annotations-from-ensembl] can be found below.
[instructions on generating the annotations file](/inputs/reference/#generating-the-annotations-from-ensembl) can be found below.

| File Name (Type/Format) | Environment Variable | Download |
| --------------------------------------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [reference genome](../../inputs/reference/#reference-genome) ([fasta](../../glossary/#fasta)) | `MAVIS_REFERENCE_GENOME` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz) <br> [![](../images/get_app-24px.svg) GRCh38](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.tar.gz) |
| File Name (Type/Format) | Environment Variable | Download |
| --------------------------------------------------------------------------------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [reference genome](../../inputs/reference/#reference-genome) ([fasta](../../glossary/#fasta)) | `MAVIS_REFERENCE_GENOME` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz) <br> [![](../images/get_app-24px.svg) GRCh38](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.tar.gz) |
| [annotations](../../inputs/reference/#annotations) ([JSON](../../glossary/#json)) | `MAVIS_ANNOTATIONS` | [![](../images/get_app-24px.svg) GRCh37/Hg19 + Ensembl69](http://www.bcgsc.ca/downloads/mavis/v3/ensembl69_hg19_annotations.v3.json.gz) <br> [![](../images/get_app-24px.svg) GRCh38 + Ensembl79](http://www.bcgsc.ca/downloads/mavis/v3/ensembl79_hg38_annotations.v3.json.gz) |
| [masking](../../inputs/reference/#masking-file) (text/tabbed) | `MAVIS_MASKING` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://www.bcgsc.ca/downloads/mavis/hg19_masking.tab)<br>[![](../images/get_app-24px.svg) GRCh38](http://www.bcgsc.ca/downloads/mavis/GRCh38_masking.tab) |
| [template metadata](../../inputs/reference/#template-metadata) (text/tabbed) | `MAVIS_TEMPLATE_METADATA` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz)<br>[![](../images/get_app-24px.svg) GRCh38](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz) |
| [DGV annotations](../../inputs/reference/#dgv-database-of-genomic-variants) (text/tabbed) | `MAVIS_DGV_ANNOTATION` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://www.bcgsc.ca/downloads/mavis/dgv_hg19_variants.tab)<br>[![](../images/get_app-24px.svg) GRCh38](http://www.bcgsc.ca/downloads/mavis/dgv_hg38_variants.tab) |
| [aligner reference](../../inputs/reference/#aligner-reference) | `MAVIS_ALIGNER_REFERENCE` | [![](../images/get_app-24px.svg) GRCh37/Hg19 2bit (blat)](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit)<br>[![](../images/get_app-24px.svg) GRCh38 2bit (blat)](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit) |
| [masking](../../inputs/reference/#masking-file) (text/tabbed) | `MAVIS_MASKING` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://www.bcgsc.ca/downloads/mavis/hg19_masking.tab)<br>[![](../images/get_app-24px.svg) GRCh38](http://www.bcgsc.ca/downloads/mavis/GRCh38_masking.tab) |
| [template metadata](../../inputs/reference/#template-metadata) (text/tabbed) | `MAVIS_TEMPLATE_METADATA` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz)<br>[![](../images/get_app-24px.svg) GRCh38](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz) |
| [DGV annotations](../../inputs/reference/#dgv-database-of-genomic-variants) (text/tabbed) | `MAVIS_DGV_ANNOTATION` | [![](../images/get_app-24px.svg) GRCh37/Hg19](http://www.bcgsc.ca/downloads/mavis/dgv_hg19_variants.tab)<br>[![](../images/get_app-24px.svg) GRCh38](http://www.bcgsc.ca/downloads/mavis/dgv_hg38_variants.tab) |
| [aligner reference](../../inputs/reference/#aligner-reference) | `MAVIS_ALIGNER_REFERENCE` | [![](../images/get_app-24px.svg) GRCh37/Hg19 2bit (blat)](http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit)<br>[![](../images/get_app-24px.svg) GRCh38 2bit (blat)](http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit) |

If the environment variables above are set they will be used as the
default values when any step of the pipeline script is called (including
158 changes: 158 additions & 0 deletions docs/inputs/sv_callers.md
@@ -0,0 +1,158 @@
# SV Callers

MAVIS supports output from a wide variety of SV callers. Assumptions are made for each tool based on interpretation of its output and its publications.

## Configuring Conversions

Adding a conversion step to your MAVIS run is as simple as adding that section to the input JSON config.

The general structure of this section is as follows:

```jsonc
{
    "convert": {
        "<ALIAS>": {
            "file_type": "<TOOL OUTPUT TYPE>",
            "name": "<TOOL NAME>", // optional field for supported tools
            "inputs": [
                "/path/to/tool/output/file"
            ]
        }
    }
}
```

A full version of the input configuration file specification can be found in the [configuration](../configuration/general.md) section.
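
As an illustration, the `convert` section can also be built programmatically and written out as JSON. The sketch below is a minimal example, not an official MAVIS utility: the alias `straglr_calls` and the input path are hypothetical placeholders, while `straglr` is the `file_type` value defined in `SUPPORTED_TOOL`.

```python
import json

# Hypothetical alias and input path, chosen for illustration only.
# "straglr" matches the SUPPORTED_TOOL.STRAGLR value ('straglr').
config = {
    "convert": {
        "straglr_calls": {
            "file_type": "straglr",
            "inputs": ["/path/to/straglr/output.bed"],
        }
    }
}

# Serialize the section so it can be pasted into the input JSON config
config_text = json.dumps(config, indent=4)
print(config_text)
```

The optional `name` field is omitted here since it is only needed when the tool name differs from the file type.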

## Supported Tools

The tools and versions currently supported are given below. Versions listed indicate the version of the tool for which output files have been tested as input into MAVIS. MAVIS also supports a [general VCF input](#general-vcf-inputs).

| SV Caller | Version(s) Tested | Files used as MAVIS input |
| --------------------------------------------------------------------------- | ----------------- | --------------------------------------------- |
| [BreakDancer (Chen, 2009)](../../background/citations#chen-2009) | `1.4.5` | `Tool's main output file(s)` |
| [BreakSeq (Abyzov, 2015)](../../background/citations#abyzov-2015) | `2.2` | `work/breakseq.vcf.gz` |
| [Chimerascan (Iyer, 2011)](../../background/citations#iyer-2011) | `0.4.5` | `*.bedpe` |
| [CNVnator (Abyzov, 2011)](../../background/citations#abyzov-2011) | `0.3.3` | `Tool's main output file(s)` |
| [CuteSV (Jiang, 2020)](../../background/citations#jiang-2020) | `1.0.10` | `*.vcf` |
| [DeFuse (McPherson, 2011)](../../background/citations#mcpherson-2011) | `0.6.2` | `results/results.classify.tsv` |
| [DELLY (Rausch, 2012)](../../background/citations#rausch-2012) | `0.6.1` `0.7.3` | `combined.vcf` (converted from bcf) |
| [Manta (Chen, 2016)](../../background/citations#chen-2016) | `1.0.0` | `{diploidSV,somaticSV}.vcf` |
| [Pindel (Ye, 2009)](../../background/citations#ye-2009) | `0.2.5b9` | `Tool's main output file(s)` |
| [Sniffles (Sedlazeck, 2018)](../../background/citations#sedlazeck-2018) | `1.0.12b` | `*.vcf` |
| [STAR-Fusion (Haas, 2017)](../../background/citations#haas-2017) | `1.4.0` | `star-fusion.fusion_predictions.abridged.tsv` |
| [Straglr (Chiu, 2021)](../../background/citations#chiu-2021) | | `<output_prefix>.bed` |
| [Strelka (Saunders, 2012)](../../background/citations#saunders-2012) | `1.0.6` | `passed.somatic.indels.vcf` |
| [Trans-ABySS (Robertson, 2010)](../../background/citations/#robertson-2010) | `1.4.8 (custom)` | `{indels/events_novel_exons,fusions/*}.tsv` |

!!! note
    [Trans-ABySS](../../glossary/#trans-abyss): The Trans-ABySS version
    used was an in-house dev version. However, the output columns are
    compatible with 1.4.8, as that was the version branched from.
    Additionally, although indels can be used from both genome and
    transcriptome outputs of Trans-ABySS, it is recommended to only use the
    genome indel calls, as the transcriptome indel calls (for versions
    tested) introduce a very high number of false positives, which slows
    down validation. It is much faster to simply use the genome indels for
    both genome and transcriptome.

## [DELLY](../../glossary/#delly) Post-processing

Some post-processing on the DELLY output files is generally done prior
to input. The output BCF files are converted to a VCF file:

```bash
bcftools concat -f /path/to/file/with/vcf/list --allow-overlaps --output-type v --output combined.vcf
```

## General VCF inputs

Assuming that the tool outputting the VCF file follows standard
conventions, then it is possible to use a
[general VCF conversion](../../package/mavis/tools/vcf)
that is not tool-specific. Given the wide variety in content for VCF files,
MAVIS makes a number of assumptions and the VCF conversion may not work
for all VCFs. In general MAVIS follows the [VCF 4.2
specification](https://samtools.github.io/hts-specs/VCFv4.2.pdf). If the
input tool you are using differs, it would be better to use a
[custom conversion script](#custom-conversions).

Using the general VCF tool with a non-standard tool can be done as follows:

```json
{
    "convert": {
        "my_tool_alias": {
            "file_type": "vcf",
            "name": "my_tool",
            "inputs": ["/path/to/my_tool/output.vcf"]
        }
    }
}
```

### Assumptions on non-standard INFO fields

- `PRECISE`: if given, confidence intervals are ignored in favour of exact breakpoint calls, using `POS` and `END` as the breakpoint positions.
- `CT`: if given, the values are representative of the breakpoint orientations.
- `CHR2`: given for all interchromosomal events.
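
The `PRECISE` assumption can be sketched roughly as follows. `breakpoint_intervals` is a hypothetical helper written for this illustration and is not the MAVIS implementation. Using the standard VCF `CIPOS`/`CIEND` confidence intervals to widen imprecise calls is an assumption about the fallback behaviour.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]


def breakpoint_intervals(pos: int, end: int, info: Dict) -> Tuple[Interval, Interval]:
    """Hypothetical helper: derive (start, end) intervals for both breakpoints."""
    if info.get('PRECISE'):
        # PRECISE given: ignore any confidence intervals, use the exact positions
        return (pos, pos), (end, end)
    # Assumed fallback: widen each position by its confidence interval, if present
    cipos = info.get('CIPOS', (0, 0))
    ciend = info.get('CIEND', (0, 0))
    break1 = (max(1, pos + cipos[0]), pos + cipos[1])
    break2 = (max(1, end + ciend[0]), end + ciend[1])
    return break1, break2
```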

### Translating BND type Alt fields

There are four possible configurations for the alt field of a BND type structural variant
based on the VCF specification. These correspond 1-1 to the orientation types for MAVIS
translocation structural variants.

```text
r = reference base/seq
u = untemplated sequence/alternate sequence
p = chromosome:position
```

| alt format | orients |
| ---------- | ------- |
| `ru[p[` | LR |
| `[p[ur` | RR |
| `]p]ur` | RL |
| `ru]p]` | LL |
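
The table above can be expressed as a small lookup. The following is a sketch only: `bnd_orientations` and the regular expression are hypothetical names written for this example, not the parser MAVIS uses. It assumes the alt string has sequence on exactly one side of the bracketed mate position.

```python
import re
from typing import Tuple

# (bracket character, bracket comes first?) -> (break1 orient, break2 orient)
_BND_ORIENTS = {
    ('[', False): ('L', 'R'),  # ru[p[
    ('[', True): ('R', 'R'),   # [p[ur
    (']', True): ('R', 'L'),   # ]p]ur
    (']', False): ('L', 'L'),  # ru]p]
}

# sequence, bracket, mate position, the same bracket again, sequence
_BND_PATTERN = re.compile(
    r'^(?P<prefix>[^\[\]]*)(?P<bracket>[\[\]])(?P<mate>[^\[\]]+)(?P=bracket)(?P<suffix>[^\[\]]*)$'
)


def bnd_orientations(alt: str) -> Tuple[str, str, str]:
    """Return (mate position, break1 orientation, break2 orientation)."""
    match = _BND_PATTERN.match(alt)
    if not match:
        raise ValueError(f'not a BND alt string: {alt}')
    prefix, suffix = match.group('prefix'), match.group('suffix')
    if bool(prefix) == bool(suffix):
        # sequence must appear on exactly one side of the brackets
        raise ValueError(f'ambiguous BND alt string: {alt}')
    orient1, orient2 = _BND_ORIENTS[(match.group('bracket'), not prefix)]
    return match.group('mate'), orient1, orient2
```

For instance, `bnd_orientations('G[chr2:321682[')` matches the `ru[p[` form and yields the `LR` orientation pair along with the mate position `chr2:321682`.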

## Custom Conversions

If there is a tool that is not yet supported by MAVIS and you would like it to be, you can either add a [feature request](https://github.com/bcgsc/mavis/issues) to our GitHub page or tackle writing the conversion script yourself. Either way, there are a few things you will need:

- A sample output from the tool in question
- Tool metadata for the citation, version, etc

### Logic Example - [Chimerascan](../../glossary/#chimerascan)

The following is a description of how the conversion script for
[Chimerascan](../../background/citations/#iyer-2011) was generated.
While this is a built-in conversion command now, the logic could also
have been put in an external script. As mentioned above, there are a
number of assumptions that had to be made about the tool's output to
convert it to the
[standard mavis format](../../inputs/standard/). Assumptions were then verified by reviewing a series of
called events in [IGV](../../glossary/#igv). In the current
example, [Chimerascan](../../background/citations/#iyer-2011) output
has six columns of interest that were used in the conversion:

- start3p
- end3p
- strand3p
- start5p
- end5p
- strand5p

The above columns describe two segments which are joined. MAVIS requires
the position of the join. It was assumed that the segments are always
joined as a [sense fusion](../../glossary/#sense-fusion). Using this
assumption there are four logical cases to determine the position of the
breakpoints.

For example, the first case is: if both strands are positive, then the end
of the five-prime segment (end5p) is the first breakpoint and the start
of the three-prime segment (start3p) is the second breakpoint.
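
The four cases can be sketched as below. Only the first case (both strands positive) is stated in the text. The other three are inferred here by symmetry under the sense-fusion assumption, so treat them as an assumption rather than the built-in converter's logic. `chimerascan_breakpoints` is a hypothetical name, and any coordinate-system adjustments are left out.

```python
from typing import Dict, Tuple


def chimerascan_breakpoints(row: Dict) -> Tuple[int, int]:
    """Hypothetical sketch: pick the join position on each segment."""
    # Five-prime segment: on '+' the transcript runs toward its end, so the
    # join is at end5p; on '-' it is assumed to be at start5p
    break1 = row['end5p'] if row['strand5p'] == '+' else row['start5p']
    # Three-prime segment: on '+' the transcript resumes at its start, so the
    # join is at start3p; on '-' it is assumed to be at end3p
    break2 = row['start3p'] if row['strand3p'] == '+' else row['end3p']
    return break1, break2


# Made-up coordinates for illustration
example = {
    'start5p': 100, 'end5p': 200, 'strand5p': '+',
    'start3p': 500, 'end3p': 600, 'strand3p': '+',
}
print(chimerascan_breakpoints(example))
```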

### Calling a Custom Conversion Script

Since MAVIS v3+ is run using [snakemake](https://snakemake.readthedocs.io/en/stable/), the simplest way to incorporate your custom conversion scripts is to modify the Snakefile and add them as rules.
5 changes: 5 additions & 0 deletions src/mavis/convert/__init__.py
@@ -13,6 +13,7 @@
from .cnvnator import convert_row as _parse_cnvnator
from .constants import SUPPORTED_TOOL, TOOL_SVTYPE_MAPPING, TRACKING_COLUMN
from .starfusion import convert_row as _parse_starfusion
from .straglr import convert_row as _parse_straglr
from .transabyss import convert_row as _parse_transabyss
from .vcf import convert_file as read_vcf

@@ -142,6 +143,10 @@ def _convert_tool_row(
            {k: v for k, v in row.items() if k not in {'Type', 'Chr1', 'Chr2', 'Pos1', 'Pos2'}}
        )

    elif file_type == SUPPORTED_TOOL.STRAGLR:

        std_row.update(_parse_straglr(row))

    else:
        raise NotImplementedError('unsupported file type', file_type)

1 change: 1 addition & 0 deletions src/mavis/convert/constants.py
@@ -27,6 +27,7 @@ class SUPPORTED_TOOL(MavisNamespace):
    CNVNATOR = 'cnvnator'
    STRELKA = 'strelka'
    STARFUSION = 'starfusion'
    STRAGLR = 'straglr'


TOOL_SVTYPE_MAPPING = {v: [v] for v in SVTYPE.values()} # type: ignore
32 changes: 32 additions & 0 deletions src/mavis/convert/straglr.py
@@ -0,0 +1,32 @@
from typing import Dict

from ..constants import COLUMNS, SVTYPE


def convert_row(row: Dict) -> Dict:
    """
    Converts the fields from the original STRAGLR BED output into MAVIS definitions of an SV
    Since STRAGLR defines regions where short tandem repeats exist we make the definitions here fairly
    non-specific

    See their github page for more details: https://github.com/bcgsc/straglr

    BED Columns
    - chrom: chromosome name
    - start: start coordinate of locus
    - end: end coordinate of locus
    - repeat_unit: repeat motif
    - allele<N>.size: where N={1,2,3...} depending on --max_num_clusters e.g. N={1,2} if --max_num_clusters==2 (default)
    - allele<N>.copy_number
    - allele<N>.support
    """
    return {
        COLUMNS.break1_chromosome: row['chrom'],
        COLUMNS.break2_chromosome: row['chrom'],
        COLUMNS.break1_position_start: row['start'],
        COLUMNS.break1_position_end: row['end'],
        COLUMNS.break2_position_start: row['start'],
        COLUMNS.break2_position_end: row['end'],
        COLUMNS.untemplated_seq: None,
        COLUMNS.event_type: SVTYPE.INS,
    }
10 changes: 10 additions & 0 deletions tests/data/straglr.bed
@@ -0,0 +1,10 @@
#chrom start end repeat_unit allele1:size allele1:copy_number allele1:support allele2:size allele2:copy_number allele2:support
chr11 776686 778078 CT 100.0 150.0 10 100.0 100.0 1
chr10 3079216 3079421 AGAGGTCACCACCCCTTCCCAACAATCCAGTAACAATCC 100.0 150.0 10 100.0 100.0 1
chr9 2080637 2081030 CTCCTTCCCTCCGCCCCCACCTCGGTCCCTGT 100.0 150.0 10 100.0 100.0 1
chrX 244719 245293 CCCCGGGAACCGCCT 100.0 150.0 10 - - -
chr7 284096 284233 GGT 100.0 150.0 10 - - -
chr8 288173 290242 CCCTGCTCCGT 100.0 150.0 10 100.0 100.0 1
chr3 2382228 2382908 CCGTGGGGGAGGCTGAGGCTATGGGGACT 100.0 100.0 10 - - -
chr2 2427285 2427528 CCTCC 100.0 150.0 10 - - -
chr2 2427953 2428216 GGAGG 100.0 150.0 10 100.0 100.0 1
11 changes: 11 additions & 0 deletions tests/test_mavis/convert/test_convert.py
@@ -1,3 +1,4 @@
import itertools
import os
import shutil
import sys
@@ -47,6 +48,16 @@ def run_main(self, inputfile, file_type, strand_specific=False):
            result.setdefault(pair.data['tracking_id'], []).append(pair)
        return result

    def test_straglr(self):
        result = self.run_main(get_data('straglr.bed'), SUPPORTED_TOOL.STRAGLR, False)
        assert len(result) == 9
        for bpp in itertools.chain(*result.values()):
            assert bpp.break1.chr == bpp.break2.chr
            assert bpp.break1.start == bpp.break2.start
            assert bpp.break1.end == bpp.break2.end
            assert bpp.event_type == SVTYPE.INS
            assert bpp.untemplated_seq is None

    def test_chimerascan(self):
        self.run_main(get_data('chimerascan_output.bedpe'), SUPPORTED_TOOL.CHIMERASCAN, False)
