Add support for svim and sniffles vcf as input #208

Open
oneillkza opened this issue May 15, 2020 · 5 comments
Labels
enhancement long read support Support for long read sequence data, e.g. from Oxford Nanopore or PacBio

Comments

@oneillkza
Collaborator

oneillkza commented May 15, 2020

SVIM and Sniffles are two SV callers specifically for long-read sequence data. It would be highly beneficial to be able to use their output as input to MAVIS, in order to cluster their calls with each other, do somatic calling, and integrate them with short-read sequence data.

My initial tests suggest that the 'vcf' input in MAVIS crashes for vcfs from both tools. This is somewhat unsurprising given the lack of standardisation for representing SVs in vcf format. So I'll undertake to create load scripts for the vcfs from these two tools.

@oneillkza
Collaborator Author

OK, so the first issue, looking at Sniffles vcfs, is that Sniffles has an SVTYPE "DEL/INV", e.g.:

2	232443764	3319_0	N	<DEL/INV>	.	PASS	PRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=2;END=232443951;STD_quant_start=3.911521;STD_quant_stop=4.207137;Kurtosis_quant_start=4.480132;Kurtosis_quant_stop=-1.544728;SVTYPE=DEL/INV;SUPTYPE=NR;SVLEN=-187;STRANDS=+-;RE=16;REF_strand=7,5;AF=0.571429	GT:DR:DV	0/1:12:16

However, there are only two of those, and they both look like false positives/artifacts. I've also noticed that the same variants seem to get called in other samples. Having checked in a case where I know there is a combined inv/del event, that event is not called as DEL/INV, but the artifactual ones are.

It's also not clear how these events would fit into the vcf format, since the combined event has three breakpoints, and I believe vcf only allows for specifying two.

I think the correct behaviour would be to ignore lines with SVTYPE=DEL/INV.

@oneillkza
Collaborator Author

There's also a DUP/INS:

4	186894524	6309_0	N	<DUP/INS>	.	PASS	IMPRECISE;SVMETHOD=Snifflesv1.0.10;CHR2=4;END=186895121;STD_quant_start=25.670995;STD_quant_stop=316.374778;Kurtosis_quant_start=6.435367;Kurtosis_quant_stop=-1.961370;SVTYPE=DUP/INS;SUPTYPE=AL,SR;SVLEN=913;STRANDS=+-;RE=5;REF_strand=4,4;AF=0.384615	GT:DR:DV	0/1:8:5

which is a bit of a mess when I look in IGV: there does look to be a real insertion, with maybe a real duplication, but they're in the middle of a poly(T) region. The breakpoints of the reported variant seem to correspond to the insertion. Also, when I BLASTed the inserted sequence, it seemed to be a real germline variant reported in this paper:

https://www.ncbi.nlm.nih.gov/pubmed/28250455

However, there is only one of these, and again it may be easiest to ignore them, since it isn't going to be clear which of the variants the breakpoints are referring to. I'll also make a ticket over at the Sniffles repo about this.

This is also a note to myself: for insertions, Sniffles reports bp2 = bp1 + svlen. I guess this looks nicer in genome browsers, but will likely need correcting when loading in MAVIS.
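A minimal sketch of that correction, assuming the loader works on plain integers and that an insertion should be collapsed to a single reference position. `fix_insertion_end` is a hypothetical helper, not part of MAVIS or Sniffles, and the exact coordinate MAVIS expects for bp2 is an assumption:

```python
def fix_insertion_end(svtype, pos, end, svlen):
    """Return a (pos, end) pair with the Sniffles insertion offset undone.

    Assumption: for INS records Sniffles writes END = POS + SVLEN, whereas
    the inserted sequence occupies no reference bases, so END is reset to
    POS. All other SV types are returned unchanged.
    """
    if svtype == 'INS' and end == pos + abs(svlen):
        return pos, pos
    return pos, end
```

The check on `end == pos + abs(svlen)` is there so that records which do not follow the bp2 = bp1 + svlen pattern are left alone rather than silently rewritten.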

@oneillkza
Collaborator Author

Lastly, sniffles uses the SVTYPE INVDUP, for an inverted duplication. This one at least should only have two breakpoints, and may be best to treat like a translocation.

Or they could be ignored. The only places these are called in the COLO829 test data are in MT and GL000225.1/GL000220.1. It might be safe to assume that they are always artifactual.

@oneillkza
Collaborator Author

OK, got it to load in the rows by ignoring any unrecognised SVTYPEs.
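Roughly, the filtering looks like the sketch below. It is illustrative only: the whitelist contents and the helper names `parse_info` and `keep_record` are assumptions for this example, not the actual MAVIS code.

```python
# Hypothetical whitelist; the exact set of types MAVIS accepts is an assumption.
RECOGNISED_SVTYPES = {'DEL', 'DUP', 'INS', 'INV', 'BND', 'TRA'}


def parse_info(info_field):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    info = {}
    for item in info_field.split(';'):
        if not item:
            continue
        key, sep, value = item.partition('=')
        info[key] = value if sep else True
    return info


def keep_record(vcf_line):
    """Return True if the record's SVTYPE is in the whitelist."""
    columns = vcf_line.rstrip('\n').split('\t')
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    return info.get('SVTYPE') in RECOGNISED_SVTYPES
```

With this, the DEL/INV, DUP/INS and INVDUP records above are all dropped, since none of those SVTYPE values is in the set.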

Next is to add some checks/fixes for breakpoints being in the wrong order:

AttributeError: ('interval start > end is not allowed', 1, 0)
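A minimal sketch of the kind of ordering fix meant here, assuming breakpoints are held as plain (chromosome, position) values. `order_breakpoints` is a hypothetical helper for illustration, not the MAVIS implementation:

```python
def order_breakpoints(chr1, pos1, chr2, pos2):
    """Return the two breakpoints so that pos1 <= pos2 on the same chromosome.

    Breakpoints on different chromosomes are returned unchanged, since
    comparing their positions is not meaningful.
    """
    if chr1 == chr2 and pos1 > pos2:
        return chr2, pos2, chr1, pos1
    return chr1, pos1, chr2, pos2
```

If strand or orientation information is carried with each breakpoint, it would need to be swapped along with the positions; that is left out of this sketch.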

@oneillkza oneillkza added the long read support Support for long read sequence data, e.g. from Oxford Nanopore or PacBio label May 27, 2020
@oneillkza
Collaborator Author

Sniffles vcf conversion seems to all be working, and now has test coverage. I've merged that into the long read branch in #211
