Add support for long-read bams (genome) #210

Open
oneillkza opened this issue May 27, 2020 · 2 comments
Labels
enhancement long read support Support for long read sequence data, e.g. from Oxford Nanopore or PacBio

Comments

@oneillkza
Collaborator

Reading in VCFs from variant callers that run on long-read BAMs is only part of the problem: MAVIS still needs the BAM files themselves for most operations. Long-read BAMs differ from short-read ("NGS") BAMs in a few key ways:

  • Single end rather than paired-end
  • Variable (and long) read length
  • Relatively high error rate (5-10%), especially for homopolymers

This makes long reads very good for detecting large structural variants, especially since they can map through low-complexity regions, but less reliable for smaller variants.

This ticket is to track work on reading in long-read genome bams.
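As a rough illustration of the differences listed above, here is a minimal sketch of a heuristic that guesses whether a set of reads looks like long-read data. `ReadSummary`, `looks_like_long_read` and both thresholds are hypothetical names and values made up for this example; they are not part of MAVIS. With pysam, the two fields could plausibly be filled from `read.query_length` and `read.is_paired`, but the sketch uses plain Python objects so it runs without a BAM file.

```python
from dataclasses import dataclass
from statistics import median
from typing import Iterable


@dataclass
class ReadSummary:
    """Hypothetical minimal view of one aligned read (not a MAVIS or pysam type)."""
    length: int      # sequenced read length in bases
    is_paired: bool  # True if the read was sequenced as part of a pair


def looks_like_long_read(reads: Iterable[ReadSummary],
                         min_median_length: int = 1000,
                         max_paired_fraction: float = 0.01) -> bool:
    """Heuristic: mostly unpaired reads with a long median length.

    Both thresholds are illustrative assumptions, not values from MAVIS.
    """
    reads = list(reads)
    if not reads:
        raise ValueError('no reads to inspect')
    paired_fraction = sum(r.is_paired for r in reads) / len(reads)
    median_length = median(r.length for r in reads)
    return paired_fraction <= max_paired_fraction and median_length >= min_median_length


# Toy inputs: a paired-end short-read sample and a single-end long-read sample.
short = [ReadSummary(150, True) for _ in range(100)]
long_ = [ReadSummary(n, False) for n in (800, 5000, 12000, 30000, 2500)]
```

The median is used rather than the mean because long-read length distributions tend to be skewed by a few very long reads.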

@oneillkza oneillkza added the long read support Support for long read sequence data, e.g. from Oxford Nanopore or PacBio label May 27, 2020
@oneillkza
Collaborator Author

So, the first major design decision is to create a new file type, genome_longread, for long-read genomic BAMs. This is distinct from genome, which covers short-read paired-end genomic BAMs. I'll probably end up copying a lot of the code that handles the genome BAM type, but I think that'll be cleaner than having if statements everywhere.

E.g. in stats I've created compute_genome_longread_bam_stats, which is a modified copy of compute_genome_bam_stats.
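To make the "separate function per type instead of if statements" idea concrete, here is a small sketch. It is not the body of compute_genome_bam_stats or compute_genome_longread_bam_stats; the function names, inputs (a plain list of read lengths) and returned fields are all assumptions chosen for illustration. It only shows the shape of the approach: one stats function per type, looked up by name.

```python
from statistics import median
from typing import Callable, Dict, List


def read_length_n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError('no read lengths given')


def short_read_stats(lengths: List[int]) -> dict:
    # Stand-in for the paired-end path: a single read length is a sensible summary.
    return {'read_length': max(lengths)}


def long_read_stats(lengths: List[int]) -> dict:
    # Long reads vary widely in length, so summarise the distribution instead.
    return {
        'median_read_length': median(lengths),
        'read_length_n50': read_length_n50(lengths),
    }


# One entry per type; adding a new type means adding a function and a key,
# not another branch inside an existing function.
STATS_BY_PROTOCOL: Dict[str, Callable[[List[int]], dict]] = {
    'genome': short_read_stats,
    'genome_longread': long_read_stats,
}


def compute_stats(protocol: str, lengths: List[int]) -> dict:
    return STATS_BY_PROTOCOL[protocol](lengths)
```

The trade-off is some duplicated code between the two functions in exchange for each one staying readable on its own.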

@oneillkza
Collaborator Author

OK, got it as far as being able to do config and setup. Clustering works, but it fails on validate.

ValueError: ('protocol error', 'genome_longread')

This is somewhat unsurprising. Looks like the next step is to create a class for it in validate/evidence.py, and add a case in validate/main.py that maps the genome_longread protocol to that class.
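A minimal sketch of what that mapping could look like, assuming a simple protocol-to-class lookup. The class names, the `build_evidence` helper and the constructor argument are hypothetical and do not reflect the actual contents of validate/evidence.py or validate/main.py; only the error shape is taken from the traceback above.

```python
class Evidence:
    """Hypothetical base class; stands in for whatever validate/evidence.py defines."""
    protocol = None

    def __init__(self, bam_path: str):
        self.bam_path = bam_path


class GenomeEvidence(Evidence):
    protocol = 'genome'


class GenomeLongReadEvidence(Evidence):
    # New class for the long-read type; single-end handling would live here.
    protocol = 'genome_longread'


EVIDENCE_BY_PROTOCOL = {
    cls.protocol: cls for cls in (GenomeEvidence, GenomeLongReadEvidence)
}


def build_evidence(protocol: str, bam_path: str) -> Evidence:
    try:
        evidence_cls = EVIDENCE_BY_PROTOCOL[protocol]
    except KeyError:
        # Same shape as the error reported above for an unknown protocol.
        raise ValueError('protocol error', protocol) from None
    return evidence_cls(bam_path)
```

With the new class registered, `build_evidence('genome_longread', ...)` returns an instance instead of raising, while any protocol that is still unregistered keeps failing with the same ValueError.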
