-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a simple BAM parser to dnaio #116
Conversation
I contributed back the encoded sequence parsing to htslib: samtools/htslib#1677 . It turns out that the code is already faster even without the SSSE3 routine due to the while loop. QED while loops are better than for loops for walking over bits of memory. It makes sense because a loop control variable not needed, so that is one less operation. On the other hand, integer operations are lightning fast so it is a bit surprising. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1
or READ2
flag is set.
I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.
BTW, it’s super great that this gets better performance than
The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2). |
Completely agree. I just wanted to make sure that the fundamentals (performance-wise) are solid.
You didn't mention it before, but I also agree on this part. I think minimap2 goes up to SSE4.1 and I think the tradeoff in cost vs benefits is very clear in that case. Same here with the SSSE3 routine. I am currently working on other things, but I will come back later (hopefully this week) to address the BAM parser concerns. |
When I wrote the parser I only had unaligned BAM for nanopore reads in mind. That is going to be the primary use case. These reads are single-end (allthough duplex reads are possible but that works differently).
It is going to be next to impossible to get the correct pairs unless they are ordered in interleaved fashion. (In which case dnaio already has code to handle it.) It is possible using BAM indexing, but that is difficult to implement, and it relies on alignment which is not great in this use case. What do you think about this? |
Maybe one could return tuples of SequenceRecords like the And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine). |
These are the bits that can be found in the BAM sequence. Most of them have difficult implications to how the reads should be interpreted in the terms of dnaio. So I checked what actual flags are used in dorado ubam.
So that is easy then. How about:
In that way there is a good guarantee that it does not let any weird garbage through.
Won't that work automatically already if the BAM is passed with How do you think about this? Can just the one check be implemented and this be done? |
BTW, did you know you can run
Yes, sounds good to be quite restrictive about the type of BAM that is accepted. We can later relax those restrictions when we have an actual use case.
That should mostly work, except that for an unmapped BAM with paired-end reads, we would need to require the flag values to be 0x45 (PAIRED,UNMAP,READ1) and 0x85 (PAIRED,UNMAP,READ2). I think for this to be usable in Cutadapt, that would need to be implemented, but I’d be happy to do that myself when the time comes. Let’s just add single-end support for now, which is what you need. BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches). |
0551427
to
dd2b4cb
Compare
94c6c04
to
cbfa8d2
Compare
cbfa8d2
to
42c6f20
Compare
All the review comments are addressed. I also made sure the SSSE3 routine is checked in the CI. The CFLAGS variable is inherited by tox when building the package, no extra configuration is needed. (I have verified this using a print statement in the SSSE3 routine).
I will do that on the next PR. For this PR that would mean I would need to open a new PR. It seems I cannot choose a different base branch on the fly. |
@marcelm Have your concerns be adequately addressed? |
Please give me two more days |
Sorry, it was not my intention to rush you. Take all the days you need. We are expecting a lot of nanopore requests in the feature, yet I have had only one request yet. And that has already been handled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t have time for a super in-depth review, especially the SIMD code I only glanced over, but I’m fine with that as long as the tests pass.
setup.py
Outdated
import platform | ||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The os
and sys
imports are now unused again, right?
cursor += 1; | ||
dest_cursor += 1; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing newline at end of file
cursor += 1; | ||
dest_cursor += 1; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} |
Let’s do this. I’ll add a changelog entry after merging. |
Hi, sorry for my late response to your review. I am currently at sick leave (again unfortunately). I expect to be working again tomorrow. As regards tot the BAM parser. I intend to follow up with tag parsing, putting the tags in the header in the same fashion as samtools fastq does this, so this information is retained for use in/after cutadapt. |
Thanks and no worries, hope you get better soon! |
I want to do tag parsing at a later date so the code review (and my efforts) can be split into two sessions. Getting name, sequence and qualities out is by far the most important so I decided to do that first.
As for performance, I enabled SSSE3 (slightly later than SSE3) support by default on linux. This makes the parsing of the encoded sequences 5 times faster. This does not result in a great speedup since that was already quite fast, but still, it is a great fit for the PSHUFB instruction, so it would be bad not to use it. Users can still easily compile from source setting
DNAIO_CPU_BASIC=1
in their environment. That will disable the use of the-mssse3
compile flag. Linux users will always have a compiler installed, so this should not be a problem. SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.As for the performance, even though I did not do anything special, it is much faster than samtools fastq. I benchmarked with this script on python 3.11
Samtools fastq
Dnaio with SSSE3
dnaio without SSSE3
EDIT: So the speed difference is quite good, but even more so when you realize that python-isal is doing the bgzip parsing, which takes approximately 800ms. bgzip however only takes 250 ms to completely decompress this (uncompressed) file (basically it will probably optimize by just straight out copying the uncompressed data, that trick is very fast). However that means that samtools is taking 2 seconds to create the fastq while dnaio only takes 200ms. Samtools is really doing something weird here. The SSE2 instructions on the quality copy will make a difference, sure, but not this much.