Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a simple BAM parser to dnaio #116

Merged
merged 34 commits into from
Nov 2, 2023
Merged

Conversation

rhpvorderman
Copy link
Collaborator

@rhpvorderman rhpvorderman commented Sep 26, 2023

I want to do tag parsing at a later date so the code review (and my efforts) can be split into two sessions. Getting name, sequence and qualities out is by far the most important so I decided to do that first.

As for performance, I enabled SSSE3 (slightly later than SSE3) support by default on linux. This makes the parsing of the encoded sequences 5 times faster. This does not result in a great speedup since that was already quite fast, but still, it is a great fit for the PSHUFB instruction, so it would be bad not to use it. Users can still easily compile from source setting DNAIO_CPU_BASIC=1 in their environment. That will disable the use of the -mssse3 compile flag. Linux users will always have a compiler installed, so this should not be a problem. SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.

As for the performance, even though I did not do anything special, it is much faster than samtools fastq. I benchmarked with this script on python 3.11

import sys 

import dnaio

def main():
    with dnaio.open(sys.argv[1]) as reader:
        for record in reader:
            sys.stdout.buffer.write(record.fastq_bytes())

if __name__ == "__main__":
    main()

Samtools fastq

Benchmark 1: samtools fastq ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      2.188 s ±  0.008 s    [User: 2.092 s, System: 0.096 s]
  Range (min … max):    2.179 s …  2.199 s    5 runs

Dnaio with SSSE3

Benchmark 1: python dnaio_read.py ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      1.005 s ±  0.010 s    [User: 0.833 s, System: 0.172 s]
  Range (min … max):    0.989 s …  1.017 s    5 runs

dnaio without SSSE3

Benchmark 1: python dnaio_read.py ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      1.108 s ±  0.047 s    [User: 0.945 s, System: 0.162 s]
  Range (min … max):    1.072 s …  1.185 s    5 runs

EDIT: So the speed difference is quite good, but even more so when you realize that python-isal is doing the bgzip parsing, which takes approximately 800ms. bgzip however only takes 250 ms to completely decompress this (uncompressed) file (basically it will probably optimize by just straight out copying the uncompressed data, that trick is very fast). However that means that samtools is taking 2 seconds to create the fastq while dnaio only takes 200ms. Samtools is really doing something weird here. The SSE2 instructions on the quality copy will make a difference, sure, but not this much.

@rhpvorderman
Copy link
Collaborator Author

I contributed back the encoded sequence parsing to htslib: samtools/htslib#1677 . It turns out that the code is already faster even without the SSSE3 routine due to the while loop. QED while loops are better than for loops for walking over bits of memory. It makes sense because a loop control variable not needed, so that is one less operation. On the other hand, integer operations are lightning fast so it is a bit surprising.

Copy link
Owner

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1 or READ2 flag is set.

I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.

src/dnaio/_core.pyx Outdated Show resolved Hide resolved
src/dnaio/readers.py Outdated Show resolved Hide resolved
tests/test_internal.py Outdated Show resolved Hide resolved
tests/test_internal.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@marcelm
Copy link
Owner

marcelm commented Sep 27, 2023

BTW, it’s super great that this gets better performance than samtools fastq. However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).

SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.

The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).

@rhpvorderman
Copy link
Collaborator Author

However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).

Completely agree. I just wanted to make sure that the fundamentals (performance-wise) are solid.

The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).

You didn't mention it before, but I also agree on this part. I think minimap2 goes up to SSE4.1 and I think the tradeoff in cost vs benefits is very clear in that case. Same here with the SSSE3 routine.

I am currently working on other things, but I will come back later (hopefully this week) to address the BAM parser concerns.

@rhpvorderman
Copy link
Collaborator Author

I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.

When I wrote the parser I only had unaligned BAM for nanopore reads in mind. That is going to be the primary use case. These reads are single-end (allthough duplex reads are possible but that works differently).
So how to best handle this? I think it is going to be quite some effort to handle all the alignment cases when cutadapt will be dealing with a 100% uBAM in practice. Maybe it is best to only allow to read BAM in single-end format, no paired files, and with no knowledge of mates. It can check the alignment flags and throw an error that it works on unaligned BAM only. Alternatively some filter could be made that catches everything but secondary alignments, but I am not quite sure if the semantics can be completely right.

Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1 or READ2 flag is set.

It is going to be next to impossible to get the correct pairs unless they are ordered in interleaved fashion. (In which case dnaio already has code to handle it.) It is possible using BAM indexing, but that is difficult to implement, and it relies on alignment which is not great in this use case.

What do you think about this?

@marcelm
Copy link
Owner

marcelm commented Sep 27, 2023

Maybe one could return tuples of SequenceRecords like the MultipleFileReader already does. The tuples would have one or two elements depending on whether it’s a single-end or paired-end read.

And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).

@rhpvorderman
Copy link
Collaborator Author

afbeelding

These are the bits that can be found in the BAM sequence. Most of them have difficult implications to how the reads should be interpreted in the terms of dnaio. So I checked what actual flags are used in dorado ubam.


samtools view HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex.bam  | cut -f 2 | sort| uniq -c                                                                          
7646392 4

So that is easy then. How about:

if flags  != 4:
    raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")

In that way there is a good guarantee that it does not let any weird garbage through.

And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).

Won't that work automatically already if the BAM is passed with interleaved=True ? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.

How do you think about this? Can just the one check be implemented and this be done?

@marcelm
Copy link
Owner

marcelm commented Oct 17, 2023

BTW, did you know you can run samtools flags to get the table with SAM flags? You can also give it a flag value and it will explain what that means (for example, samtools flags 99 outputs PAIRED,PROPER_PAIR,MREVERSE,READ1).

So that is easy then. How about:

if flags  != 4:
    raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")

Yes, sounds good to be quite restrictive about the type of BAM that is accepted. We can later relax those restrictions when we have an actual use case.

Won't that work automatically already if the BAM is passed with interleaved=True ? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.

That should mostly work, except that for an unmapped BAM with paired-end reads, we would need to require the flag values to be 0x45 (PAIRED,UNMAP,READ1) and 0x85 (PAIRED,UNMAP,READ2). I think for this to be usable in Cutadapt, that would need to be implemented, but I’d be happy to do that myself when the time comes.

Let’s just add single-end support for now, which is what you need.

BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).

.github/workflows/ci.yml Outdated Show resolved Hide resolved
src/dnaio/readers.py Outdated Show resolved Hide resolved
@rhpvorderman
Copy link
Collaborator Author

rhpvorderman commented Oct 17, 2023

All the review comments are addressed. I also made sure the SSSE3 routine is checked in the CI. The CFLAGS variable is inherited by tox when building the package, no extra configuration is needed. (I have verified this using a print statement in the SSSE3 routine).
It is only tested on linux for one version of python but I suppose that is enough as there is no interaction with cpython api for that particular function.

BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).

I will do that on the next PR. For this PR that would mean I would need to open a new PR. It seems I cannot choose a different base branch on the fly.

@rhpvorderman
Copy link
Collaborator Author

@marcelm Have your concerns be adequately addressed?

@marcelm
Copy link
Owner

marcelm commented Oct 23, 2023

Please give me two more days

@rhpvorderman
Copy link
Collaborator Author

Sorry, it was not my intention to rush you. Take all the days you need. We are expecting a lot of nanopore requests in the feature, yet I have had only one request yet. And that has already been handled.

Copy link
Owner

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t have time for a super in-depth review, especially the SIMD code I only glanced over, but I’m fine with that as long as the tests pass.

setup.py Outdated
import platform
import sys
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os and sys imports are now unused again, right?

cursor += 1;
dest_cursor += 1;
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline at end of file

setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
cursor += 1;
dest_cursor += 1;
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

@marcelm
Copy link
Owner

marcelm commented Nov 2, 2023

Let’s do this. I’ll add a changelog entry after merging.

@marcelm marcelm merged commit 650e423 into marcelm:main Nov 2, 2023
16 of 17 checks passed
@rhpvorderman
Copy link
Collaborator Author

Hi, sorry for my late response to your review. I am currently at sick leave (again unfortunately). I expect to be working again tomorrow. As regards tot the BAM parser. I intend to follow up with tag parsing, putting the tags in the header in the same fashion as samtools fastq does this, so this information is retained for use in/after cutadapt.

@marcelm
Copy link
Owner

marcelm commented Nov 2, 2023

Thanks and no worries, hope you get better soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants