Add a simple BAM parser to dnaio #116

rhpvorderman · 2023-09-26T06:08:19Z

I want to do tag parsing at a later date so the code review (and my efforts) can be split into two sessions. Getting name, sequence and qualities out is by far the most important so I decided to do that first.

As for performance, I enabled SSSE3 (slightly later than SSE3) support by default on linux. This makes the parsing of the encoded sequences 5 times faster. This does not result in a great speedup since that was already quite fast, but still, it is a great fit for the PSHUFB instruction, so it would be bad not to use it. Users can still easily compile from source setting DNAIO_CPU_BASIC=1 in their environment. That will disable the use of the -mssse3 compile flag. Linux users will always have a compiler installed, so this should not be a problem. SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.

As for the performance, even though I did not do anything special, it is much faster than samtools fastq. I benchmarked with this script on python 3.11

import sys 

import dnaio

def main():
    with dnaio.open(sys.argv[1]) as reader:
        for record in reader:
            sys.stdout.buffer.write(record.fastq_bytes())

if __name__ == "__main__":
    main()

Samtools fastq

Benchmark 1: samtools fastq ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      2.188 s ±  0.008 s    [User: 2.092 s, System: 0.096 s]
  Range (min … max):    2.179 s …  2.199 s    5 runs

Dnaio with SSSE3

Benchmark 1: python dnaio_read.py ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      1.005 s ±  0.010 s    [User: 0.833 s, System: 0.172 s]
  Range (min … max):    0.989 s …  1.017 s    5 runs

dnaio without SSSE3

Benchmark 1: python dnaio_read.py ~/test/nanopore_100000reads.u.bam > /dev/null
  Time (mean ± σ):      1.108 s ±  0.047 s    [User: 0.945 s, System: 0.162 s]
  Range (min … max):    1.072 s …  1.185 s    5 runs

EDIT: So the speed difference is quite good, but even more so when you realize that python-isal is doing the bgzip parsing, which takes approximately 800ms. bgzip however only takes 250 ms to completely decompress this (uncompressed) file (basically it will probably optimize by just straight out copying the uncompressed data, that trick is very fast). However that means that samtools is taking 2 seconds to create the fastq while dnaio only takes 200ms. Samtools is really doing something weird here. The SSE2 instructions on the quality copy will make a difference, sure, but not this much.

rhpvorderman · 2023-09-26T08:30:26Z

I contributed back the encoded sequence parsing to htslib: samtools/htslib#1677 . It turns out that the code is already faster even without the SSSE3 routine due to the while loop. QED while loops are better than for loops for walking over bits of memory. It makes sense because a loop control variable not needed, so that is one less operation. On the other hand, integer operations are lightning fast so it is a bit surprising.

marcelm

Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1 or READ2 flag is set.

I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.

src/dnaio/_core.pyx

src/dnaio/readers.py

tests/test_internal.py

setup.py

marcelm · 2023-09-27T08:22:08Z

BTW, it’s super great that this gets better performance than samtools fastq. However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).

SSSE3 was introduced 17 years ago, so I doubt anyone is still using CPUs that can't handle it.

The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).

rhpvorderman · 2023-09-27T09:14:54Z

However – and sorry for dampening your enthusiasm – at this stage, it’s more important for me to get the semantics correct (single-end vs paired-end, what to do if certain flags are set in the input).

Completely agree. I just wanted to make sure that the fundamentals (performance-wise) are solid.

The Steam hardware survey agrees with you. I’m not sure I mentioned this before, but I’m fine with requiring x86-64-v2 (which goes up to SSE4.2).

You didn't mention it before, but I also agree on this part. I think minimap2 goes up to SSE4.1 and I think the tradeoff in cost vs benefits is very clear in that case. Same here with the SSSE3 routine.

I am currently working on other things, but I will come back later (hopefully this week) to address the BAM parser concerns.

rhpvorderman · 2023-09-27T09:33:05Z

I also don’t think it’s super obvious what a BAM reader does. It could just return unmapped reads, for example, or have certain requirements on the BAM file. What happens to supplementary alignments, secondary alignments etc. So some decisions need to be made and those need to be documented.

When I wrote the parser I only had unaligned BAM for nanopore reads in mind. That is going to be the primary use case. These reads are single-end (allthough duplex reads are possible but that works differently).
So how to best handle this? I think it is going to be quite some effort to handle all the alignment cases when cutadapt will be dealing with a 100% uBAM in practice. Maybe it is best to only allow to read BAM in single-end format, no paired files, and with no knowledge of mates. It can check the alignment flags and throw an error that it works on unaligned BAM only. Alternatively some filter could be made that catches everything but secondary alignments, but I am not quite sure if the semantics can be completely right.

Thanks! I think what also needs to be addressed is that BAM files can contain mixed single-end and paired-end data. IIUC, this version of the PR just returns all reads as single-end reads, no matter whether the READ1 or READ2 flag is set.

It is going to be next to impossible to get the correct pairs unless they are ordered in interleaved fashion. (In which case dnaio already has code to handle it.) It is possible using BAM indexing, but that is difficult to implement, and it relies on alignment which is not great in this use case.

What do you think about this?

marcelm · 2023-09-27T10:02:37Z

Maybe one could return tuples of SequenceRecords like the MultipleFileReader already does. The tuples would have one or two elements depending on whether it’s a single-end or paired-end read.

And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).

rhpvorderman · 2023-10-13T13:31:20Z

These are the bits that can be found in the BAM sequence. Most of them have difficult implications to how the reads should be interpreted in the terms of dnaio. So I checked what actual flags are used in dorado ubam.


samtools view HG002_20230424_1302_3H_PAO89685_2264ba8c_hac_simplex.bam  | cut -f 2 | sort| uniq -c                                                                          
7646392 4

So that is easy then. How about:

if flags  != 4:
    raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")

In that way there is a good guarantee that it does not let any weird garbage through.

And yes, I think requiring name-sorted BAMs (that is, same order as interleaved FASTQs would be) is totally fine. I think the focus should be supporting uBAM (and if it happens to work for a file with mapped reads, that’s also fine).

Won't that work automatically already if the BAM is passed with interleaved=True ? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.

How do you think about this? Can just the one check be implemented and this be done?

marcelm · 2023-10-17T08:24:18Z

BTW, did you know you can run samtools flags to get the table with SAM flags? You can also give it a flag value and it will explain what that means (for example, samtools flags 99 outputs PAIRED,PROPER_PAIR,MREVERSE,READ1).

So that is easy then. How about:

if flags  != 4:
    raise NotImplementedError("dnaio was implemented with unmapped BAM files in mind to support nanopore sequencer input. Mapped BAM files are not supported. Please use samtools fastq.")

Yes, sounds good to be quite restrictive about the type of BAM that is accepted. We can later relax those restrictions when we have an actual use case.

Won't that work automatically already if the BAM is passed with interleaved=True ? It will check then if the nth and the n+1th sequence have the same name right? It will throw an error otherwise. So I don't think there is a specific need to implement this separately for the BAM reader.

That should mostly work, except that for an unmapped BAM with paired-end reads, we would need to require the flag values to be 0x45 (PAIRED,UNMAP,READ1) and 0x85 (PAIRED,UNMAP,READ2). I think for this to be usable in Cutadapt, that would need to be implemented, but I’d be happy to do that myself when the time comes.

Let’s just add single-end support for now, which is what you need.

BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).

.github/workflows/ci.yml

src/dnaio/readers.py

This reverts commit e4d2b3d.

rhpvorderman · 2023-10-17T12:18:40Z

All the review comments are addressed. I also made sure the SSSE3 routine is checked in the CI. The CFLAGS variable is inherited by tox when building the package, no extra configuration is needed. (I have verified this using a print statement in the SSSE3 routine).
It is only tested on linux for one version of python but I suppose that is enough as there is no interaction with cpython api for that particular function.

BTW, you can also work in branches in this repo directly instead of in your own fork (makes it also sligthly easier for me to check out branches).

I will do that on the next PR. For this PR that would mean I would need to open a new PR. It seems I cannot choose a different base branch on the fly.

rhpvorderman · 2023-10-23T12:43:21Z

@marcelm Have your concerns be adequately addressed?

marcelm · 2023-10-23T20:08:17Z

Please give me two more days

rhpvorderman · 2023-10-24T06:03:05Z

Sorry, it was not my intention to rush you. Take all the days you need. We are expecting a lot of nanopore requests in the feature, yet I have had only one request yet. And that has already been handled.

marcelm

I don’t have time for a super in-depth review, especially the SIMD code I only glanced over, but I’m fine with that as long as the tests pass.

marcelm · 2023-10-27T10:24:16Z

setup.py

 import platform
+import sys


The os and sys imports are now unused again, right?

marcelm · 2023-10-27T11:51:58Z

src/dnaio/bam.h

+        cursor += 1;
+        dest_cursor += 1;
+    }
+}


Missing newline at end of file

setup.py

marcelm · 2023-11-02T10:18:53Z

src/dnaio/bam.h

+        cursor += 1;
+        dest_cursor += 1;
+    }
+}


Suggested change

}

}

marcelm · 2023-11-02T10:23:31Z

Let’s do this. I’ll add a changelog entry after merging.

rhpvorderman · 2023-11-02T10:39:55Z

Hi, sorry for my late response to your review. I am currently at sick leave (again unfortunately). I expect to be working again tomorrow. As regards tot the BAM parser. I intend to follow up with tag parsing, putting the tags in the header in the same fashion as samtools fastq does this, so this information is retained for use in/after cutadapt.

marcelm · 2023-11-02T10:52:57Z

Thanks and no worries, hope you get better soon!

rhpvorderman added 18 commits September 25, 2023 14:10

Construct BamIter class

cb199f4

Allow reading

66b7194

Fix compile errors

da86a00

Fix small errors

cc81baa

Fix segfault and compile issue

20d0474

Fix small size offset issue

fcd3f75

Use SSSE3 on linux x86_64

7e8be76

Add basic reading test

df726da

Reformatting

f82f8f6

Add a header parsing test

99c5f09

Start on tests for truncated records

2a2071d

Make sure EOF is thrown for truncated headers

0b21fe9

Test truncated records

24b2b34

Test truncation before magic

d931ea1

Add more bam parser tests

4aa4d56

Reformat setup.py

d7e1fb0

Fix mypy issues, create stub for BamIter

b2955b5

Add missing header file

7ef5606

rhpvorderman added 2 commits September 26, 2023 13:11

Prevent \r\n line endings in sam file

c4a4805

Add missing tests

076ae56

marcelm reviewed Sep 27, 2023

View reviewed changes

src/dnaio/_core.pyx Outdated Show resolved Hide resolved

src/dnaio/readers.py Outdated Show resolved Hide resolved

tests/test_internal.py Outdated Show resolved Hide resolved

tests/test_internal.py Outdated Show resolved Hide resolved

setup.py Outdated Show resolved Hide resolved

rhpvorderman added 2 commits October 17, 2023 13:17

Address formatting errors and typo's

59aff44

Do not set -mssse3 in setup.py

e14b58a

rhpvorderman added 3 commits October 17, 2023 13:22

Build linux wheels with ssse3 support

e4d2b3d

Fix typo in test

fd045a3

Throw an error on mapped files

739e59a

marcelm reviewed Oct 17, 2023

View reviewed changes

.github/workflows/ci.yml Outdated Show resolved Hide resolved

marcelm reviewed Oct 17, 2023

View reviewed changes

src/dnaio/readers.py Outdated Show resolved Hide resolved

Test mapped reads throw errors

dd2b4cb

rhpvorderman force-pushed the bamparser branch from 0551427 to dd2b4cb Compare October 17, 2023 11:44

rhpvorderman added 6 commits October 17, 2023 13:45

Revert "Build linux wheels with ssse3 support"

0233247

This reverts commit e4d2b3d.

Build with -mssse3 on linux

63eb2bf

Reformat setup.py

89c7ae3

Merge branch 'main' into bamparser

fae9f8b

Remove erroneous docstring

4e14cc3

Remove unused ignore comment

f98d72f

rhpvorderman force-pushed the bamparser branch from 94c6c04 to cbfa8d2 Compare October 17, 2023 12:14

Make sure -mssse3 flag is tested

42c6f20

rhpvorderman force-pushed the bamparser branch from cbfa8d2 to 42c6f20 Compare October 17, 2023 12:15

marcelm approved these changes Oct 27, 2023

View reviewed changes

marcelm reviewed Nov 2, 2023

View reviewed changes

Apply suggestions from code review

9afbee7

marcelm merged commit 650e423 into marcelm:main Nov 2, 2023
16 of 17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a simple BAM parser to dnaio #116

Add a simple BAM parser to dnaio #116

rhpvorderman commented Sep 26, 2023 •

edited

Loading

rhpvorderman commented Sep 26, 2023

marcelm left a comment

marcelm commented Sep 27, 2023

rhpvorderman commented Sep 27, 2023

rhpvorderman commented Sep 27, 2023

marcelm commented Sep 27, 2023

rhpvorderman commented Oct 13, 2023

marcelm commented Oct 17, 2023

rhpvorderman commented Oct 17, 2023 •

edited

Loading

rhpvorderman commented Oct 23, 2023

marcelm commented Oct 23, 2023

rhpvorderman commented Oct 24, 2023

marcelm left a comment

marcelm Oct 27, 2023

marcelm Oct 27, 2023

marcelm Nov 2, 2023

marcelm commented Nov 2, 2023

rhpvorderman commented Nov 2, 2023

marcelm commented Nov 2, 2023

Add a simple BAM parser to dnaio #116

Add a simple BAM parser to dnaio #116

Conversation

rhpvorderman commented Sep 26, 2023 • edited Loading

rhpvorderman commented Sep 26, 2023

marcelm left a comment

Choose a reason for hiding this comment

marcelm commented Sep 27, 2023

rhpvorderman commented Sep 27, 2023

rhpvorderman commented Sep 27, 2023

marcelm commented Sep 27, 2023

rhpvorderman commented Oct 13, 2023

marcelm commented Oct 17, 2023

rhpvorderman commented Oct 17, 2023 • edited Loading

rhpvorderman commented Oct 23, 2023

marcelm commented Oct 23, 2023

rhpvorderman commented Oct 24, 2023

marcelm left a comment

Choose a reason for hiding this comment

marcelm Oct 27, 2023

Choose a reason for hiding this comment

marcelm Oct 27, 2023

Choose a reason for hiding this comment

marcelm Nov 2, 2023

Choose a reason for hiding this comment

marcelm commented Nov 2, 2023

rhpvorderman commented Nov 2, 2023

marcelm commented Nov 2, 2023

rhpvorderman commented Sep 26, 2023 •

edited

Loading

rhpvorderman commented Oct 17, 2023 •

edited

Loading