Samtools can no longer write BAM files with more than 2GB of headers (since v1.10) #1420

Closed
jmarshall opened this issue Mar 15, 2022 · 1 comment · Fixed by #1421
@jmarshall
Member

jmarshall commented Mar 15, 2022

Even though SAMv1.pdf §4.2 limits (via bounds on l_text) the textual size of headers in BAM files to 2 GB, traditionally samtools has treated this field as a uint32_t and been able to read and write BAM files with headers up to 4 GB.
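To make the 2 GB versus 4 GB distinction concrete, here is a minimal sketch of how the four little-endian bytes of `l_text` can be decoded as an unsigned value and then compared against the specification's bound. The function names are hypothetical and are not HTSlib API; the bound is assumed to be "less than 2^31", i.e. at most `INT32_MAX`.

```c
#include <stdint.h>

/* Hypothetical helper (not HTSlib API): decode the 4-byte little-endian
 * l_text field as an unsigned 32-bit value, which allows lengths up to
 * just under 4 GB. */
static uint32_t l_text_as_unsigned(const unsigned char b[4]) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Hypothetical helper: returns 1 if the length stays within the bound
 * assumed here for the specification (l_text < 2^31), else 0. */
static int l_text_within_spec(uint32_t l_text) {
    return l_text <= (uint32_t)INT32_MAX;
}
```

A header of about 2.5 GB fits in the unsigned field but falls outside the assumed specification bound, which is the range this issue is about.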

See e.g. samtools/samtools#67 which noted that 0.1.19 had introduced an error when trying to write out such files whereas samtools up to 0.1.18 completed without error. HTSlib-based samtools of the time fixed this problem in 0.2.0-rc4 and samtools versions from 1.0 have been able to read and write BAM files with headers up to 4 GB. See also samtools/samtools#1613 and samtools/hts-specs#460 (comment) which describes this as a “quality-of-implementation” feature of htslib/samtools. To be sure, people working with tens of millions of reference sequences (who thus have >2GB of text headers) are pushing the envelope, but samtools has historically supported them in their endeavours.

(In analysing this history now, I have discovered that, at least as compiled with today's compilers, samtools 0.1.18 and earlier completed without error when writing such BAM files but also did not produce valid output files! It is hard to believe that samtools/samtools#67's reporters did not notice this…)

This changed with the new header API in HTSlib 1.10 due to 62f9909, which adds a 2 GB check when writing BAM files out. (This check is “artificial” — there is no related internal implementation limit, and reading in BAM files with larger headers continues to work.)

This removal of existing samtools functionality was not mentioned in either HTSlib's or samtools's NEWS files. (Probably it was not realised that the existing behaviour for 2 GB … 4 GB existed.)

While this BAM spec extension potentially causes interoperability issues with other BAM implementations, it also enables these samtools users to do their work with rougher assemblies.
IMHO either the previous functionality should be restored or the new limitation needs to be called out in NEWS.

@jmarshall
Member Author

(This Perl script generates a stupid SAM file with 2.5 GB of headers: gen-huge-headers)

jkbonfield transferred this issue from samtools/samtools Apr 8, 2022
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Apr 8, 2022
This isn't permitted by the BAM specification, but was accepted by
earlier htslib release.  62f9909 added code to check the maximum
length.  This now has a warning at 2GB and the hard-failure at 4GB.

Fixes samtools#1420.  Fixes samtools/samtools#1613
whitwham pushed a commit that referenced this issue Apr 13, 2022
This isn't permitted by the BAM specification, but was accepted by
earlier htslib release.  62f9909 added code to check the maximum
length.  This now has a warning at 2GB and the hard-failure at 4GB.

Fixes #1420.  Fixes samtools/samtools#1613
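
The commit message above describes a two-tier policy: warn once the header text exceeds 2 GB, and fail only at 4 GB. A rough sketch of such a check follows. This is not the actual HTSlib change; the enum, the function name, the messages, and the exact comparison boundaries are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical result codes; not HTSlib API. */
enum header_check { HEADER_FAIL = -1, HEADER_OK = 0, HEADER_WARN = 1 };

/* Hypothetical check applied before writing the BAM header text.
 * Assumed boundaries: above UINT32_MAX the length cannot be stored in the
 * 32-bit l_text field at all, so fail; above INT32_MAX it is storable but
 * outside the specification's bound, so warn and carry on. */
static enum header_check check_header_text_len(uint64_t text_len) {
    if (text_len > UINT32_MAX) {
        fprintf(stderr, "error: header text too long for BAM (4 GB limit)\n");
        return HEADER_FAIL;
    }
    if (text_len > (uint64_t)INT32_MAX) {
        fprintf(stderr, "warning: header text exceeds the 2 GB BAM specification limit\n");
        return HEADER_WARN;
    }
    return HEADER_OK;
}
```

Under these assumptions a 2.5 GB header would be written with a warning, while anything that cannot be represented in 32 bits is still rejected.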