Signedness for fields that should only represent positive values such as lengths #460

brainstorm · 2019-12-10T04:29:23Z

While writing an streaming parser for BAI files, Rust made me realise that n_refs should really be an unsigned integer (among many other fields, for very similar reasons):

error[E0277]: the trait bound `i32: nom::ToUsize` is not satisfied
   --> src/parser_bai.rs:38:10
    |
38  |     take(n_refs)(input)
    |          ^^^^^^ the trait `nom::ToUsize` is not implemented for `i32`
    | 
   ::: /Users/romanvg/.cargo/registry/src/github.com-1ecc6299db9ec823/nom-5.0.1/src/bytes/streaming.rs:385:6
    |
385 |   C: ToUsize,
    |      ------- required by this bound in `nom::bytes::streaming::take`

What do you actually expect to happen if i.e, the n_ref is represented as a negative integer?

I git blamed the BAI section of the spec and it has not been modified since ~10 years ago by @lh3. Does this make sense or are there some obscure implementation details or reasons behind that I am missing?

I also pointed out in the changeset that correctness and safety is preferred over "freedom of implementation" (as it reads now) in an attempt to close future cans of worms...

…values such as lengths

hts-specs-bot · 2019-12-10T04:31:19Z

Changed PDFs as of 5afde85: SAMv1 (diff).

brainstorm · 2019-12-10T04:40:35Z

Perhaps worth including in #283 (ongoing?) efforts, @d-cameron?

jmarshall · 2019-12-12T15:19:25Z

The usual reason given for these things being described as plain (signed) integers is Java's historic disbelief in the concept of unsigned.

See also previous discussion in #274 (comment) and likely elsewhere too.

Re the “The signedness and size used for each integer value” footnote: I think you misunderstand what this footnote is about. It's about choosing which of cCsSiI to use when writing out an integer tag field in BAM.

jkbonfield · 2019-12-12T15:30:38Z

Personally I'd prefer the sizes to be unsigned in the BAM format table. If an implementation has issues with that then it is purely an implementation limitation and not an issue for the specification.

Besides even in Java it's an easy fix - just use 64-bit signed integers which are large enough to cope with the 32-bit unsigned values here.

jmarshall · 2019-12-12T15:39:56Z

I'm inclined to agree that these naturally size_t-style values would ideally be described as unsigned, and would support a careful change in that direction.

However it behoves the specification to be conveniently implementable in common implementation languages. And it's a bit academic having us two C programmers discussing what's convenient for the other camp 😄

tfenne · 2019-12-12T18:49:39Z

Yeah - unsigned integer types are present in Java but they are a huge pain and are barely supported. It can be made to work, but no native arithmetic operators work correctly IIRC, so doing any math on unsigned numbers (even comparisons) becomes challenging. You're right that an unsigned 32-bit can be stored as a signed 64-bit int, but that same solution doesn't work for 64-bit numbers (there's no 96 or 128 integer type in Java).

Perhaps a compromise would be to label these fields as "unsigned 31-bit"? Therefore making it clear that they can only be positive integers, but restricting the range to what's easily workable within the two reference implementations?

brainstorm · 2019-12-13T10:15:24Z

I'm with James: a specification should reflect a "blueprint" and not be (too) influenced by the limitations and quirks of one of the possible implementations (Java in our current discussion).

@jmarshall I see what you mean about the footnote... re-reading it seems like removing the footnote altogether could be a better to avoid confusion as the text is already clear enough?

@tfenne I hear you, after reading about it a bit about the topic it seems tricky to get it right with Java. OTOH, I'm a bit repelled by the 31-bit suggestion as it sounds like settling for middle endian in a debate between big and little endian: a pretty bad middle ground to be in... perhaps not the best analogy, but I hope I'm getting my message across?

What are the steps to get this PR approved and merged for good? Is it a function of time? Should we ask more opinions on different people and languages? Etc... just curious about the "formal" process to get these kind of changes in.

I think that getting it merged sooner than later might avoid future faulty implementations?

/cc @ohofmann

jmarshall · 2019-12-13T13:17:43Z

Re the footnote: it was added due to #274 (comment), in which several people felt it was important to point out that implementations can make different choices here. If you have suggestions for changing the text to avoid your confusion, we will consider them.

jkbonfield · 2019-12-13T13:57:14Z

I understand it may be too hard to make it worthwhile supporting unsigned 32-bit quantities in Java, but we have implementation specific limits already in various forms. The spec should be clear in intent, which means removing the ability to have negative sizes.

So how about this as an appropriate middle ground. We use unsigned 32-bit values, as in this PR, but add a footnote to explain that some implementations may only support 31-bit values, so we have wriggle room and also warn people that they can't blindly assume the full 32-bit range will be available with all tools.

I feel at the moment we've got the emphasis back to front.

jkbonfield · 2019-12-13T14:38:36Z

Argh! Yes I'm going down with a cold and my brain is frazzled. Too many tabs, and too much brain scrambling for trying to figure out how bcf works internally! Sorry. I'll migrate them over.

brainstorm · 2019-12-13T20:18:58Z

@jmarshall, gotcha, I've rolled back the footnote as it was also sort of unrelated with this PR change set.

@jkbonfield, as I was writing in the "middle ground 31-bit" footnote you suggested, I realised that even if it's tricky or impractical to have unsigned support in Java, it is not impossible since there are BigInteger libs (clumsy, but they exist)... and there might even be optimisations (even hw accelerators?) for those types and accompanying arithmetics libs anyway.

On the one hand I suspect that relatively well established implementations like htsjdk will be doing "business as usual" and stick to their current internal representations anyway, so there's no harm done for them?

On the other hand, new implementations should not fall into the same signed traps and a clear, "footnote/gotcha"-less spec would help while new programmers cruise towards good and unambiguous implementations... for the record, not picking on htsjdk here, I've used it myself and it's good stuff! ;)

hts-specs-bot · 2019-12-13T20:23:11Z

Changed PDFs as of 3355653: SAMv1 (diff).

jmarshall · 2020-01-09T15:28:55Z

[This is an incomplete reply from before the Christmas break. The point is: (1) Java (and likely other languages) has actual limitations besides not having a uint32_t type that would make changing some of these fields to unsigned unreasonable; (2) each of these fields should be considered individually — for some of them the reasonable range is much less than 2³¹, let alone 2³².

The reason for the proposed changes is to avoid error situations. Very laudable. But defining these error situations out of existence by saying these values are (sometimes-)unfeasibly large instead of negative is not the only way to define what to do in these error situations.]

There are different categories of implementation limitations. In general, a specification is a compromise between a Platonic ideal blueprint and its implementations. If that statement is not taken as a given, then the specification is consigning itself to irrelevance.

In this case, your PR does a blanket s/int32_t/uint32_t/ on a bunch of fields of various kinds — but each kind needs to be considered individually on its merits.

This is complicated by the fact that int32_t in this text is being used to state two orthogonal properties: (1) that the field takes up 4 bytes = 32 bits in the file; (2) a statement about the valid range of values of the field. (So we couldn't just change (some of) them to uint31_t as that would be confusing as to how much space they take up.)

However you and your Rust compiler just want to know what to do if you see a negative number in these fields. Fair enough, but defining this error handling problem out of existence by changing the fields to uint32_t so all values are valid is only one approach to the problem.

Other choices are to detect negative values and throw errors, or to consider the field to be unsigned anyway.

Some of these fields are used to allocate memory buffers or arrays of objects, so extending them from 2Gb to 4Gb may be an academic exercise as actually exercising that extra range may just lead immediately to out-of-memory errors. Also there are constraints in various languages that make that impossible or unwise:

Java arrays are indexed by signed 32-bit integers, so buffers sized by one of these fields made unsigned would not be able to be implemented as arrays. Similarly with Java's String class.
Lazy C programmers use int rather than size_t for indices and sizes, so making these fields unsigned risks exposing these latent bugs.

Let's look at the each of the fields suggested for change in this PR in turn:

l_text is the length of the 1-per-BAM-file header text. As a 1-per-file memory allocation, it's not unreasonable to extend this to 4Gb where needed. In fact samtools has always as a QoI matter considered this a uint32_t (which has led to lazy int bugs in the past, see Core dump when processing large sam header samtools#67). OTOH this would make it unimplementable as an array or String in some languages.
n_ref allocates an array of header objects. But more to the point: do you anticipate mapping to 4 billion contigs? If so, how are you going to squeeze them into refID/next_refID which is necessarily signed (as it has a special meaning for -1)?
l_name is the length of a string. Do you really need a chromosome name of more than 2GB?
l_ref is a n_ref-per-file chromosome length. It's not used to allocate arrays so inviting Java to store it as a 64-bit long would not be unreasonable. But using all 32 bits for positions would mean that TLEN would in theory need 33 bits and risk other bugs. In fact HTSlib/samtools as a QoI matter does consider this a uint32_t (at least, some code does), but really creatures with assembled chromosomes longer than 2Gb will soon have them longer than 4Gb too so IMHO there's not a lot of point going to unsigned here rather than just going to 64-bit positions, but that's a major BAM bump.
l_seq allocates arrays of bases and base qualities. Do you really need to store individual reads greater than 2Gb in length in a BAM file? Similarly block_size and B-array's count allocate arrays and invite the question of whether anyone has or foresees a need to process individual SAM alignment records in excess of 2GB.
[etc]

…mits. The main tables should describe sign and on-disk sizes, which they now do, but on disk size is not the same thing as range of valid values. I feel this is clearer than the alternative option of keeping some as int32_t and needing to define their valid ranges.

jkbonfield · 2020-02-05T15:11:40Z

I have an additional commit adding to this in https://github.com/jkbonfield/hts-specs/tree/signedness_fixes

@umccr I can either make another PR and close this or if you agree to my changes you could pull in from my copy of this branch.

Basically the change jkbonfield@535dc6b is to add explanatory text detailing that while we're using uint32_t to define the value as unsigned and occupying 4 bytes, it does not equate to stating that all 4 billion values are in the valid usable range.

I've defined it rather vaguely on purpose, as defining the exact ranges is hard and debatable.

For example, l_ref in BAM could be 1 billion and work fine provided we don't have a corresponding text header containing the SQ lines. Doing so would put the SQ header over 2 billion (over 4 billion in fact). The practical limit for having a text header in BAM would be around 100 million references. For CRAM which needs M5 tags it's around 40 million currently. For downstream BCF it's 75 million, but the ##contigs lines are optional (albeit via questionable behaviour). I don't feel we want a full treatise on the appropriate ranges for all different formats we may wish to convert to.

Instead it's sufficient to point out that there will be implementation defined limits and for sure anything beyond 31-bit is liable to commonly cause problems. Edit: but I do still feel it is appropriate for an unsigned value to be using uint32_t instead of int32_t.

hts-specs-bot · 2020-02-11T03:38:59Z

Changed PDFs as of 535dc6b: SAMv1 (diff).

brainstorm · 2020-02-11T04:05:56Z

Thanks @jkbonfield, I just merged your suggestion in!

@jmarshall:

There are different categories of implementation limitations. In general, a specification is a compromise between a Platonic ideal blueprint and its implementations. If that statement is not taken as a given, then the specification is consigning itself to irrelevance.

I hear you, perhaps I err more on the side of blueprint, not an ideal one, but one that is not too influenced by particular implementation languages. Be it big (Java) or small (Rust) if that compromise is stretched towards the implementation pains (as it has been for ~10 years now AFAICT), then footnotes with implementation-specific details come up on a spec that should otherwise be simple to "parse" as a developer.

For instance, implementation footnote details about Java compression class choices and other known language-specific bugs belong to sites like StackOverflow or similar, not on an official spec.

l_name is the length of a string. Do you really need a chromosome name of more than 2GB?

I understand what you mean and the possible memory implications if misused, but this is a length, therefore it should just be a positive integer.

Upon re-reading the thread I do feel that @jkbonfield's recent comments on 535dc6b#diff-3a7f1ad5ddd700c3d2fc07c6d77bf537R950 reflect and cover this and most of your concerns well, @jmarshall?

jkbonfield · 2020-02-11T09:52:23Z

I too agree that implementation details should not be a normative part of the specification. However it's no good being the only perfect implementation that can cope with a 4Gb text header size if your files are readable by no other software. We have to be pragmatic and acknowledge what have essentially become reference implementations. Hence notes indicating known implementation defects can be usefully informative (not normative).

We discussed the actual limits a bit on our monthly call. It's possible we may wish to be tighter and describe per field what range of values are sensible, down to the current implementation limits, however it is not always so clear cut. Eg the maximum useful size of n_ref is one that doesn't cause l_text to overflow. That depends in turn on the length of sequence identifiers and whether or not you have extra tags in the SQ lines such as M5 strings. How therefore do we document the range of permissible values in n_ref? It's just not worth it IMO as something else hits a limit first (`l_text).

We could however specify limits on the things which will likely cap out first, so l_text, l_name (just something sane would help to stop people naming their reference the same as its sequence!), block_size, pos, etc. Other fields could have ranges defined in terms of each other, which could clarify the spec. Eg the legal range for refID is -1 to n_ref-1 inclusive, as is already documented.

I could see us maybe adding an additional column to the table with limits listed where appropriate. However that's perhaps something for another day and I can live with an interim blanket statement indicating data size indicates space on disk and not the full range of legal values.

jmarshall · 2020-03-03T12:14:37Z

For instance, implementation footnote details about Java compression class choices and other known language-specific bugs belong to sites like StackOverflow or similar, not on an official spec.

Quite the contrary: non-normative footnotes explaining unfamiliar items or forewarning of common mistakes are present in this specification because they are useful to its audience.

Upon re-reading the thread I do feel that @jkbonfield's recent comments on 535dc6b reflect and cover this and most of your concerns well, @jmarshall?

If you actually read what I write, you will realise that you don't have to spend time convincing me. See e.g. #460 (comment).

However I think we can do better than the paragraph proposed in James's commit, which is a bit vague: Who should take care? How? What actually are the language issues?

We can be far more usefully precise about the ranges for each particular field proposed for changing to uint32_t, and in fact there is already a Values column in the tables that can be used for the purpose. So as an alternative to this PR, please see PR #476 which extends the s/int32_t/uint32_t/ parts of this PR's initial commit with individual specific range details.

[From @jkbonfield] The practical limit for having a text header in BAM would be around 100 million references. For CRAM which needs M5 tags it's around 40 million currently. For downstream BCF it's 75 million, but the ##contigs lines are optional (albeit via questionable behaviour). I don't feel we want a full treatise on the appropriate ranges for all different formats we may wish to convert to.

Agreed that such calculations are entertaining but that we don't need the full treatise in the spec. However the more conservative range limit of 2³¹ is consistent with these calculations and itself is useful to have in the spec, as it tells Java programmers that they can just use a plain int for this field after all. Hence the proposed text in PR #476.

…amtools#476) These fields are by definition unsigned, but add notes on their usual ranges -- which mostly do not extend beyond the int32_t maximum and mostly are much smaller. Co-authored-by: John Marshall <[email protected]>

jmarshall · 2020-05-05T14:03:26Z

Thanks for the suggestion. Proposed text superseded by that in PR #476, which has now been merged.

brainstorm · 2020-05-05T14:15:32Z

@jmarshall I find it a bit strange that you had to open another PR midway when this one was following its course towards a solution (you could have added your commits in here, retaining authorship of all parties involved), but thanks anyway! ;)

jmarshall · 2020-05-05T14:22:54Z

In fact it is not strange at all for multiple approaches to exist as PRs simultaneously.

If you care to look at the commit merged (31d6e44), you will see that the authorship of all relevant parties was retained. It goes without saying that this would be done as a matter of course.

Revisiting signedness for fields that should only represent positive …

5afde85

…values such as lengths

jkbonfield added the sam label Dec 10, 2019

jkbonfield mentioned this pull request Dec 13, 2019

Jkbonfield vcf int64 samtools/htslib#1003

Closed

Rollback unrelated footnote about cCsSiI BAM fields

3355653

brainstorm mentioned this pull request Feb 3, 2020

Update to htslib 1.10.2 rust-bio/rust-htslib#184

Merged

5 tasks

jmarshall mentioned this pull request Mar 3, 2020

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

Merged

jmarshall closed this May 5, 2020

samtools locked as resolved and limited conversation to collaborators May 5, 2020

samtools unlocked this conversation May 11, 2020

brainstorm mentioned this pull request Jul 2, 2020

Add basic functionality to read indexed FASTA files (continued) rust-bio/rust-htslib#214

Merged

brainstorm referenced this pull request in zaeleus/noodles May 18, 2021

vcf/reader: Add query iterator

302033d

This was referenced Mar 14, 2022

[E::bam_hdr_write] Header too long for BAM format; [main_samview] failed to write the SAM header samtools/samtools#1613

Closed

Samtools can no longer write BAM files with more than 2GB of headers (since v1.10) samtools/htslib#1420

Closed

jmarshall mentioned this pull request Jul 18, 2023

Consider 64-bit values in BCF, BAM and CRAM #735

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Signedness for fields that should only represent positive values such as lengths #460

Signedness for fields that should only represent positive values such as lengths #460

brainstorm commented Dec 10, 2019 •

edited

Loading

hts-specs-bot commented Dec 10, 2019

brainstorm commented Dec 10, 2019 •

edited

Loading

jmarshall commented Dec 12, 2019

jkbonfield commented Dec 12, 2019

jmarshall commented Dec 12, 2019

tfenne commented Dec 12, 2019

brainstorm commented Dec 13, 2019 •

edited

Loading

jmarshall commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

brainstorm commented Dec 13, 2019

hts-specs-bot commented Dec 13, 2019

jmarshall commented Jan 9, 2020 •

edited

Loading

jkbonfield commented Feb 5, 2020 •

edited

Loading

hts-specs-bot commented Feb 11, 2020

brainstorm commented Feb 11, 2020 •

edited

Loading

jkbonfield commented Feb 11, 2020

jmarshall commented Mar 3, 2020 •

edited

Loading

jmarshall commented May 5, 2020

brainstorm commented May 5, 2020 •

edited

Loading

jmarshall commented May 5, 2020

Signedness for fields that should only represent positive values such as lengths #460

Signedness for fields that should only represent positive values such as lengths #460

Conversation

brainstorm commented Dec 10, 2019 • edited Loading

hts-specs-bot commented Dec 10, 2019

brainstorm commented Dec 10, 2019 • edited Loading

jmarshall commented Dec 12, 2019

jkbonfield commented Dec 12, 2019

jmarshall commented Dec 12, 2019

tfenne commented Dec 12, 2019

brainstorm commented Dec 13, 2019 • edited Loading

jmarshall commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

brainstorm commented Dec 13, 2019

hts-specs-bot commented Dec 13, 2019

jmarshall commented Jan 9, 2020 • edited Loading

jkbonfield commented Feb 5, 2020 • edited Loading

hts-specs-bot commented Feb 11, 2020

brainstorm commented Feb 11, 2020 • edited Loading

jkbonfield commented Feb 11, 2020

jmarshall commented Mar 3, 2020 • edited Loading

jmarshall commented May 5, 2020

brainstorm commented May 5, 2020 • edited Loading

jmarshall commented May 5, 2020

brainstorm commented Dec 10, 2019 •

edited

Loading

brainstorm commented Dec 10, 2019 •

edited

Loading

brainstorm commented Dec 13, 2019 •

edited

Loading

jmarshall commented Jan 9, 2020 •

edited

Loading

jkbonfield commented Feb 5, 2020 •

edited

Loading

brainstorm commented Feb 11, 2020 •

edited

Loading

jmarshall commented Mar 3, 2020 •

edited

Loading

brainstorm commented May 5, 2020 •

edited

Loading