Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signedness for fields that should only represent positive values such as lengths #460

Closed
wants to merge 3 commits into from
Closed

Conversation

brainstorm
Copy link
Contributor

@brainstorm brainstorm commented Dec 10, 2019

While writing an streaming parser for BAI files, Rust made me realise that n_refs should really be an unsigned integer (among many other fields, for very similar reasons):

error[E0277]: the trait bound `i32: nom::ToUsize` is not satisfied
   --> src/parser_bai.rs:38:10
    |
38  |     take(n_refs)(input)
    |          ^^^^^^ the trait `nom::ToUsize` is not implemented for `i32`
    | 
   ::: /Users/romanvg/.cargo/registry/src/github.com-1ecc6299db9ec823/nom-5.0.1/src/bytes/streaming.rs:385:6
    |
385 |   C: ToUsize,
    |      ------- required by this bound in `nom::bytes::streaming::take`

What do you actually expect to happen if i.e, the n_ref is represented as a negative integer?

I git blamed the BAI section of the spec and it has not been modified since ~10 years ago by @lh3. Does this make sense or are there some obscure implementation details or reasons behind that I am missing?

I also pointed out in the changeset that correctness and safety is preferred over "freedom of implementation" (as it reads now) in an attempt to close future cans of worms...

@hts-specs-bot
Copy link

Changed PDFs as of 5afde85: SAMv1 (diff).

@brainstorm
Copy link
Contributor Author

brainstorm commented Dec 10, 2019

Perhaps worth including in #283 (ongoing?) efforts, @d-cameron?

@jkbonfield jkbonfield added the sam label Dec 10, 2019
@jmarshall
Copy link
Member

The usual reason given for these things being described as plain (signed) integers is Java's historic disbelief in the concept of unsigned.

See also previous discussion in #274 (comment) and likely elsewhere too.

Re the “The signedness and size used for each integer value” footnote: I think you misunderstand what this footnote is about. It's about choosing which of cCsSiI to use when writing out an integer tag field in BAM.

@jkbonfield
Copy link
Contributor

Personally I'd prefer the sizes to be unsigned in the BAM format table. If an implementation has issues with that then it is purely an implementation limitation and not an issue for the specification.

Besides even in Java it's an easy fix - just use 64-bit signed integers which are large enough to cope with the 32-bit unsigned values here.

@jmarshall
Copy link
Member

I'm inclined to agree that these naturally size_t-style values would ideally be described as unsigned, and would support a careful change in that direction.

However it behoves the specification to be conveniently implementable in common implementation languages. And it's a bit academic having us two C programmers discussing what's convenient for the other camp 😄

@tfenne
Copy link
Member

tfenne commented Dec 12, 2019

Yeah - unsigned integer types are present in Java but they are a huge pain and are barely supported. It can be made to work, but no native arithmetic operators work correctly IIRC, so doing any math on unsigned numbers (even comparisons) becomes challenging. You're right that an unsigned 32-bit can be stored as a signed 64-bit int, but that same solution doesn't work for 64-bit numbers (there's no 96 or 128 integer type in Java).

Perhaps a compromise would be to label these fields as "unsigned 31-bit"? Therefore making it clear that they can only be positive integers, but restricting the range to what's easily workable within the two reference implementations?

@brainstorm
Copy link
Contributor Author

brainstorm commented Dec 13, 2019

I'm with James: a specification should reflect a "blueprint" and not be (too) influenced by the limitations and quirks of one of the possible implementations (Java in our current discussion).

@jmarshall I see what you mean about the footnote... re-reading it seems like removing the footnote altogether could be a better to avoid confusion as the text is already clear enough?

@tfenne I hear you, after reading about it a bit about the topic it seems tricky to get it right with Java. OTOH, I'm a bit repelled by the 31-bit suggestion as it sounds like settling for middle endian in a debate between big and little endian: a pretty bad middle ground to be in... perhaps not the best analogy, but I hope I'm getting my message across?

What are the steps to get this PR approved and merged for good? Is it a function of time? Should we ask more opinions on different people and languages? Etc... just curious about the "formal" process to get these kind of changes in.

I think that getting it merged sooner than later might avoid future faulty implementations?

/cc @ohofmann

@jmarshall
Copy link
Member

Re the footnote: it was added due to #274 (comment), in which several people felt it was important to point out that implementations can make different choices here. If you have suggestions for changing the text to avoid your confusion, we will consider them.

@jkbonfield
Copy link
Contributor

I understand it may be too hard to make it worthwhile supporting unsigned 32-bit quantities in Java, but we have implementation specific limits already in various forms. The spec should be clear in intent, which means removing the ability to have negative sizes.

So how about this as an appropriate middle ground. We use unsigned 32-bit values, as in this PR, but add a footnote to explain that some implementations may only support 31-bit values, so we have wriggle room and also warn people that they can't blindly assume the full 32-bit range will be available with all tools.

I feel at the moment we've got the emphasis back to front.

@jkbonfield
Copy link
Contributor

Argh! Yes I'm going down with a cold and my brain is frazzled. Too many tabs, and too much brain scrambling for trying to figure out how bcf works internally! Sorry. I'll migrate them over.

@brainstorm
Copy link
Contributor Author

@jmarshall, gotcha, I've rolled back the footnote as it was also sort of unrelated with this PR change set.

@jkbonfield, as I was writing in the "middle ground 31-bit" footnote you suggested, I realised that even if it's tricky or impractical to have unsigned support in Java, it is not impossible since there are BigInteger libs (clumsy, but they exist)... and there might even be optimisations (even hw accelerators?) for those types and accompanying arithmetics libs anyway.

On the one hand I suspect that relatively well established implementations like htsjdk will be doing "business as usual" and stick to their current internal representations anyway, so there's no harm done for them?

On the other hand, new implementations should not fall into the same signed traps and a clear, "footnote/gotcha"-less spec would help while new programmers cruise towards good and unambiguous implementations... for the record, not picking on htsjdk here, I've used it myself and it's good stuff! ;)

@hts-specs-bot
Copy link

Changed PDFs as of 3355653: SAMv1 (diff).

@jmarshall
Copy link
Member

jmarshall commented Jan 9, 2020

[This is an incomplete reply from before the Christmas break. The point is: (1) Java (and likely other languages) has actual limitations besides not having a uint32_t type that would make changing some of these fields to unsigned unreasonable; (2) each of these fields should be considered individually — for some of them the reasonable range is much less than 231, let alone 232.

The reason for the proposed changes is to avoid error situations. Very laudable. But defining these error situations out of existence by saying these values are (sometimes-)unfeasibly large instead of negative is not the only way to define what to do in these error situations.]


There are different categories of implementation limitations. In general, a specification is a compromise between a Platonic ideal blueprint and its implementations. If that statement is not taken as a given, then the specification is consigning itself to irrelevance.

In this case, your PR does a blanket s/int32_t/uint32_t/ on a bunch of fields of various kinds — but each kind needs to be considered individually on its merits.

This is complicated by the fact that int32_t in this text is being used to state two orthogonal properties: (1) that the field takes up 4 bytes = 32 bits in the file; (2) a statement about the valid range of values of the field. (So we couldn't just change (some of) them to uint31_t as that would be confusing as to how much space they take up.)

However you and your Rust compiler just want to know what to do if you see a negative number in these fields. Fair enough, but defining this error handling problem out of existence by changing the fields to uint32_t so all values are valid is only one approach to the problem.

Other choices are to detect negative values and throw errors, or to consider the field to be unsigned anyway.

Some of these fields are used to allocate memory buffers or arrays of objects, so extending them from 2Gb to 4Gb may be an academic exercise as actually exercising that extra range may just lead immediately to out-of-memory errors. Also there are constraints in various languages that make that impossible or unwise:

  • Java arrays are indexed by signed 32-bit integers, so buffers sized by one of these fields made unsigned would not be able to be implemented as arrays. Similarly with Java's String class.

  • Lazy C programmers use int rather than size_t for indices and sizes, so making these fields unsigned risks exposing these latent bugs.

Let's look at the each of the fields suggested for change in this PR in turn:

  • l_text is the length of the 1-per-BAM-file header text. As a 1-per-file memory allocation, it's not unreasonable to extend this to 4Gb where needed. In fact samtools has always as a QoI matter considered this a uint32_t (which has led to lazy int bugs in the past, see Core dump when processing large sam header samtools#67). OTOH this would make it unimplementable as an array or String in some languages.

  • n_ref allocates an array of header objects. But more to the point: do you anticipate mapping to 4 billion contigs? If so, how are you going to squeeze them into refID/next_refID which is necessarily signed (as it has a special meaning for -1)?

  • l_name is the length of a string. Do you really need a chromosome name of more than 2GB?

  • l_ref is a n_ref-per-file chromosome length. It's not used to allocate arrays so inviting Java to store it as a 64-bit long would not be unreasonable. But using all 32 bits for positions would mean that TLEN would in theory need 33 bits and risk other bugs. In fact HTSlib/samtools as a QoI matter does consider this a uint32_t (at least, some code does), but really creatures with assembled chromosomes longer than 2Gb will soon have them longer than 4Gb too so IMHO there's not a lot of point going to unsigned here rather than just going to 64-bit positions, but that's a major BAM bump.

  • l_seq allocates arrays of bases and base qualities. Do you really need to store individual reads greater than 2Gb in length in a BAM file? Similarly block_size and B-array's count allocate arrays and invite the question of whether anyone has or foresees a need to process individual SAM alignment records in excess of 2GB.

  • [etc]

…mits.

The main tables should describe sign and on-disk sizes, which they now
do, but on disk size is not the same thing as range of valid values.

I feel this is clearer than the alternative option of keeping some as
int32_t and needing to define their valid ranges.
@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 5, 2020

I have an additional commit adding to this in https://github.com/jkbonfield/hts-specs/tree/signedness_fixes

@umccr I can either make another PR and close this or if you agree to my changes you could pull in from my copy of this branch.

Basically the change jkbonfield@535dc6b is to add explanatory text detailing that while we're using uint32_t to define the value as unsigned and occupying 4 bytes, it does not equate to stating that all 4 billion values are in the valid usable range.

I've defined it rather vaguely on purpose, as defining the exact ranges is hard and debatable.

For example, l_ref in BAM could be 1 billion and work fine provided we don't have a corresponding text header containing the SQ lines. Doing so would put the SQ header over 2 billion (over 4 billion in fact). The practical limit for having a text header in BAM would be around 100 million references. For CRAM which needs M5 tags it's around 40 million currently. For downstream BCF it's 75 million, but the ##contigs lines are optional (albeit via questionable behaviour). I don't feel we want a full treatise on the appropriate ranges for all different formats we may wish to convert to.

Instead it's sufficient to point out that there will be implementation defined limits and for sure anything beyond 31-bit is liable to commonly cause problems. Edit: but I do still feel it is appropriate for an unsigned value to be using uint32_t instead of int32_t.

@hts-specs-bot
Copy link

Changed PDFs as of 535dc6b: SAMv1 (diff).

@brainstorm
Copy link
Contributor Author

brainstorm commented Feb 11, 2020

Thanks @jkbonfield, I just merged your suggestion in!

@jmarshall:

There are different categories of implementation limitations. In general, a specification is a compromise between a Platonic ideal blueprint and its implementations. If that statement is not taken as a given, then the specification is consigning itself to irrelevance.

I hear you, perhaps I err more on the side of blueprint, not an ideal one, but one that is not too influenced by particular implementation languages. Be it big (Java) or small (Rust) if that compromise is stretched towards the implementation pains (as it has been for ~10 years now AFAICT), then footnotes with implementation-specific details come up on a spec that should otherwise be simple to "parse" as a developer.

For instance, implementation footnote details about Java compression class choices and other known language-specific bugs belong to sites like StackOverflow or similar, not on an official spec.

l_name is the length of a string. Do you really need a chromosome name of more than 2GB?

I understand what you mean and the possible memory implications if misused, but this is a length, therefore it should just be a positive integer.

Upon re-reading the thread I do feel that @jkbonfield's recent comments on 535dc6b#diff-3a7f1ad5ddd700c3d2fc07c6d77bf537R950 reflect and cover this and most of your concerns well, @jmarshall?

@jkbonfield
Copy link
Contributor

I too agree that implementation details should not be a normative part of the specification. However it's no good being the only perfect implementation that can cope with a 4Gb text header size if your files are readable by no other software. We have to be pragmatic and acknowledge what have essentially become reference implementations. Hence notes indicating known implementation defects can be usefully informative (not normative).

We discussed the actual limits a bit on our monthly call. It's possible we may wish to be tighter and describe per field what range of values are sensible, down to the current implementation limits, however it is not always so clear cut. Eg the maximum useful size of n_ref is one that doesn't cause l_text to overflow. That depends in turn on the length of sequence identifiers and whether or not you have extra tags in the SQ lines such as M5 strings. How therefore do we document the range of permissible values in n_ref? It's just not worth it IMO as something else hits a limit first (`l_text).

We could however specify limits on the things which will likely cap out first, so l_text, l_name (just something sane would help to stop people naming their reference the same as its sequence!), block_size, pos, etc. Other fields could have ranges defined in terms of each other, which could clarify the spec. Eg the legal range for refID is -1 to n_ref-1 inclusive, as is already documented.

I could see us maybe adding an additional column to the table with limits listed where appropriate. However that's perhaps something for another day and I can live with an interim blanket statement indicating data size indicates space on disk and not the full range of legal values.

@jmarshall
Copy link
Member

jmarshall commented Mar 3, 2020

For instance, implementation footnote details about Java compression class choices and other known language-specific bugs belong to sites like StackOverflow or similar, not on an official spec.

Quite the contrary: non-normative footnotes explaining unfamiliar items or forewarning of common mistakes are present in this specification because they are useful to its audience.

Upon re-reading the thread I do feel that @jkbonfield's recent comments on 535dc6b reflect and cover this and most of your concerns well, @jmarshall?

If you actually read what I write, you will realise that you don't have to spend time convincing me. See e.g. #460 (comment).

However I think we can do better than the paragraph proposed in James's commit, which is a bit vague: Who should take care? How? What actually are the language issues?

We can be far more usefully precise about the ranges for each particular field proposed for changing to uint32_t, and in fact there is already a Values column in the tables that can be used for the purpose. So as an alternative to this PR, please see PR #476 which extends the s/int32_t/uint32_t/ parts of this PR's initial commit with individual specific range details.

[From @jkbonfield] The practical limit for having a text header in BAM would be around 100 million references. For CRAM which needs M5 tags it's around 40 million currently. For downstream BCF it's 75 million, but the ##contigs lines are optional (albeit via questionable behaviour). I don't feel we want a full treatise on the appropriate ranges for all different formats we may wish to convert to.

Agreed that such calculations are entertaining but that we don't need the full treatise in the spec. However the more conservative range limit of 231 is consistent with these calculations and itself is useful to have in the spec, as it tells Java programmers that they can just use a plain int for this field after all. Hence the proposed text in PR #476.

jmarshall added a commit to jmarshall/hts-specs that referenced this pull request May 5, 2020
…amtools#476)

These fields are by definition unsigned, but add notes on their usual
ranges -- which mostly do not extend beyond the int32_t maximum and
mostly are much smaller.

Co-authored-by: John Marshall <[email protected]>
@jmarshall
Copy link
Member

Thanks for the suggestion. Proposed text superseded by that in PR #476, which has now been merged.

@jmarshall jmarshall closed this May 5, 2020
@brainstorm
Copy link
Contributor Author

brainstorm commented May 5, 2020

@jmarshall I find it a bit strange that you had to open another PR midway when this one was following its course towards a solution (you could have added your commits in here, retaining authorship of all parties involved), but thanks anyway! ;)

@jmarshall
Copy link
Member

In fact it is not strange at all for multiple approaches to exist as PRs simultaneously.

If you care to look at the commit merged (31d6e44), you will see that the authorship of all relevant parties was retained. It goes without saying that this would be done as a matter of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants