-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Signedness for fields that should only represent positive values such as lengths #460
Conversation
…values such as lengths
Perhaps worth including in #283 (ongoing?) efforts, @d-cameron? |
The usual reason given for these things being described as plain (signed) integers is Java's historic disbelief in the concept of See also previous discussion in #274 (comment) and likely elsewhere too. Re the “The signedness and size used for each integer value” footnote: I think you misunderstand what this footnote is about. It's about choosing which of |
Personally I'd prefer the sizes to be unsigned in the BAM format table. If an implementation has issues with that then it is purely an implementation limitation and not an issue for the specification. Besides even in Java it's an easy fix - just use 64-bit signed integers which are large enough to cope with the 32-bit unsigned values here. |
I'm inclined to agree that these naturally However it behoves the specification to be conveniently implementable in common implementation languages. And it's a bit academic having us two C programmers discussing what's convenient for the other camp 😄 |
Yeah - unsigned integer types are present in Java but they are a huge pain and are barely supported. It can be made to work, but no native arithmetic operators work correctly IIRC, so doing any math on unsigned numbers (even comparisons) becomes challenging. You're right that an unsigned 32-bit can be stored as a signed 64-bit int, but that same solution doesn't work for 64-bit numbers (there's no 96 or 128 integer type in Java). Perhaps a compromise would be to label these fields as "unsigned 31-bit"? Therefore making it clear that they can only be positive integers, but restricting the range to what's easily workable within the two reference implementations? |
I'm with James: a specification should reflect a "blueprint" and not be (too) influenced by the limitations and quirks of one of the possible implementations (Java in our current discussion). @jmarshall I see what you mean about the footnote... re-reading it seems like removing the footnote altogether could be a better to avoid confusion as the text is already clear enough? @tfenne I hear you, after reading about it a bit about the topic it seems tricky to get it right with Java. OTOH, I'm a bit repelled by the 31-bit suggestion as it sounds like settling for middle endian in a debate between big and little endian: a pretty bad middle ground to be in... perhaps not the best analogy, but I hope I'm getting my message across? What are the steps to get this PR approved and merged for good? Is it a function of time? Should we ask more opinions on different people and languages? Etc... just curious about the "formal" process to get these kind of changes in. I think that getting it merged sooner than later might avoid future faulty implementations? /cc @ohofmann |
Re the footnote: it was added due to #274 (comment), in which several people felt it was important to point out that implementations can make different choices here. If you have suggestions for changing the text to avoid your confusion, we will consider them. |
I understand it may be too hard to make it worthwhile supporting unsigned 32-bit quantities in Java, but we have implementation specific limits already in various forms. The spec should be clear in intent, which means removing the ability to have negative sizes. So how about this as an appropriate middle ground. We use unsigned 32-bit values, as in this PR, but add a footnote to explain that some implementations may only support 31-bit values, so we have wriggle room and also warn people that they can't blindly assume the full 32-bit range will be available with all tools. I feel at the moment we've got the emphasis back to front. |
Argh! Yes I'm going down with a cold and my brain is frazzled. Too many tabs, and too much brain scrambling for trying to figure out how bcf works internally! Sorry. I'll migrate them over. |
@jmarshall, gotcha, I've rolled back the footnote as it was also sort of unrelated with this PR change set. @jkbonfield, as I was writing in the "middle ground 31-bit" footnote you suggested, I realised that even if it's tricky or impractical to have unsigned support in Java, it is not impossible since there are BigInteger libs (clumsy, but they exist)... and there might even be optimisations (even hw accelerators?) for those types and accompanying arithmetics libs anyway. On the one hand I suspect that relatively well established implementations like htsjdk will be doing "business as usual" and stick to their current internal representations anyway, so there's no harm done for them? On the other hand, new implementations should not fall into the same signed traps and a clear, "footnote/gotcha"-less spec would help while new programmers cruise towards good and unambiguous implementations... for the record, not picking on htsjdk here, I've used it myself and it's good stuff! ;) |
[This is an incomplete reply from before the Christmas break. The point is: (1) Java (and likely other languages) has actual limitations besides not having a The reason for the proposed changes is to avoid error situations. Very laudable. But defining these error situations out of existence by saying these values are (sometimes-)unfeasibly large instead of negative is not the only way to define what to do in these error situations.] There are different categories of implementation limitations. In general, a specification is a compromise between a Platonic ideal blueprint and its implementations. If that statement is not taken as a given, then the specification is consigning itself to irrelevance. In this case, your PR does a blanket This is complicated by the fact that However you and your Rust compiler just want to know what to do if you see a negative number in these fields. Fair enough, but defining this error handling problem out of existence by changing the fields to Other choices are to detect negative values and throw errors, or to consider the field to be unsigned anyway. Some of these fields are used to allocate memory buffers or arrays of objects, so extending them from 2Gb to 4Gb may be an academic exercise as actually exercising that extra range may just lead immediately to out-of-memory errors. Also there are constraints in various languages that make that impossible or unwise:
Let's look at the each of the fields suggested for change in this PR in turn:
|
…mits. The main tables should describe sign and on-disk sizes, which they now do, but on disk size is not the same thing as range of valid values. I feel this is clearer than the alternative option of keeping some as int32_t and needing to define their valid ranges.
I have an additional commit adding to this in https://github.com/jkbonfield/hts-specs/tree/signedness_fixes @umccr I can either make another PR and close this or if you agree to my changes you could pull in from my copy of this branch. Basically the change jkbonfield@535dc6b is to add explanatory text detailing that while we're using uint32_t to define the value as unsigned and occupying 4 bytes, it does not equate to stating that all 4 billion values are in the valid usable range. I've defined it rather vaguely on purpose, as defining the exact ranges is hard and debatable. For example, Instead it's sufficient to point out that there will be implementation defined limits and for sure anything beyond 31-bit is liable to commonly cause problems. Edit: but I do still feel it is appropriate for an unsigned value to be using uint32_t instead of int32_t. |
Thanks @jkbonfield, I just merged your suggestion in!
I hear you, perhaps I err more on the side of blueprint, not an ideal one, but one that is not too influenced by particular implementation languages. Be it big (Java) or small (Rust) if that compromise is stretched towards the implementation pains (as it has been for ~10 years now AFAICT), then footnotes with implementation-specific details come up on a spec that should otherwise be simple to "parse" as a developer. For instance, implementation footnote details about Java compression class choices and other known language-specific bugs belong to sites like StackOverflow or similar, not on an official spec.
I understand what you mean and the possible memory implications if misused, but this is a length, therefore it should just be a positive integer. Upon re-reading the thread I do feel that @jkbonfield's recent comments on 535dc6b#diff-3a7f1ad5ddd700c3d2fc07c6d77bf537R950 reflect and cover this and most of your concerns well, @jmarshall? |
I too agree that implementation details should not be a normative part of the specification. However it's no good being the only perfect implementation that can cope with a 4Gb text header size if your files are readable by no other software. We have to be pragmatic and acknowledge what have essentially become reference implementations. Hence notes indicating known implementation defects can be usefully informative (not normative). We discussed the actual limits a bit on our monthly call. It's possible we may wish to be tighter and describe per field what range of values are sensible, down to the current implementation limits, however it is not always so clear cut. Eg the maximum useful size of We could however specify limits on the things which will likely cap out first, so I could see us maybe adding an additional column to the table with limits listed where appropriate. However that's perhaps something for another day and I can live with an interim blanket statement indicating data size indicates space on disk and not the full range of legal values. |
Quite the contrary: non-normative footnotes explaining unfamiliar items or forewarning of common mistakes are present in this specification because they are useful to its audience.
If you actually read what I write, you will realise that you don't have to spend time convincing me. See e.g. #460 (comment). However I think we can do better than the paragraph proposed in James's commit, which is a bit vague: Who should take care? How? What actually are the language issues? We can be far more usefully precise about the ranges for each particular field proposed for changing to
Agreed that such calculations are entertaining but that we don't need the full treatise in the spec. However the more conservative range limit of 231 is consistent with these calculations and itself is useful to have in the spec, as it tells Java programmers that they can just use a plain |
…amtools#476) These fields are by definition unsigned, but add notes on their usual ranges -- which mostly do not extend beyond the int32_t maximum and mostly are much smaller. Co-authored-by: John Marshall <[email protected]>
Thanks for the suggestion. Proposed text superseded by that in PR #476, which has now been merged. |
@jmarshall I find it a bit strange that you had to open another PR midway when this one was following its course towards a solution (you could have added your commits in here, retaining authorship of all parties involved), but thanks anyway! ;) |
In fact it is not strange at all for multiple approaches to exist as PRs simultaneously. If you care to look at the commit merged (31d6e44), you will see that the authorship of all relevant parties was retained. It goes without saying that this would be done as a matter of course. |
While writing an streaming parser for BAI files, Rust made me realise that
n_refs
should really be an unsigned integer (among many other fields, for very similar reasons):What do you actually expect to happen if i.e, the n_ref is represented as a negative integer?
I git blamed the BAI section of the spec and it has not been modified since ~10 years ago by @lh3. Does this make sense or are there some obscure implementation details or reasons behind that I am missing?
I also pointed out in the changeset that correctness and safety is preferred over "freedom of implementation" (as it reads now) in an attempt to close future cans of worms...