Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

jmarshall · 2020-03-03T12:10:43Z

Revisit PR #460 (see conversation there), adding notes in the existing Value table columns indicating length/count fields whose maximum values are still (well) within the int32_t range convenient for e.g. Java arrays. This clarifies that, while the fields are naturally described as uint32_t, they are not required to be implemented using such a type if that is inconvenient in your implementation language.

~~OTOH this PR explicitly raises the l_text limit from 2GiB to 4GiB:~~
[Decided during April 2nd meeting to leave l_text as is for this PR, perhaps to be revisited separately later.]

HTSlib already handles 4GiB of BAM header text.
HTSJDK represents SAM/BAM/etc headers as a collection of separate headers so no representation change is needed to deal with this l_text range increase. However BAMFileReader/Writer.java's readHeader() and writeHeader() currently operate on the full header text as a String so would require some rearranging internally to deal with files with such large headers.

hts-specs-bot · 2020-03-03T12:13:01Z

Changed PDFs as of 36b71a1: SAMv1 (diff).

jkbonfield · 2020-03-03T15:19:43Z

I'm not convinced by "memory", although I'm struggling to think of a better term. Memory seems weak, as I may (will!) have well over 4Gb spare to play with. Perhaps limited?

I also don't like having some fields with their maximum sizes being displayed in the Value column, which for now is for fixed values "BAM\1" or default values "[-1]", given we have many other fields where the limits are written in the description. e.g. "Reference sequence ID, −1 ≤ refID < n ref" or "−1 ≤ next refID < n ref". We ought to be consistent here.

I think it's also actually useful to be able to state for example that 0 <= n_ref < 2^{31} as it emphasises that a BAM file with no references is still valid.

Why is block_size 31-bit, or at least memory marked? Does Java also load it into a string? it would seem an unobvious choice given the binary nature of the data. Agreed it's highly unlikely to be large. Is that the message here, that a typical file won't trigger problems? if so neither memory nor limited are the correct term to use, but it's not what should be in a spec either. Maybe I don't understand all of Java's woes here. Is it simply unable to have any block of memory more than 2Gb in total, and thus cannot even read a large block of data without splitting into multiple memory allocations?

If I had to choose, I think for brevity sake and that an extra column may be unwieldy, I'd put the two 2^{31} limits into the description field. Note this is also formatted poorly as the exponent is overflowing the cell bounding slightly (although the same was already happening in footnotes).

jmarshall · 2020-03-09T09:39:13Z

Why is block_size 31-bit, or at least memory marked?

It's memory marked because you're gonna load it into memory. The thinking was: headers are 1-set-per-file so mostly anything goes, but for read records once you've got a hundred or more in memory at once, that's starting to be a lot of memory. However most things do mostly just stream one at a time…

But you're quite right that the memory constraint is the minor consideration; the main thing for these field sizes is simply the reality of the data represented. I'll carry on thinking about how to express that (and improve this overall, as discussed), even if it's just “Note A” in the table with the explanation in the note.

jmarshall · 2020-04-30T15:00:18Z

Changed memory to limited, and removed the lifting of l_text to 4GB.

hts-specs-bot · 2020-04-30T15:01:03Z

Changed PDFs as of 2b863b3: SAMv1 (diff).

jkbonfield · 2020-04-30T16:13:35Z

Thanks for the update. I like it and am happy for this to go in as is.

…amtools#476) These fields are by definition unsigned, but add notes on their usual ranges -- which mostly do not extend beyond the int32_t maximum and mostly are much smaller. Co-authored-by: John Marshall <[email protected]>

jmarshall · 2020-05-05T08:50:13Z

@yfarjoun? I'd like to merge this this week if possible.

yfarjoun · 2020-05-05T13:41:34Z

sorry for the holdup 👍 from me.

jmarshall · 2020-05-05T14:54:48Z

Thanks all; merged now (and SAMv1.pdf rebuilt with this and @RG-PL:DNBSEQ).

I have the 4 GiB l_text proposed change stashed separately and might raise it as its own PR one day — but there's no hurry 😄

jmarshall added the sam label Mar 3, 2020

jmarshall mentioned this pull request Mar 3, 2020

Signedness for fields that should only represent positive values such as lengths #460

Closed

jmarshall force-pushed the uint32ranges branch 2 times, most recently from 6df82fa to 2b863b3 Compare April 30, 2020 14:59

samtools deleted a comment from hts-specs-bot Apr 30, 2020

jmarshall requested review from jkbonfield and yfarjoun April 30, 2020 15:16

jkbonfield approved these changes Apr 30, 2020

View reviewed changes

jmarshall force-pushed the uint32ranges branch from 2b863b3 to 31d6e44 Compare May 5, 2020 14:00

jmarshall merged commit 31d6e44 into samtools:master May 5, 2020

github-pages bot temporarily deployed to github-pages May 5, 2020 14:02 Inactive

jmarshall deleted the uint32ranges branch May 5, 2020 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

jmarshall commented Mar 3, 2020 •

edited

Loading

hts-specs-bot commented Mar 3, 2020 •

edited by jmarshall

Loading

jkbonfield commented Mar 3, 2020 •

edited

Loading

jmarshall commented Mar 9, 2020 •

edited

Loading

jmarshall commented Apr 30, 2020

hts-specs-bot commented Apr 30, 2020

jkbonfield commented Apr 30, 2020

jmarshall commented May 5, 2020

yfarjoun commented May 5, 2020

jmarshall commented May 5, 2020

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

Conversation

jmarshall commented Mar 3, 2020 • edited Loading

hts-specs-bot commented Mar 3, 2020 • edited by jmarshall Loading

jkbonfield commented Mar 3, 2020 • edited Loading

jmarshall commented Mar 9, 2020 • edited Loading

jmarshall commented Apr 30, 2020

hts-specs-bot commented Apr 30, 2020

jkbonfield commented Apr 30, 2020

jmarshall commented May 5, 2020

yfarjoun commented May 5, 2020

jmarshall commented May 5, 2020

jmarshall commented Mar 3, 2020 •

edited

Loading

hts-specs-bot commented Mar 3, 2020 •

edited by jmarshall

Loading

jkbonfield commented Mar 3, 2020 •

edited

Loading

jmarshall commented Mar 9, 2020 •

edited

Loading