Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change BAM and BAI length/count fields to uint32_t (PR#460 alternative) #476

Merged
merged 1 commit into from
May 5, 2020

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Mar 3, 2020

Revisit PR #460 (see conversation there), adding notes in the existing Value table columns indicating length/count fields whose maximum values are still (well) within the int32_t range convenient for e.g. Java arrays. This clarifies that, while the fields are naturally described as uint32_t, they are not required to be implemented using such a type if that is inconvenient in your implementation language.

OTOH this PR explicitly raises the l_text limit from 2GiB to 4GiB:
[Decided during April 2nd meeting to leave l_text as is for this PR, perhaps to be revisited separately later.]

  • HTSlib already handles 4GiB of BAM header text.

  • HTSJDK represents SAM/BAM/etc headers as a collection of separate headers so no representation change is needed to deal with this l_text range increase. However BAMFileReader/Writer.java's readHeader() and writeHeader() currently operate on the full header text as a String so would require some rearranging internally to deal with files with such large headers.

@hts-specs-bot
Copy link

hts-specs-bot commented Mar 3, 2020

Changed PDFs as of 36b71a1: SAMv1 (diff).

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 3, 2020

I'm not convinced by "memory", although I'm struggling to think of a better term. Memory seems weak, as I may (will!) have well over 4Gb spare to play with. Perhaps limited?

I also don't like having some fields with their maximum sizes being displayed in the Value column, which for now is for fixed values "BAM\1" or default values "[-1]", given we have many other fields where the limits are written in the description. e.g. "Reference sequence ID, −1 ≤ refID < n ref" or "−1 ≤ next refID < n ref". We ought to be consistent here.

I think it's also actually useful to be able to state for example that 0 <= n_ref < 2^{31} as it emphasises that a BAM file with no references is still valid.

Why is block_size 31-bit, or at least memory marked? Does Java also load it into a string? it would seem an unobvious choice given the binary nature of the data. Agreed it's highly unlikely to be large. Is that the message here, that a typical file won't trigger problems? if so neither memory nor limited are the correct term to use, but it's not what should be in a spec either. Maybe I don't understand all of Java's woes here. Is it simply unable to have any block of memory more than 2Gb in total, and thus cannot even read a large block of data without splitting into multiple memory allocations?

If I had to choose, I think for brevity sake and that an extra column may be unwieldy, I'd put the two 2^{31} limits into the description field. Note this is also formatted poorly as the exponent is overflowing the cell bounding slightly (although the same was already happening in footnotes).

@jmarshall
Copy link
Member Author

jmarshall commented Mar 9, 2020

Why is block_size 31-bit, or at least memory marked?

It's memory marked because you're gonna load it into memory. The thinking was: headers are 1-set-per-file so mostly anything goes, but for read records once you've got a hundred or more in memory at once, that's starting to be a lot of memory. However most things do mostly just stream one at a time…

But you're quite right that the memory constraint is the minor consideration; the main thing for these field sizes is simply the reality of the data represented. I'll carry on thinking about how to express that (and improve this overall, as discussed), even if it's just “Note A” in the table with the explanation in the note.

@jmarshall jmarshall force-pushed the uint32ranges branch 2 times, most recently from 6df82fa to 2b863b3 Compare April 30, 2020 14:59
@jmarshall
Copy link
Member Author

Changed memory to limited, and removed the lifting of l_text to 4GB.

@hts-specs-bot
Copy link

Changed PDFs as of 2b863b3: SAMv1 (diff).

@samtools samtools deleted a comment from hts-specs-bot Apr 30, 2020
@jkbonfield
Copy link
Contributor

Thanks for the update. I like it and am happy for this to go in as is.

…amtools#476)

These fields are by definition unsigned, but add notes on their usual
ranges -- which mostly do not extend beyond the int32_t maximum and
mostly are much smaller.

Co-authored-by: John Marshall <[email protected]>
@jmarshall
Copy link
Member Author

@yfarjoun? I'd like to merge this this week if possible.

@yfarjoun
Copy link
Contributor

yfarjoun commented May 5, 2020

sorry for the holdup 👍 from me.

@jmarshall jmarshall merged commit 31d6e44 into samtools:master May 5, 2020
@github-pages github-pages bot temporarily deployed to github-pages May 5, 2020 14:02 Inactive
@jmarshall jmarshall deleted the uint32ranges branch May 5, 2020 14:31
@jmarshall
Copy link
Member Author

Thanks all; merged now (and SAMv1.pdf rebuilt with this and @RG-PL:DNBSEQ).

I have the 4 GiB l_text proposed change stashed separately and might raise it as its own PR one day — but there's no hurry 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants