-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jkbonfield vcf int64 #1003
Jkbonfield vcf int64 #1003
Conversation
Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second part (BCF output silently being broken). Fixes samtools/bcftools#1123
This commit addresses several problems introduced by the recently added support of 64-bit coordinates: - 64-bit positions were silently modified on BCF output - any erroneous out-of-range INFO value resulted in unparseable BCF output (in contrast to the previous behavior of a silent under/overflow) - conversion between VCF and BCF is no longer possible Large genomes is a niche area and can be easily worked around by splitting the largest chromosome into two. This commit makes the compilation of 64-bit values optional, favoring the interroperability of BCF with VCF, a feature heavily relied on in many pipelines. The support of large genomes in this commit is limited to POS and INFO tags defined as Number=1 (motivated by INFO/END), all other extreme values are replaced with a missing value ("."). This changes the original ad-hoc approach which supported first field of any INFO tag array. The default 32-bit mode adds checks to prevent under/overflow of INFO and FORMAT integer values and prints a warning when extreme values are replaced with a missing value. An additional check is added to prevent overflow of 64-bit coordinates on BCF output. Note that the 64-bit support is not compiled in by default because: - it breaks VCF/BCF specification - it works only for VCF but cannot produce a functional BCF - there is currently no front-end capable of indexing VCFs with 64bit coordinates (n_lvls is not accessible via `bcftools index`) Note that a full 64-bit support for BCF would require significantly more work and large changes of the codebase, API, and the BCF specification, therefore it is not planned at the moment. The limited POS and INFO/END support for VCF can be compiled in by editing htslib/vcf.c and setting #define ALLOW_INT64 1
My commit fails on several trivial log issues and one major issue: 64-bit positions are not compiled in by default. I made it this way because large positions cannot be stored in BCF but full compatibility of VCF and BCF was an essential part of the design goal. I would rather live without 64-bit positions than give up VCF/BCF interoperability for two reasons: 1) The number of large genomes that will ever be processes is negligible compared to normal sized genomes and 2) large chromosome can always be split into two. |
@@ -1411,7 +1411,7 @@ static inline int64_t bcf_dec_int1(const uint8_t *p, int type, uint8_t **q) | |||
*q = (uint8_t*)p + 4; | |||
return le_to_i32(p); | |||
} else if (type == BCF_BT_INT64) { | |||
*q = (uint8_t*)p + 4; | |||
*q = (uint8_t*)p + 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an additional bug to the one I found. I wonder why it wasn't breaking things. Is this function only used when decoding the BCF file itself, in which case it's not triggerable as it's not actually a valid thing to have in the file?
(But still worth fixing as clearly the code is incorrect.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was breaking things when reading BCF_BT_INT64 values from an unofficial 64-bit flavor of BCF.
I don't agree with the direction this PR is going. If we don't enable If we do enable it, which means editing code (please don't put the #define in the code direct as it forbids defining CPPFLAGS on the configure command line) then we get 64-bit vcf working and 64-bit BCF which breaks the spec and causes other issues. Neither of those scenarios is what we want IMO. |
My alternative to this is #1004 It's also trivial (overloading It also addresses my concern with this PR in that it explicitly forbids creating non-conforming BCF instead of silently extending the file format to be bcftools' own improved flavour of BCF. However I still contend that for this immediate release we can and should get away with the absolute minimal change of unbreaking the VCF parser and doing nothing more. |
I am happy with the #1004 approach of disallowing printing out invalid BCFs. It should be the non-default option though and add the conversion to missing values and warnings for the default case as is done here. |
vcf.c
Outdated
- cannot output 64-bit BCF and if it does, it is not compatible with anything | ||
- experimental, use at your risk | ||
*/ | ||
#define VCF_ALLOW_INT64 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace this with a comment and use #ifdef instead.
and temporarily comment out failing tests which assume 64 bit positions are compiled in by default.
{ | ||
errno = 0; | ||
char *te; | ||
long int tmp_val = strtol(t, &te, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be
int64_t tmp_val = strtoll(t, &te, 10);
so 32-bit systems are still doing the check with64-bit quantities. This is how code elsewhere works. (Eg see the v->pos assignment).
(I can amend this myself though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For FORMAT fields I kept and agree with the original @daviesrob's approach of not supporting 64-bit values. We try to express FORMAT fields using int8_t whenever possible because this is the size bottleneck for VCF, would not like to encourage 64-bit values there at all.
(Having said that, of course, there is the PS tag which can be any number but usually would refer to a genomic coordinate and therefore could be 64-bit in principle. However, this can wait as it is not in widespread use for human data, not speaking about the hypothetical large genomes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok fair enough. I don't know what's supported and what isn't, but if it's just permitted as 32-bit only then it's fine. Thanks.
It's not liking "." (edit: only if I build with -DVCF_ALLOW_INT64). I'll try and debug it after lunch.
If I use MPOS=100 it's fine. |
You are right. This happens when 64-bit goodness is compiled in and BCF produced on output: missing values are now represented as |
Now we're back to square one almost. Bcftools is the normal build and bcftools64 is with -DVCF_ALLOW_INT64.
The two are no longer interchangeable as we've invented a different BCF binary output. |
I'm inclined to think we should just comment out all the int64 stuff, or at least remove the documentation from Makefile and README.large_positions.m4 and delay it to the next release as it's obviously proving hard to get working without breaking the file format. |
I pushed a quick fix, but it is wrong, I realize. |
There are many combinations to test and the bcftools test harness cannot work with two versions. I hope I managed to test all cases now. |
Copied from samtools/hts-specs#460. Sorry, I'm suffering with a cold and will probably head home soon. Try again... I'm now thinking this:
Basically internally it uses |
My test file for the above is:
If you find -2147483648 drops out then you have a bug still. |
Yes. I believe it is fixed now. The problem was with feeding |
Comparing my edited code with your last push, I think they're now doing the same thing via different routes. |
- bug fix in `bcf_dec_typed_int1_safe()`, 64-bit values could not have been read correctly by this function - cosmetic change to declare val1 always as int64_t and explicit cast to int32_t and add explicit LL to a 64-bit constant in case some compilers were fussy
Merged as ca2f667 |
No description provided.