Jkbonfield vcf int64 #1003

pd3 · 2019-12-10T20:35:44Z

No description provided.

Any 64-bit INFO field that wasn't the last in the list would cause subsequent fields to be decoded incorrectly. This commit fixes that, plus updates the tests accordingly so the bug could be triggered. Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second part (BCF output silently being broken). Fixes samtools/bcftools#1123

This commit addresses several problems introduced by the recently added support of 64-bit coordinates: - 64-bit positions were silently modified on BCF output - any erroneous out-of-range INFO value resulted in unparseable BCF output (in contrast to the previous behavior of a silent under/overflow) - conversion between VCF and BCF is no longer possible Large genomes is a niche area and can be easily worked around by splitting the largest chromosome into two. This commit makes the compilation of 64-bit values optional, favoring the interroperability of BCF with VCF, a feature heavily relied on in many pipelines. The support of large genomes in this commit is limited to POS and INFO tags defined as Number=1 (motivated by INFO/END), all other extreme values are replaced with a missing value ("."). This changes the original ad-hoc approach which supported first field of any INFO tag array. The default 32-bit mode adds checks to prevent under/overflow of INFO and FORMAT integer values and prints a warning when extreme values are replaced with a missing value. An additional check is added to prevent overflow of 64-bit coordinates on BCF output. Note that the 64-bit support is not compiled in by default because: - it breaks VCF/BCF specification - it works only for VCF but cannot produce a functional BCF - there is currently no front-end capable of indexing VCFs with 64bit coordinates (n_lvls is not accessible via `bcftools index`) Note that a full 64-bit support for BCF would require significantly more work and large changes of the codebase, API, and the BCF specification, therefore it is not planned at the moment. The limited POS and INFO/END support for VCF can be compiled in by editing htslib/vcf.c and setting #define ALLOW_INT64 1

pd3 · 2019-12-10T20:59:41Z

My commit fails on several trivial log issues and one major issue: 64-bit positions are not compiled in by default. I made it this way because large positions cannot be stored in BCF but full compatibility of VCF and BCF was an essential part of the design goal. I would rather live without 64-bit positions than give up VCF/BCF interoperability for two reasons: 1) The number of large genomes that will ever be processes is negligible compared to normal sized genomes and 2) large chromosome can always be split into two.

jkbonfield · 2019-12-11T09:23:18Z

htslib/vcf.h

@@ -1411,7 +1411,7 @@ static inline int64_t bcf_dec_int1(const uint8_t *p, int type, uint8_t **q)
        *q = (uint8_t*)p + 4;
        return le_to_i32(p);
    } else if (type == BCF_BT_INT64) {
-        *q = (uint8_t*)p + 4;
+        *q = (uint8_t*)p + 8;


This is an additional bug to the one I found. I wonder why it wasn't breaking things. Is this function only used when decoding the BCF file itself, in which case it's not triggerable as it's not actually a valid thing to have in the file?

(But still worth fixing as clearly the code is incorrect.)

It was breaking things when reading BCF_BT_INT64 values from an unofficial 64-bit flavor of BCF.

jkbonfield · 2019-12-11T10:21:44Z

I don't agree with the direction this PR is going.

If we don't enable VCF_ALLOW_INT64 then it forbids END from coping with 64-bit quantities and we don't get he ability to work with large chromosomes in vcf.gz (which is, let's be honest, the dominant variant format).

If we do enable it, which means editing code (please don't put the #define in the code direct as it forbids defining CPPFLAGS on the configure command line) then we get 64-bit vcf working and 64-bit BCF which breaks the spec and causes other issues.

Neither of those scenarios is what we want IMO.

jkbonfield · 2019-12-11T12:27:11Z

My alternative to this is #1004

It's also trivial (overloading bcf1_t->unpacked, but I couldn't find any other adequate place to add flag data that doesn't change the ABI) which gives more confidence that in a knee-jerk reaction to fix one bug we're not introducing several others - although it obviously needs scrutiny still.

It also addresses my concern with this PR in that it explicitly forbids creating non-conforming BCF instead of silently extending the file format to be bcftools' own improved flavour of BCF. However I still contend that for this immediate release we can and should get away with the absolute minimal change of unbreaking the VCF parser and doing nothing more.

pd3 · 2019-12-11T15:18:39Z

I am happy with the #1004 approach of disallowing printing out invalid BCFs. It should be the non-default option though and add the conversion to missing values and warnings for the default case as is done here.

jkbonfield · 2019-12-12T12:32:10Z

vcf.c

+     - cannot output 64-bit BCF and if it does, it is not compatible with anything
+     - experimental, use at your risk
+*/
+#define VCF_ALLOW_INT64 0


Please replace this with a comment and use #ifdef instead.

and temporarily comment out failing tests which assume 64 bit positions are compiled in by default.

jkbonfield · 2019-12-13T12:43:02Z

vcf.c

+                    {
+                        errno = 0;
+                        char *te;
+                        long int tmp_val = strtol(t, &te, 10);


Should be

int64_t tmp_val = strtoll(t, &te, 10);

so 32-bit systems are still doing the check with64-bit quantities. This is how code elsewhere works. (Eg see the v->pos assignment).

(I can amend this myself though.)

For FORMAT fields I kept and agree with the original @daviesrob's approach of not supporting 64-bit values. We try to express FORMAT fields using int8_t whenever possible because this is the size bottleneck for VCF, would not like to encourage 64-bit values there at all.

(Having said that, of course, there is the PS tag which can be any number but usually would refer to a genomic coordinate and therefore could be 64-bit in principle. However, this can wait as it is not in widespread use for human data, not speaking about the hypothetical large genomes.)

Ok fair enough. I don't know what's supported and what isn't, but if it's just permitted as 32-bit only then it's fine. Thanks.

jkbonfield · 2019-12-13T12:49:00Z

It's not liking "." (edit: only if I build with -DVCF_ALLOW_INT64). I'll try and debug it after lunch.

$ ./bcftools view _5.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-6-g2965196
##bcftools_viewCommand=view _5.vcf; Date=Fri Dec 13 12:48:08 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=.

$ ./bcftools view -Ou _5.vcf
[E::bcf_write] Data contains 64-bit values not representable in BCF.  Please use VCF instead
[main_vcfview] Error: cannot write to (null)

If I use MPOS=100 it's fine.

pd3 · 2019-12-13T13:05:44Z

You are right. This happens when 64-bit goodness is compiled in and BCF produced on output: missing values are now represented as bcf_int64_missing which makes them 64-bit values. Replacing with bcf_int32_missing should fix the problem. Will push a fix in a minute.

jkbonfield · 2019-12-13T14:06:49Z

Now we're back to square one almost. Bcftools is the normal build and bcftools64 is with -DVCF_ALLOW_INT64.

$ bcftools view -Ou _5.vcf | bcftools view
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-7-g5073db3-dirty
##bcftools_viewCommand=view -Ou _5.vcf; Date=Fri Dec 13 14:05:00 2019
##bcftools_viewCommand=view; Date=Fri Dec 13 14:05:00 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=.

# Edit: the above output also matches the input - it hasn't replaced any value with dot.

$ ./bcftools64 view -Ou _5.vcf | bcftools view
[W::bcf_record_check] Bad BCF record: Invalid INFO type 4 (unknown)
Error: BCF read error
...

$ ./bcftools64 view -Ou _5.vcf | ./bcftools64 view
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-7-g5073db3-dirty
##bcftools_viewCommand=view -Ou _5.vcf; Date=Fri Dec 13 14:06:16 2019
##bcftools_viewCommand=view; Date=Fri Dec 13 14:06:16 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=-2147483648

The two are no longer interchangeable as we've invented a different BCF binary output.

jkbonfield · 2019-12-13T14:08:05Z

I'm inclined to think we should just comment out all the int64 stuff, or at least remove the documentation from Makefile and README.large_positions.m4 and delay it to the next release as it's obviously proving hard to get working without breaking the file format.

…d to BCF

pd3 · 2019-12-13T14:20:33Z

I pushed a quick fix, but it is wrong, I realize.

pd3 · 2019-12-13T14:32:24Z

There are many combinations to test and the bcftools test harness cannot work with two versions. I hope I managed to test all cases now.

jkbonfield · 2019-12-13T14:40:44Z

Copied from samtools/hts-specs#460. Sorry, I'm suffering with a cold and will probably head home soon.

Try again... I'm now thinking this:

diff --git a/vcf.c b/vcf.c
index 025fa0f..beb7a64 100644
--- a/vcf.c
+++ b/vcf.c
@@ -2720,69 +2720,73 @@ int vcf_parse(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v)
                                 hts_log_error("Could not allocate memory");
                                 v->errcode |= BCF_ERR_LIMITS; // No appropriate code?
                                 goto err;
                             }
                             max_n_val = n_val;
                             val_a = z;
                         }
                         if ((y>>4&0xf) == BCF_HT_INT) {
                             i = 0, t = val;
 #ifdef VCF_ALLOW_INT64
                             int64_t val1;
                             if ( n_val==1 )
                             {
                                 errno = 0;
                                 long long int tmp_val = strtoll(val, &te, 10);
-                                if ( te==val ) tmp_val = bcf_int32_missing;
+                                if ( te==val ) tmp_val = bcf_int64_missing;
                                 else if ( te==val || errno!=0 || tmp_val<BCF_MIN_BT_INT64 || tmp_val>BCF_MAX_BT_INT64 )
                                 {
                                     if ( !extreme_int_warned )
                                     {
                                         hts_log_warning("Extreme INFO/%s value encountered and set to missing at %s:%"PRIhts_pos,key,bcf_seqname(h,v), v->pos+1);
                                         extreme_int_warned = 1;
                                     }
-                                    tmp_val = bcf_int32_missing;
+                                    tmp_val = bcf_int64_missing;
                                 }
                                 val1 = tmp_val;
                                 t = te;
                                 i = 1;  // this is just to avoid adding another nested block...
                             }
 #else
                             int32_t val1;
 #endif
                             for (; i < n_val; ++i, ++t)
                             {
                                 errno = 0;
                                 long int tmp_val = strtol(t, &te, 10);
                                 if ( te==t ) tmp_val = bcf_int32_missing;
                                 else if ( errno!=0 || tmp_val<BCF_MIN_BT_INT32 || tmp_val>BCF_MAX_BT_INT32 )
                                 {
                                     if ( !extreme_int_warned )
                                     {
                                         hts_log_warning("Extreme INFO/%s value encountered and set to missing at %s:%"PRIhts_pos,key,bcf_seqname(h,v), v->pos+1);
                                         extreme_int_warned = 1;
                                     }
                                     tmp_val = bcf_int32_missing;
                                 }
                                 val_a[i] = tmp_val;
                                 for (t = te; *t && *t != ','; t++);
                             }
                             if (n_val == 1) {
 #ifdef VCF_ALLOW_INT64
-                                if ( val1<INT32_MIN || val1>BCF_MAX_BT_INT32 )
-                                    v->unpacked |= BCF_IS_64BIT;
-                                bcf_enc_long1(str, val1);
+                                if (val1 == bcf_int64_missing) {
+                                    bcf_enc_int1(str, bcf_int32_missing);
+                                } else {
+                                    if ( val1<INT32_MIN || val1>BCF_MAX_BT_INT32 )
+                                        v->unpacked |= BCF_IS_64BIT;
+                                    bcf_enc_long1(str, val1);
+                                }
 #else
                                 val1 = val_a[0];
                                 bcf_enc_int1(str, val1);
 #endif
                             } else {
                                 bcf_enc_vint(str, n_val, val_a, -1);
                             }
                             if (n_val==1 && strcmp(key, "END") == 0)
                                 v->rlen = val1 - v->pos;
                         } else if ((y>>4&0xf) == BCF_HT_REAL) {
                             float *val_f = (float *)val_a;
                             for (i = 0, t = val; i < n_val; ++i, ++t)
                             {
                                 val_f[i] = strtod(t, &te);
                                 if ( te==t ) // conversion failed

Basically internally it uses bcf_int64_missing for missing values or out of range values, but when writing out it'll encode as a 32-bit bcf_int32_missing instead. This avoids the need of using a 64-bit missing value, but for the parser it avoids mixing up -2147483648 with bcf_int32_missing.

jkbonfield · 2019-12-13T14:42:07Z

My test file for the above is:

##fileformat=VCFv4.2
##INFO=<ID=END,Number=A,Type=Integer,Description="dummy">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##FORMAT=<ID=XX,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	G	C	.	.	END=42949672,1,5000000000;MPOS=100000000	XX	1,-2147483640,-2147483641,-2147483647,-2147483648,2147483647,2147483648
chr1	1	.	G	C	.	.	END=-2147483640	XX	-2147483640
chr1	1	.	G	C	.	.	END=-2147483641	XX	-2147483641
chr1	1	.	G	C	.	.	END=-2147483647	XX	-2147483647
chr1	1	.	G	C	.	.	END=-2147483648	XX	-2147483648
chr1	1	.	G	C	.	.	END=-2147483649	XX	-2147483649

If you find -2147483648 drops out then you have a bug still.

pd3 · 2019-12-13T14:48:18Z

Yes. I believe it is fixed now. The problem was with feeding bcf_enc_long1 the missing value, it was interpreted as a valid 64-bit number.

jkbonfield · 2019-12-13T14:48:48Z

Comparing my edited code with your last push, I think they're now doing the same thing via different routes.

- bug fix in `bcf_dec_typed_int1_safe()`, 64-bit values could not have been read correctly by this function - cosmetic change to declare val1 always as int64_t and explicit cast to int32_t and add explicit LL to a 64-bit constant in case some compilers were fussy

valeriuo · 2019-12-16T15:56:52Z

Merged as ca2f667

jkbonfield and others added 2 commits December 9, 2019 14:17

pd3 mentioned this pull request Dec 11, 2019

64bit integers break VCF/BCF #999

Closed

Make the warning messages more informative

448de2e

jkbonfield reviewed Dec 11, 2019

View reviewed changes

pd3 mentioned this pull request Dec 11, 2019

Vcf int64 part2 #1004

Closed

Additional checks to prevent writing invalid BCF containing 64-bit data

31aa365

jkbonfield reviewed Dec 12, 2019

View reviewed changes

Allow compilation w/o modifying vcf.c

2965196

and temporarily comment out failing tests which assume 64 bit positions are compiled in by default.

jkbonfield reviewed Dec 13, 2019

View reviewed changes

Use int32 for missing values, otherwise BCF cannot be output

5073db3

Handle another case of forgotten 64-bit values that can pass unnotice…

aa0a090

…d to BCF

Distinguish between genuine missing value and out-of-range replacement

4929c3d

jmarshall mentioned this pull request Dec 13, 2019

Signedness for fields that should only represent positive values such as lengths samtools/hts-specs#460

Closed

valeriuo closed this Dec 16, 2019

pd3 deleted the jkbonfield-vcf_int64 branch February 28, 2023 13:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jkbonfield vcf int64 #1003

Jkbonfield vcf int64 #1003

pd3 commented Dec 10, 2019

pd3 commented Dec 10, 2019

jkbonfield Dec 11, 2019

pd3 Dec 11, 2019

jkbonfield commented Dec 11, 2019 •

edited

Loading

jkbonfield commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield Dec 12, 2019

jkbonfield Dec 13, 2019 •

edited

Loading

pd3 Dec 13, 2019

jkbonfield Dec 13, 2019

jkbonfield commented Dec 13, 2019 •

edited

Loading

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019 •

edited

Loading

jkbonfield commented Dec 13, 2019

pd3 commented Dec 13, 2019

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

valeriuo commented Dec 16, 2019

Jkbonfield vcf int64 #1003

Jkbonfield vcf int64 #1003

Conversation

pd3 commented Dec 10, 2019

pd3 commented Dec 10, 2019

jkbonfield Dec 11, 2019

Choose a reason for hiding this comment

pd3 Dec 11, 2019

Choose a reason for hiding this comment

jkbonfield commented Dec 11, 2019 • edited Loading

jkbonfield commented Dec 11, 2019

pd3 commented Dec 11, 2019

jkbonfield Dec 12, 2019

Choose a reason for hiding this comment

jkbonfield Dec 13, 2019 • edited Loading

Choose a reason for hiding this comment

pd3 Dec 13, 2019

Choose a reason for hiding this comment

jkbonfield Dec 13, 2019

Choose a reason for hiding this comment

jkbonfield commented Dec 13, 2019 • edited Loading

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019 • edited Loading

jkbonfield commented Dec 13, 2019

pd3 commented Dec 13, 2019

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

pd3 commented Dec 13, 2019

jkbonfield commented Dec 13, 2019

valeriuo commented Dec 16, 2019

jkbonfield commented Dec 11, 2019 •

edited

Loading

jkbonfield Dec 13, 2019 •

edited

Loading

jkbonfield commented Dec 13, 2019 •

edited

Loading

jkbonfield commented Dec 13, 2019 •

edited

Loading