Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jkbonfield vcf int64 #1003

Closed
wants to merge 9 commits into from
Closed

Jkbonfield vcf int64 #1003

wants to merge 9 commits into from

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Dec 10, 2019

No description provided.

jkbonfield and others added 2 commits December 9, 2019 14:17
Any 64-bit INFO field that wasn't the last in the list would cause
subsequent fields to be decoded incorrectly.

This commit fixes that, plus updates the tests accordingly so the bug
could be triggered.

Fixes the first part of samtools#999 (test1.vcf), but doesn't fix the second
part (BCF output silently being broken).

Fixes samtools/bcftools#1123
This commit addresses several problems introduced by the recently added
support of 64-bit coordinates:
- 64-bit positions were silently modified on BCF output
- any erroneous out-of-range INFO value resulted in unparseable BCF
  output (in contrast to the previous behavior of a silent under/overflow)
- conversion between VCF and BCF is no longer possible

Large genomes is a niche area and can be easily worked around by splitting
the largest chromosome into two. This commit makes the compilation of 64-bit
values optional, favoring the interroperability of BCF with VCF, a feature
heavily relied on in many pipelines.

The support of large genomes in this commit is limited to POS and INFO tags
defined as Number=1 (motivated by INFO/END), all other extreme values
are replaced with a missing value ("."). This changes the original ad-hoc
approach which supported first field of any INFO tag array.

The default 32-bit mode adds checks to prevent under/overflow of INFO and
FORMAT integer values and prints a warning when extreme values are replaced
with a missing value. An additional check is added to prevent overflow of
64-bit coordinates on BCF output.

Note that the 64-bit support is not compiled in by default because:
- it breaks VCF/BCF specification
- it works only for VCF but cannot produce a functional BCF
- there is currently no front-end capable of indexing VCFs with 64bit
  coordinates (n_lvls is not accessible via `bcftools index`)

Note that a full 64-bit support for BCF would require significantly more
work and large changes of the codebase, API, and the BCF specification,
therefore it is not planned at the moment.

The limited POS and INFO/END support for VCF can be compiled in by
editing htslib/vcf.c and setting
    #define ALLOW_INT64 1
@pd3
Copy link
Member Author

pd3 commented Dec 10, 2019

My commit fails on several trivial log issues and one major issue: 64-bit positions are not compiled in by default. I made it this way because large positions cannot be stored in BCF but full compatibility of VCF and BCF was an essential part of the design goal. I would rather live without 64-bit positions than give up VCF/BCF interoperability for two reasons: 1) The number of large genomes that will ever be processes is negligible compared to normal sized genomes and 2) large chromosome can always be split into two.

@pd3 pd3 mentioned this pull request Dec 11, 2019
@@ -1411,7 +1411,7 @@ static inline int64_t bcf_dec_int1(const uint8_t *p, int type, uint8_t **q)
*q = (uint8_t*)p + 4;
return le_to_i32(p);
} else if (type == BCF_BT_INT64) {
*q = (uint8_t*)p + 4;
*q = (uint8_t*)p + 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an additional bug to the one I found. I wonder why it wasn't breaking things. Is this function only used when decoding the BCF file itself, in which case it's not triggerable as it's not actually a valid thing to have in the file?

(But still worth fixing as clearly the code is incorrect.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was breaking things when reading BCF_BT_INT64 values from an unofficial 64-bit flavor of BCF.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 11, 2019

I don't agree with the direction this PR is going.

If we don't enable VCF_ALLOW_INT64 then it forbids END from coping with 64-bit quantities and we don't get he ability to work with large chromosomes in vcf.gz (which is, let's be honest, the dominant variant format).

If we do enable it, which means editing code (please don't put the #define in the code direct as it forbids defining CPPFLAGS on the configure command line) then we get 64-bit vcf working and 64-bit BCF which breaks the spec and causes other issues.

Neither of those scenarios is what we want IMO.

@jkbonfield
Copy link
Contributor

My alternative to this is #1004

It's also trivial (overloading bcf1_t->unpacked, but I couldn't find any other adequate place to add flag data that doesn't change the ABI) which gives more confidence that in a knee-jerk reaction to fix one bug we're not introducing several others - although it obviously needs scrutiny still.

It also addresses my concern with this PR in that it explicitly forbids creating non-conforming BCF instead of silently extending the file format to be bcftools' own improved flavour of BCF. However I still contend that for this immediate release we can and should get away with the absolute minimal change of unbreaking the VCF parser and doing nothing more.

@pd3 pd3 mentioned this pull request Dec 11, 2019
@pd3
Copy link
Member Author

pd3 commented Dec 11, 2019

I am happy with the #1004 approach of disallowing printing out invalid BCFs. It should be the non-default option though and add the conversion to missing values and warnings for the default case as is done here.

vcf.c Outdated
- cannot output 64-bit BCF and if it does, it is not compatible with anything
- experimental, use at your risk
*/
#define VCF_ALLOW_INT64 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace this with a comment and use #ifdef instead.

and temporarily comment out failing tests which assume 64 bit positions
are compiled in by default.
{
errno = 0;
char *te;
long int tmp_val = strtol(t, &te, 10);
Copy link
Contributor

@jkbonfield jkbonfield Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be

int64_t tmp_val = strtoll(t, &te, 10);

so 32-bit systems are still doing the check with64-bit quantities. This is how code elsewhere works. (Eg see the v->pos assignment).

(I can amend this myself though.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For FORMAT fields I kept and agree with the original @daviesrob's approach of not supporting 64-bit values. We try to express FORMAT fields using int8_t whenever possible because this is the size bottleneck for VCF, would not like to encourage 64-bit values there at all.

(Having said that, of course, there is the PS tag which can be any number but usually would refer to a genomic coordinate and therefore could be 64-bit in principle. However, this can wait as it is not in widespread use for human data, not speaking about the hypothetical large genomes.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fair enough. I don't know what's supported and what isn't, but if it's just permitted as 32-bit only then it's fine. Thanks.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 13, 2019

It's not liking "." (edit: only if I build with -DVCF_ALLOW_INT64). I'll try and debug it after lunch.

$ ./bcftools view _5.vcf
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-6-g2965196
##bcftools_viewCommand=view _5.vcf; Date=Fri Dec 13 12:48:08 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=.

$ ./bcftools view -Ou _5.vcf
[E::bcf_write] Data contains 64-bit values not representable in BCF.  Please use VCF instead
[main_vcfview] Error: cannot write to (null)

If I use MPOS=100 it's fine.

@pd3
Copy link
Member Author

pd3 commented Dec 13, 2019

You are right. This happens when 64-bit goodness is compiled in and BCF produced on output: missing values are now represented as bcf_int64_missing which makes them 64-bit values. Replacing with bcf_int32_missing should fix the problem. Will push a fix in a minute.

@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 13, 2019

Now we're back to square one almost. Bcftools is the normal build and bcftools64 is with -DVCF_ALLOW_INT64.

$ bcftools view -Ou _5.vcf | bcftools view
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-7-g5073db3-dirty
##bcftools_viewCommand=view -Ou _5.vcf; Date=Fri Dec 13 14:05:00 2019
##bcftools_viewCommand=view; Date=Fri Dec 13 14:05:00 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=.

# Edit: the above output also matches the input - it hasn't replaced any value with dot.

$ ./bcftools64 view -Ou _5.vcf | bcftools view
[W::bcf_record_check] Bad BCF record: Invalid INFO type 4 (unknown)
Error: BCF read error
...

$ ./bcftools64 view -Ou _5.vcf | ./bcftools64 view
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
##bcftools_viewVersion=1.10-4-g3abf9b8+htslib-1.10-7-g5073db3-dirty
##bcftools_viewCommand=view -Ou _5.vcf; Date=Fri Dec 13 14:06:16 2019
##bcftools_viewCommand=view; Date=Fri Dec 13 14:06:16 2019
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	1	.	G	C	.	.	MPOS=-2147483648

The two are no longer interchangeable as we've invented a different BCF binary output.

@jkbonfield
Copy link
Contributor

I'm inclined to think we should just comment out all the int64 stuff, or at least remove the documentation from Makefile and README.large_positions.m4 and delay it to the next release as it's obviously proving hard to get working without breaking the file format.

@pd3
Copy link
Member Author

pd3 commented Dec 13, 2019

I pushed a quick fix, but it is wrong, I realize.

@pd3
Copy link
Member Author

pd3 commented Dec 13, 2019

There are many combinations to test and the bcftools test harness cannot work with two versions. I hope I managed to test all cases now.

@jkbonfield
Copy link
Contributor

Copied from samtools/hts-specs#460. Sorry, I'm suffering with a cold and will probably head home soon.

Try again... I'm now thinking this:

diff --git a/vcf.c b/vcf.c
index 025fa0f..beb7a64 100644
--- a/vcf.c
+++ b/vcf.c
@@ -2720,69 +2720,73 @@ int vcf_parse(kstring_t *s, const bcf_hdr_t *h, bcf1_t *v)
                                 hts_log_error("Could not allocate memory");
                                 v->errcode |= BCF_ERR_LIMITS; // No appropriate code?
                                 goto err;
                             }
                             max_n_val = n_val;
                             val_a = z;
                         }
                         if ((y>>4&0xf) == BCF_HT_INT) {
                             i = 0, t = val;
 #ifdef VCF_ALLOW_INT64
                             int64_t val1;
                             if ( n_val==1 )
                             {
                                 errno = 0;
                                 long long int tmp_val = strtoll(val, &te, 10);
-                                if ( te==val ) tmp_val = bcf_int32_missing;
+                                if ( te==val ) tmp_val = bcf_int64_missing;
                                 else if ( te==val || errno!=0 || tmp_val<BCF_MIN_BT_INT64 || tmp_val>BCF_MAX_BT_INT64 )
                                 {
                                     if ( !extreme_int_warned )
                                     {
                                         hts_log_warning("Extreme INFO/%s value encountered and set to missing at %s:%"PRIhts_pos,key,bcf_seqname(h,v), v->pos+1);
                                         extreme_int_warned = 1;
                                     }
-                                    tmp_val = bcf_int32_missing;
+                                    tmp_val = bcf_int64_missing;
                                 }
                                 val1 = tmp_val;
                                 t = te;
                                 i = 1;  // this is just to avoid adding another nested block...
                             }
 #else
                             int32_t val1;
 #endif
                             for (; i < n_val; ++i, ++t)
                             {
                                 errno = 0;
                                 long int tmp_val = strtol(t, &te, 10);
                                 if ( te==t ) tmp_val = bcf_int32_missing;
                                 else if ( errno!=0 || tmp_val<BCF_MIN_BT_INT32 || tmp_val>BCF_MAX_BT_INT32 )
                                 {
                                     if ( !extreme_int_warned )
                                     {
                                         hts_log_warning("Extreme INFO/%s value encountered and set to missing at %s:%"PRIhts_pos,key,bcf_seqname(h,v), v->pos+1);
                                         extreme_int_warned = 1;
                                     }
                                     tmp_val = bcf_int32_missing;
                                 }
                                 val_a[i] = tmp_val;
                                 for (t = te; *t && *t != ','; t++);
                             }
                             if (n_val == 1) {
 #ifdef VCF_ALLOW_INT64
-                                if ( val1<INT32_MIN || val1>BCF_MAX_BT_INT32 )
-                                    v->unpacked |= BCF_IS_64BIT;
-                                bcf_enc_long1(str, val1);
+                                if (val1 == bcf_int64_missing) {
+                                    bcf_enc_int1(str, bcf_int32_missing);
+                                } else {
+                                    if ( val1<INT32_MIN || val1>BCF_MAX_BT_INT32 )
+                                        v->unpacked |= BCF_IS_64BIT;
+                                    bcf_enc_long1(str, val1);
+                                }
 #else
                                 val1 = val_a[0];
                                 bcf_enc_int1(str, val1);
 #endif
                             } else {
                                 bcf_enc_vint(str, n_val, val_a, -1);
                             }
                             if (n_val==1 && strcmp(key, "END") == 0)
                                 v->rlen = val1 - v->pos;
                         } else if ((y>>4&0xf) == BCF_HT_REAL) {
                             float *val_f = (float *)val_a;
                             for (i = 0, t = val; i < n_val; ++i, ++t)
                             {
                                 val_f[i] = strtod(t, &te);
                                 if ( te==t ) // conversion failed

Basically internally it uses bcf_int64_missing for missing values or out of range values, but when writing out it'll encode as a 32-bit bcf_int32_missing instead. This avoids the need of using a 64-bit missing value, but for the parser it avoids mixing up -2147483648 with bcf_int32_missing.

@jkbonfield
Copy link
Contributor

My test file for the above is:

##fileformat=VCFv4.2
##INFO=<ID=END,Number=A,Type=Integer,Description="dummy">
##INFO=<ID=MPOS,Number=A,Type=Integer,Description="dummy">
##FORMAT=<ID=XX,Number=A,Type=Integer,Description="dummy">
##contig=<ID=chr1,length=248956422>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1
chr1	1	.	G	C	.	.	END=42949672,1,5000000000;MPOS=100000000	XX	1,-2147483640,-2147483641,-2147483647,-2147483648,2147483647,2147483648
chr1	1	.	G	C	.	.	END=-2147483640	XX	-2147483640
chr1	1	.	G	C	.	.	END=-2147483641	XX	-2147483641
chr1	1	.	G	C	.	.	END=-2147483647	XX	-2147483647
chr1	1	.	G	C	.	.	END=-2147483648	XX	-2147483648
chr1	1	.	G	C	.	.	END=-2147483649	XX	-2147483649

If you find -2147483648 drops out then you have a bug still.

@pd3
Copy link
Member Author

pd3 commented Dec 13, 2019

Yes. I believe it is fixed now. The problem was with feeding bcf_enc_long1 the missing value, it was interpreted as a valid 64-bit number.

@jkbonfield
Copy link
Contributor

Comparing my edited code with your last push, I think they're now doing the same thing via different routes.

- bug fix in `bcf_dec_typed_int1_safe()`, 64-bit values could not have been read
  correctly by this function

- cosmetic change to declare val1 always as int64_t and explicit cast to int32_t
  and add explicit LL to a 64-bit constant in case some compilers were fussy
@valeriuo
Copy link
Contributor

Merged as ca2f667

@valeriuo valeriuo closed this Dec 16, 2019
@pd3 pd3 deleted the jkbonfield-vcf_int64 branch February 28, 2023 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants