regions has no effect when streaming BCF #248
Hi @cmdoret, thanks for the detailed issue. I had a look into this, and I believe it is expected behaviour. The sample file is small, so compressing it with BGZF produces only one block. Since htsget must return a valid file (i.e. byte data cannot be cut across BGZF blocks), any request against that file will always return the full file, as there is only one BGZF block to return.

This can be confirmed by creating a GZI index for the file and reading its contents:

    bgzip -r abc.bcf
    hexdump -C abc.bcf.gzi

Which shows 0 entries, indicating there is only one block:

    00000000  00 00 00 00 00 00 00 00  |........|
    00000008

Unfortunately, I don't think there's a nicer way to read GZI files using command line tools, although you can use noodles:

    fn main() {
        let index = noodles::bgzf::gzi::read("abc.bcf.gzi").unwrap();
        println!("{}", index.len());
    }

Which prints the number of entries in the index.

If you try a different file with more BGZF blocks (e.g. the htsnexus_test_NA12878.bam.gzi file), it should show more blocks:

    fn main() {
        let index = noodles::bgzf::gzi::read("data/bam/htsnexus_test_NA12878.bam.gzi").unwrap();
        println!("{}", index.len());
    }

Which prints the number of entries for that file.

In effect, this number determines the maximum number of "slices" that htsget can return for a given file. Note that the segments in the logs refer to the HTTP request itself, i.e. the number of segments in the path, and don't mean anything in terms of the underlying BCF file. Apologies, the logs are currently a bit noisy, which I'm aiming to improve (see #250).

I think that this issue is related to #238. This feature is currently not supported, but there are plans to implement a mode that allows the htsget server to edit out data that was not requested. Is this a feature that you would be interested in?
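
For readers who want to inspect a GZI file without pulling in noodles, the layout can be parsed with the standard library alone. This is a minimal sketch, not the htsget-rs or noodles implementation. It assumes the layout written by `bgzip -r`: a little-endian u64 entry count followed by that many (compressed offset, uncompressed offset) u64 pairs. The names `parse_gzi` and `block_count` are hypothetical, and the "entries + 1" rule assumes the first block at offset 0 is implicit, which matches the "0 entries means one block" reading above.

```rust
use std::io;

/// One GZI entry: (compressed offset, uncompressed offset) of a block start.
type Entry = (u64, u64);

/// Parse the raw bytes of a `.gzi` file (hypothetical helper).
/// Assumed layout: little-endian u64 count, then `count` pairs of u64.
fn parse_gzi(bytes: &[u8]) -> io::Result<Vec<Entry>> {
    // Read one little-endian u64 at byte offset `at`, or fail if truncated.
    let read_u64 = |at: usize| -> io::Result<u64> {
        let slice = bytes.get(at..at + 8).ok_or_else(|| {
            io::Error::new(io::ErrorKind::UnexpectedEof, "truncated GZI data")
        })?;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(slice);
        Ok(u64::from_le_bytes(buf))
    };

    let count = read_u64(0)? as usize;
    let mut entries = Vec::new();
    for i in 0..count {
        let base = 8 + i * 16;
        entries.push((read_u64(base)?, read_u64(base + 8)?));
    }
    Ok(entries)
}

/// Assumption: the first block (offset 0) is not stored in the index,
/// so N entries describe N + 1 blocks.
fn block_count(entries: &[Entry]) -> usize {
    entries.len() + 1
}

fn main() -> io::Result<()> {
    // With a path argument, read that file; otherwise fall back to the
    // eight zero bytes shown in the hexdump above.
    let bytes = match std::env::args().nth(1) {
        Some(path) => std::fs::read(path)?,
        None => vec![0u8; 8],
    };
    let entries = parse_gzi(&bytes)?;
    println!("{} entries, {} block(s)", entries.len(), block_count(&entries));
    Ok(())
}
```

Run without arguments, it parses the eight zero bytes from the hexdump. Passing a path such as abc.bcf.gzi reads that file instead.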
Thanks for your explanation, @mmalenic. Indeed, this does not seem to happen on larger files (feel free to close the issue). I understand, and that makes sense. In that case we will filter on the client side. As we are not yet dealing with crypt4gh, this is not a priority for us right now. But we would definitely be interested in the future, and I think this would indeed make htsget-rs very interesting for health-related projects. I'd love to contribute, but will need to step up my Rust knowledge by then 😅
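
Client-side filtering of the kind mentioned above could look roughly like the following. This is a sketch under stated assumptions, not something taken from htsget-rs: it assumes the returned BCF has already been decoded to VCF text by some other tool, and the helpers `keep_line` and `filter_vcf` are hypothetical names. Positions are treated as 1-based and inclusive, as in the VCF POS column.

```rust
/// Hypothetical helper: decide whether one line of VCF text is kept.
/// Header lines (starting with '#') are always kept.
/// `range` is an optional 1-based inclusive (start, end) on POS.
fn keep_line(line: &str, contig: &str, range: Option<(u64, u64)>) -> bool {
    if line.starts_with('#') {
        return true;
    }
    // CHROM and POS are the first two tab-separated columns of a VCF record.
    let mut fields = line.split('\t');
    let (chrom, pos) = match (fields.next(), fields.next()) {
        (Some(c), Some(p)) => (c, p),
        _ => return false, // malformed record: drop it
    };
    if chrom != contig {
        return false;
    }
    match (range, pos.parse::<u64>()) {
        (None, _) => true,
        (Some((start, end)), Ok(p)) => start <= p && p <= end,
        (Some(_), Err(_)) => false,
    }
}

/// Keep the header plus the records on `contig` (and within `range`, if given).
fn filter_vcf<'a>(text: &'a str, contig: &str, range: Option<(u64, u64)>) -> Vec<&'a str> {
    text.lines()
        .filter(|line| keep_line(line, contig, range))
        .collect()
}

fn main() {
    // Made-up records for illustration only.
    let vcf = "##fileformat=VCFv4.3\n\
               #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n\
               19\t111\t.\tA\tC\t.\t.\t.\n\
               20\t14370\t.\tG\tA\t.\t.\t.\n";
    for line in filter_vcf(vcf, "20", None) {
        println!("{}", line);
    }
}
```

Because the server can only return whole BGZF blocks, a filter like this is still useful on larger files, where the blocks at the edges of a region may contain records outside it.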
Hello and thanks for providing this excellent implementation of htsget!
We are trying to use it together with a minio S3 storage, and so far it has worked well. However, we noticed that when requesting a specific region, the /variants endpoint returned all records regardless. I believe this is a bug in htsget-rs based on the server logs, but I may also have misinterpreted them or misused htsget-rs. Do you have any advice or suggestions on where the issue might be?
Environment:
Steps to reproduce:
Observed behaviour: Both queries returned all variant records from all contigs.
Expected behaviour: Only variants of the requested chromosome are returned.
Observations:
Log from the contig 19 request shows that the query was parsed properly, and that segments (10,16) were requested.
Logs from the contig 20 query show that the same segments (10,16) were requested, although the query is different. I am not sure whether I interpret this properly.