@@ -0,0 +1,137 @@
use std::io::{self, Read, Seek};

use noodles_bgzf as bgzf;
use noodles_tabix::index::reference_sequence::bin::Chunk;

use crate::{record::Chromosome, Record};

use super::Reader;

enum State {
    Seek,
    Read(bgzf::VirtualPosition),
    End,
}

/// An iterator over records of a BAM reader that intersects a given region.
///
/// This is created by calling [`Reader::query`].
pub struct Query<'a, R>
where
    R: Read + Seek,
{
    reader: &'a mut Reader<bgzf::Reader<R>>,
    chunks: Vec<Chunk>,
    reference_sequence_name: String,
    start: i32,
zaeleus
Author
Owner
Format | Field name | Type/range |
---|---|---|
SAM 1.6 | POS | [0, 2^31-1] |
BAM 1.6 | pos | int32_t |
CRAM 3.0 | alignment start | itf8 (int32) |
VCF 4.3 | POS | Integer (32-bit, signed) |
BCF 2.2 | POS | int32_t |
I think most (all?) publicly accessible types or conversions in noodles return what's defined in each spec. It creates a good mapping to the reference specs, but I can see this being confusing to an end user when types don't semantically make sense. I'm not sure if diverging from the spec descriptions is preferable in this case; regardless of type, there will always be a required bounds check, whether it's i32 > 0 or u32 < 2^31.
Since this is in the context of querying, one interesting limitation of the index formats is that, with the default minimum interval size (~16 Kbp), the highest supported genomic position is actually 2^29-1, not 2^31-1!
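As a rough illustration of where the 2^29-1 figure comes from, here is a minimal sketch. It assumes the usual binning-index parameters (a minimum interval of 2^14 bases, i.e. ~16 Kbp, and 5 levels below the root bin); the function name `max_indexed_position` is hypothetical and not part of noodles.

```rust
/// Upper bound on the genomic position a binning index can address,
/// assuming the index spans 2^(min_shift + 3 * depth) bases.
///
/// `min_shift` is log2 of the smallest interval size and `depth` is the
/// number of bin levels below the root. Both names are illustrative.
fn max_indexed_position(min_shift: u32, depth: u32) -> u64 {
    (1u64 << (min_shift + 3 * depth)) - 1
}
```

With `min_shift = 14` and `depth = 5` this gives 2^29-1 = 536,870,911, which is well below `i32::MAX` (2^31-1), so under these assumptions the index, not the position type, is the tighter limit when querying.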
> Also wondering to which extent we want to expose Query's start/end struct fields:
I would recommend copying how the inputs are built for Query rather than relying on the fields of the iterator.
jkbonfield
May 12, 2021
Note using unsigned doesn't buy you much when you have fields like template length in BAM which by design is +/- (it's an addition to alignment start/end to get to the other end). So you gain double the range with POS by making it unsigned, but then another part of the spec promptly breaks. :-(
I suspect also many of these POS fields are signed so they can use -1 as an out of bounds value for error or unknown, which isn't so unreasonable.
Really these should have been (signed) int64, but when the formats were designed people only cared about human chromosomes. Internally htslib uses int64 now, so that SAM at least can work on large chromosomes, even if the binary formats cannot. I don't know if we fixed VCF though. That may be a consideration for noodles too perhaps.
brainstorm
May 18, 2021
@jkbonfield, thanks for chiming in, always good to have you watching :)... I had also totally forgotten our discussion on samtools/hts-specs#460, thanks for the refresher ;)
brainstorm
May 21, 2021
> I would recommend copying how the inputs are built for Query rather than relying on the fields of the iterator.
@zaeleus Do you mean using resolve_region() instead of query()? I'm not sure I grok what you mean... in my mind I would just go through the iterator returned by query, fetching Result<Query, E>'s start & end for each element and then building a Vec<ByteRanges> and call build_response()?
zaeleus
May 21, 2021
Author
Owner
Perhaps you're confusing the start and end fields in Query? Those are the start and end positions in the region, not a byte range in the file. To get the byte positions, you're interested in querying the index and iterating through all the chunks. This is why I think you don't actually need to call vcf::Reader::query and build a Query iterator.
@zaeleus @chris-zen @mmalenic @victorskl, would it make sense to have start/end as u64 instead of i32 here in the noodles Query struct? I guess the signedness comes from some edge case in the specs (as in encoding some metadata with -1 or similar?). In my opinion start/end should be u64, because genomic positions should be inherently positive and not limited to 32 bits.
The reason I'm asking is that I'm trying to get the start/end fields returned from query and put them back into our htsget-rs Vec<BytesRange>... here's some pseudocode coming from the /variants branch:
And while getting the i32 -> u64 compiler errors, I thought about this.
Also wondering to which extent we want to expose Query's start/end struct fields: to accommodate the use case above or whether there are better ways to do this that don't break encapsulation.
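For context on the i32 -> u64 compiler errors: Rust has no implicit integer widening from a signed to an unsigned type, so the conversion has to be written out. The sketch below uses the standard library's `TryFrom` to do this with an explicit bounds check; the function names are hypothetical and not taken from noodles or htsget-rs.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Converts a spec-style signed 32-bit position to u64.
/// Fails for negative values such as a -1 "unknown" sentinel.
fn position_to_u64(position: i32) -> Result<u64, TryFromIntError> {
    u64::try_from(position)
}

/// Converts back to the spec-style i32.
/// Fails for values above i32::MAX (2^31-1).
fn position_to_i32(position: u64) -> Result<i32, TryFromIntError> {
    i32::try_from(position)
}
```

Either direction needs a check. Using `as` casts instead would compile, but would silently wrap a negative value into a very large u64.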