For calculating e-values or other purposes it is often necessary to query the text size of a (bi) FM index.
I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate it myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored somewhere a list of sequence names, so I know the value nseq = number of indexed sequences. Then for nseq > 1, I can compute the text size nchar = index.size() - nseq. For nseq == 1 we have a special case with nchar = index.size() - 2 (because a single sequence has 2 sentinels).
I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that we have to keep track of the number of sequences stored in the index (which I could solve with the length of the names list).
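The calculation described above can be sketched as a small free function. This is only an illustration of the arithmetic, not an existing seqan3 API: the name `get_text_size` is taken from the suggestion, and the parameters `index_size` (the value returned by `index.size()`) and `nseq` (the number of indexed sequences, tracked by the caller) are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper, not part of seqan3.
// index_size: value of index.size(), which includes sentinels.
// nseq:       number of indexed sequences, tracked by the caller.
inline std::size_t get_text_size(std::size_t index_size, std::size_t nseq)
{
    if (nseq == 0)
        throw std::invalid_argument{"nseq must be at least 1"};

    // As observed in this issue: a single sequence carries 2 sentinels,
    // a collection of nseq > 1 sequences carries nseq sentinels.
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;

    if (index_size < sentinels)
        throw std::invalid_argument{"index_size is smaller than the sentinel count"};

    return index_size - sentinels;
}
```

A caller would pass something like `get_text_size(index.size(), names.size())`, where `names` is the separately stored list of sequence names.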
[number of texts] We store the text begin positions (text_begin), and we also have select and rank support for this vector. The number of texts would then be rank(text_begin, size()) // +1 ??. This should take constant time.
[number of texts] We know the number of texts during construction and could just store another size_t. This should be faster than rank, but will change the index serialisation.
[text size] Either store it, or have a function that does the nseq == 1 / nseq > 1 check.
Question: Do we also need the sizes of individual texts in the collection?
With text_begin as well as rank/select, we could determine the text size (text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one). This should also take constant time.
We know the text sizes and number of texts during construction. So, we could store the text_lengths in a vector. This might get "quite" big for big collections, and it also changes the index serialisation.
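The select-based idea can be sketched roughly as follows. This uses a plain `std::vector` of begin positions in place of the rank/select structure, so a lookup `begins[x]` stands in for `select(x, text_begin)`. The function name `text_length`, the parameter names, and the assumption that exactly one sentinel follows each text are all hypothetical and would need to be checked against the actual index layout.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical sketch, not seqan3 code.
// begins[x]:  start position of text x in the concatenated text
//             (stand-in for select(x, text_begin)).
// total_size: length of the concatenated text including sentinels.
inline std::size_t text_length(std::vector<std::size_t> const & begins,
                               std::size_t total_size,
                               std::size_t x)
{
    if (x >= begins.size())
        throw std::out_of_range{"text id out of range"};

    // Start of the next text, or the end of the concatenation for the last text.
    std::size_t const next = (x + 1 < begins.size()) ? begins[x + 1] : total_size;

    if (next <= begins[x])
        throw std::invalid_argument{"begin positions must be strictly increasing"};

    // Assumption: exactly one sentinel follows each text, hence the -1.
    return next - begins[x] - 1;
}
```

Under these assumptions the number of texts is simply `begins.size()`, which corresponds to the rank query mentioned above.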