Another way of storing the data would be to have each sketch bin stored as a dictionary, with the key being the 14 bits of the bin value (not transposed) and the value being the list of samples which had that bin value. Then I think you could do a fast distance query for a new sample by finding matching bins and adding the values from each match.
I think the efficiency of the 'adding the values from each match' would determine whether this is faster or slower than the default method here. Starting with sparse vectors of integers (i.e. just those samples where there is a match) probably makes sense.
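A minimal sketch of the reverse index idea, to make the 'adding the values from each match' step concrete. Everything here is an assumption for illustration, not the existing code: sketches are taken to be plain lists of integer bin values with one entry per bin position, and the names `build_reverse_index`, `query_matches` and `match_fraction` are hypothetical. The per-sample match counts are kept in a `Counter`, which acts as the sparse vector of integers (only samples with at least one match appear).

```python
from collections import Counter, defaultdict


def build_reverse_index(sketches):
    """Build one dictionary per bin position: bin value -> samples with that value.

    `sketches` is assumed to map sample name -> list of integer bin values,
    all of the same length (hypothetical layout, not the real on-disk format).
    """
    n_bins = len(next(iter(sketches.values())))
    index = [defaultdict(list) for _ in range(n_bins)]
    for sample, bins in sketches.items():
        for pos, value in enumerate(bins):
            index[pos][value].append(sample)
    return index


def query_matches(index, query_bins):
    """Count, for each indexed sample, how many bins match the query.

    Only samples with at least one matching bin appear in the result,
    so the Counter behaves like a sparse vector of integers.
    """
    matches = Counter()
    for pos, value in enumerate(query_bins):
        # .get() avoids inserting empty entries into the defaultdict
        for sample in index[pos].get(value, ()):
            matches[sample] += 1
    return matches


def match_fraction(index, query_bins):
    """Fraction of matching bins per sample (assumed here as the similarity)."""
    n_bins = len(index)
    return {s: m / n_bins for s, m in query_matches(index, query_bins).items()}


sketches = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
index = build_reverse_index(sketches)
matches = query_matches(index, [1, 2, 3, 9])
```

The query cost scales with the number of matching (bin, sample) pairs rather than with the number of samples times the number of bins, which is where any speed-up over the default method would have to come from.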
Find groups of queries which share a k-mer in a bin
Calculate distances from these to the centre (the longest sequence)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
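The clustering step in the quoted description can be sketched roughly as below. This is only an illustration under stated assumptions: the edges are taken to be an in-memory list of `(centre, member)` pairs rather than a file, `lengths` maps each sequence name to its length, ties in length are broken by name, and `greedy_cluster` is a hypothetical name.

```python
from collections import defaultdict


def greedy_cluster(lengths, edges):
    """Greedy clustering over validated centre -> member edges.

    `lengths`: dict of sequence name -> length (assumed input layout).
    `edges`: iterable of (centre, member) pairs; reverse edges are added here.
    Returns dict of representative -> sorted list of cluster members.
    """
    neighbours = defaultdict(set)
    for centre, member in edges:
        neighbours[centre].add(member)
        neighbours[member].add(centre)  # add the reverse edge

    # Sort by decreasing length; the name is an arbitrary tie-break
    order = sorted(lengths, key=lambda s: (-lengths[s], s))
    unassigned = set(lengths)
    clusters = {}
    for seq in order:
        if seq not in unassigned:
            continue  # already removed as a member of an earlier cluster
        # The top sequence plus everything still in the list sharing an edge
        members = {seq} | (neighbours[seq] & unassigned)
        unassigned -= members
        clusters[seq] = sorted(members)
    return clusters


lengths = {"s1": 100, "s2": 90, "s3": 80, "s4": 70}
edges = [("s1", "s2"), ("s2", "s3")]
clusters = greedy_cluster(lengths, edges)
```

Note that `s3` is not pulled into the `s1` cluster here: only direct edges to the representative count, so `s3` ends up as its own representative once `s2` has been removed.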
johnlees changed the title from "Try reverse index for queries" to "Try reverse index for queries/linclust with bins" on May 9, 2024