Try reverse index for queries/linclust with bins #11

johnlees · 2024-04-21T09:42:01Z

Another way of storing the data would be to store each sketch bin as a dictionary, with the key being the 14 bits of the bin value (not transposed) and the values being the samples which have that bin value. Then I think you could do a fast distance query for a new sample by finding its matching bins and adding the values from each match.

I think the efficiency of the 'adding the values from each match' step would determine whether this is faster or slower than the default method here. Starting with sparse vectors of integers (i.e. only those samples where there is a match) probably makes sense.
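A minimal sketch of the idea, to make the lookup concrete. It assumes each sketch is a plain list with one integer value per bin, and it reports the fraction of matching bins per sample as a similarity score; turning that into the distance sketchlib actually reports is left out. The names `build_reverse_index` and `query_matches` are hypothetical, not part of sketchlib.

```python
from collections import Counter, defaultdict


def build_reverse_index(sketches):
    """Build one dictionary per bin: bin value -> list of sample indices.

    `sketches` is assumed to be a list of equal-length lists of ints,
    one list per sample and one value per bin.
    """
    n_bins = len(sketches[0])
    index = [defaultdict(list) for _ in range(n_bins)]
    for sample_idx, sketch in enumerate(sketches):
        for bin_idx, value in enumerate(sketch):
            index[bin_idx][value].append(sample_idx)
    return index


def query_matches(index, sketch):
    """Return {sample index: fraction of bins matching the query}.

    Only samples with at least one matching bin appear in the result,
    so the accumulator behaves like a sparse vector of integer counts.
    """
    matches = Counter()
    for bin_idx, value in enumerate(sketch):
        # .get() avoids inserting empty entries into the defaultdict
        for sample_idx in index[bin_idx].get(value, ()):
            matches[sample_idx] += 1
    n_bins = len(index)
    return {sample: count / n_bins for sample, count in matches.items()}


reference = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]]
index = build_reverse_index(reference)
result = query_matches(index, [1, 2, 3, 4])
```

Here the third sample never appears in `result`, which is the point of the sparse accumulator: the cost of a query scales with the number of matches rather than the number of samples.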

johnlees · 2024-05-09T16:13:16Z

Linclust for sketchlib

Find groups of queries which share a k-mer in a bin
Calculate distances from these to the centre (the longest)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
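The clustering step quoted above could be sketched roughly as follows. This is an illustration under assumptions, not the Linclust or sketchlib implementation: `lengths` maps sample names to lengths, `edges` holds the validated (centre, member) pairs, and ties in length are broken by name, which the quoted description does not specify. The function name `greedy_cluster` is hypothetical.

```python
from collections import defaultdict


def greedy_cluster(lengths, edges):
    """Greedy clustering over validated centre -> member edges.

    Returns {representative: sorted list of cluster members}.
    """
    neighbours = defaultdict(set)
    for centre, member in edges:
        neighbours[centre].add(member)
        neighbours[member].add(centre)  # add the reverse edge

    # Sort by decreasing length; name order as tie-break is an assumption
    ordered = sorted(lengths, key=lambda s: (-lengths[s], s))
    unassigned = set(ordered)

    clusters = {}
    for rep in ordered:
        if rep not in unassigned:
            continue  # already absorbed into an earlier cluster
        # Take the top sample plus everything still listed that shares an edge
        members = {m for m in neighbours[rep] if m in unassigned}
        members.add(rep)
        unassigned -= members
        clusters[rep] = sorted(members)
    return clusters


lengths = {"a": 100, "b": 90, "c": 80, "d": 70}
edges = [("a", "b"), ("b", "c")]
clusters = greedy_cluster(lengths, edges)
```

Note that "c" ends up as its own representative: its only edge is to "b", which has already been removed from the list with "a".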

johnlees · 2024-05-10T13:31:47Z

roaringbitmap might be appropriate here to store the samples in which each hash is present
https://docs.rs/roaring/latest/roaring/bitmap/struct.RoaringBitmap.html
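To show what storing samples per hash as a bitmap buys, here is a plain-Python stand-in that uses an arbitrary-precision integer as an uncompressed bitset. It does not use the roaring crate or reproduce its API; a real roaring bitmap would add compression on top of the same set semantics. The helper names `add_sample` and `samples_in` are hypothetical.

```python
def add_sample(postings, value, sample_idx):
    """Record that `sample_idx` contains hash `value` by setting one bit."""
    postings[value] = postings.get(value, 0) | (1 << sample_idx)


def samples_in(bitmap):
    """List the sample indices whose bits are set, in increasing order."""
    out = []
    idx = 0
    while bitmap:
        if bitmap & 1:
            out.append(idx)
        bitmap >>= 1
        idx += 1
    return out


postings = {}
add_sample(postings, 0x2A, 0)
add_sample(postings, 0x2A, 3)
add_sample(postings, 0x11, 3)

# Samples sharing both hashes: a single bitwise AND of the two bitmaps
shared = samples_in(postings[0x2A] & postings[0x11])
```

The appeal is that intersections and unions of sample sets become single bitwise operations instead of list merges.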

johnlees · 2024-09-13T08:19:31Z

Tracking this in #21 now

johnlees added the research testing an algorithm idea label Apr 21, 2024

johnlees self-assigned this May 8, 2024

johnlees changed the title ~~Try reverse index for queries~~ Try reverse index for queries/linclust with bins May 9, 2024

johnlees assigned johannahelene Sep 13, 2024

johnlees mentioned this issue Sep 13, 2024

Dereplication of genomes #21

Open

johnlees closed this as not planned Sep 13, 2024
