Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try reverse index for queries/linclust with bins #11

Closed
johnlees opened this issue Apr 21, 2024 · 3 comments
Closed

Try reverse index for queries/linclust with bins #11

johnlees opened this issue Apr 21, 2024 · 3 comments
Assignees
Labels
research testing an algorithm idea

Comments

@johnlees
Copy link
Member

Another way of storing the data would be to have each sketch bin stored as a dictionary, with the key as the 14-bits of the bin value (not transposed) and values as the samples which had that bin. Then I think you could do a fast distance query for a new sample by finding matching bins and adding the values from each match.

I think the efficiency of the 'adding the values from each match' would determine whether this is faster or slower than the default method here. Starting with sparse vectors of integers (i.e. just those samples where there is a match) probably makes sense.

@johnlees johnlees added the research testing an algorithm idea label Apr 21, 2024
@johnlees johnlees self-assigned this May 8, 2024
@johnlees
Copy link
Member Author

johnlees commented May 9, 2024

Linclust for sketchlib

Find group of queries which share k-mer in a bin
Calculate dists of these to centre (longest)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'

@johnlees johnlees changed the title Try reverse index for queries Try reverse index for queries/linclust with bins May 9, 2024
@johnlees
Copy link
Member Author

roaringbitmap might be appropriate here to store the samples in which each hash is present
https://docs.rs/roaring/latest/roaring/bitmap/struct.RoaringBitmap.html

@johnlees
Copy link
Member Author

Tracking this in #21 now

@johnlees johnlees closed this as not planned Won't fix, can't repro, duplicate, stale Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
research testing an algorithm idea
Projects
None yet
Development

No branches or pull requests

2 participants