Dereplication of genomes #21
Rather than converting from the current .skd, it is probably easier to have a dedicated reverse index constructor function, and then store an enum with the sketch type in the metadata.
See also #11 for some earlier thoughts on this.
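A rough sketch of what a dedicated constructor plus a sketch-type enum could look like. This is written in Python purely for illustration; `SketchType`, `ReverseIndex` and `build_reverse_index` are hypothetical names, and the layout assumed here (one fixed-length list of bin values per sample) is an assumption rather than the actual .skd format.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class SketchType(Enum):
    # Hypothetical tags; the real metadata layout is not specified in this issue
    FORWARD = "forward"
    REVERSE_INDEX = "reverse_index"


@dataclass
class ReverseIndex:
    sketch_type: SketchType  # stored alongside the other metadata
    n_bins: int
    sample_names: list
    # One dict per bin: bin value -> list of sample indices holding that value
    bins: list = field(default_factory=list)


def build_reverse_index(sketches, sample_names):
    """Build the index directly from per-sample bin values.

    sketches[i][b] is the value that sample i holds in bin b (assumed layout).
    """
    n_bins = len(sketches[0]) if sketches else 0
    bins = [defaultdict(list) for _ in range(n_bins)]
    for sample_idx, sketch in enumerate(sketches):
        if len(sketch) != n_bins:
            raise ValueError("all sketches must have the same number of bins")
        for bin_idx, value in enumerate(sketch):
            bins[bin_idx][value].append(sample_idx)
    return ReverseIndex(SketchType.REVERSE_INDEX, n_bins, list(sample_names), bins)


# Three toy samples with three bins each
index = build_reverse_index([[1, 5, 9], [1, 6, 9], [2, 6, 7]], ["a", "b", "c"])
```

Building the index in its own constructor avoids a conversion pass over an existing sketch file, and the enum lets a reader of the metadata tell which layout follows.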
First use case would be to add a distance function against a new query sample:
Then later, some optimisations:
Also, ignore parallelisation and memory use for now. I will try to add these optimisations in future.
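For the first use case, a minimal single-threaded sketch of a query-versus-index distance might look like the following. The function name `query_distances` is hypothetical, the index is assumed to be one dict per bin mapping a bin value to the sample indices that hold it, and the distance used here (one minus the fraction of matching bins) is an illustrative choice, not something specified in this issue.

```python
from collections import Counter


def query_distances(index_bins, n_samples, query_sketch):
    """Return 1 - (fraction of matching bins) for every indexed sample.

    index_bins[b] maps a bin value to the list of sample indices with that
    value in bin b (assumed layout). Only samples sharing at least one bin
    value with the query are ever touched in the inner loop.
    """
    n_bins = len(index_bins)
    if n_bins == 0:
        raise ValueError("index has no bins")
    if len(query_sketch) != n_bins:
        raise ValueError("query sketch and index differ in number of bins")

    matches = Counter()
    for bin_idx, value in enumerate(query_sketch):
        for sample_idx in index_bins[bin_idx].get(value, ()):
            matches[sample_idx] += 1

    # Samples with no shared bins get the maximum distance of 1.0
    return [1.0 - matches[i] / n_bins for i in range(n_samples)]


# Toy index over three samples and four bins
index_bins = [
    {1: [0, 1], 2: [2]},
    {5: [0], 6: [1, 2]},
    {9: [0, 1], 7: [2]},
    {3: [0], 4: [1, 2]},
]
dists = query_distances(index_bins, 3, [1, 6, 9, 3])
```

With this toy input, `dists` is `[0.25, 0.25, 0.75]`: the query shares three of four bins with the first two samples and one with the third.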
Notes:
Find the group of queries which share a k-mer in a bin
Calculate the distances from these to the centre (the longest sequence)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
Use a reverse index
First step: calculate sketch distances between those which share a bin
Can give assembly quality as input and presort, so that the top of the list will always be the best (to find the representative to align against)
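The greedy clustering step quoted in the notes above could be sketched roughly as follows. `greedy_cluster` and its argument layout are hypothetical; the edges are assumed to be already validated (for example, by a distance threshold), and ties in length are broken by name only so that the output is deterministic.

```python
from collections import defaultdict


def greedy_cluster(lengths, edges):
    """Greedy clustering with the longest remaining sequence as representative.

    lengths: dict of sequence name -> length
    edges:   iterable of (centre, member) pairs that passed validation
    Returns a dict of representative -> sorted list of its members.
    """
    neighbours = defaultdict(set)
    for centre, member in edges:
        neighbours[centre].add(member)
        neighbours[member].add(centre)  # add the reverse edge

    # Sort by decreasing length; name is a tie-breaker for reproducibility
    order = sorted(lengths, key=lambda name: (-lengths[name], name))
    remaining = set(order)
    clusters = {}
    for name in order:
        if name not in remaining:
            continue  # already absorbed into an earlier cluster
        # Take every sequence still in the list that shares an edge with the top one
        members = (neighbours[name] & remaining) - {name}
        remaining.discard(name)
        remaining -= members
        clusters[name] = sorted(members)
    return clusters


lengths = {"g1": 500, "g2": 450, "g3": 400, "g4": 300}
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g3")]
clusters = greedy_cluster(lengths, edges)
```

Here `clusters` is `{"g1": ["g2"], "g3": ["g4"]}`. To presort on assembly quality instead, the same loop would work with a quality score swapped in for `lengths` in the sort key.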