Dereplication of genomes #21

johnlees · 2024-07-22T11:18:06Z

Notes:
Find groups of queries which share a k-mer in a bin
Calculate distances from these to the centre (the longest sequence)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'

Use a reverse index
First step: sketch between those which share a bin
Can give assembly quality as input and presort, so the top of the list will always be the best (to find the representative to align against)
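
A minimal sketch of the greedy clustering step quoted above, assuming sequences are identified by integer index and every edge index is smaller than the number of sequences. The function name `greedy_cluster` and its signature are made up for illustration. The sort key here is length, as in the quote; a presorted assembly-quality score could be substituted.

```rust
use std::collections::{HashMap, HashSet};

/// Greedy clustering over validated centre -> member edges.
/// `lengths[i]` is the length of sequence `i`; `edges` holds (centre, member) pairs.
/// Returns (representative, members) pairs; members include the representative.
/// Hypothetical helper: names and signature are assumptions, not existing code.
fn greedy_cluster(lengths: &[usize], edges: &[(usize, usize)]) -> Vec<(usize, Vec<usize>)> {
    // Add every edge in both directions, so reverse edges are present.
    let mut neighbours: HashMap<usize, HashSet<usize>> = HashMap::new();
    for &(a, b) in edges {
        neighbours.entry(a).or_default().insert(b);
        neighbours.entry(b).or_default().insert(a);
    }

    // Sort by decreasing length (ties broken by index so the result is deterministic).
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]).then(a.cmp(&b)));

    let mut assigned = vec![false; lengths.len()];
    let mut clusters = Vec::new();
    for &rep in &order {
        if assigned[rep] {
            continue; // already removed from the list as a member of an earlier cluster
        }
        assigned[rep] = true;
        let mut members = vec![rep];
        if let Some(ns) = neighbours.get(&rep) {
            // Take all still-unassigned sequences that share an edge with the top sequence.
            let mut ns: Vec<usize> = ns.iter().copied().filter(|&n| !assigned[n]).collect();
            ns.sort_unstable();
            for n in ns {
                assigned[n] = true;
                members.push(n);
            }
        }
        clusters.push((rep, members));
    }
    clusters
}
```

Marking sequences as assigned, rather than physically removing them from the sorted list, keeps the loop a single pass over the sort order.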

johnlees · 2024-08-21T15:27:09Z

Rather than converting from current .skd, probably easier to have a dedicated reverse index constructor function, then store an enum with the sketch type in the metadata.
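
One possible shape for the metadata enum, as a sketch only. The type, variant, and field names are hypothetical, and the .ski extension is the one proposed later in this thread.

```rust
/// Which kind of sketch a file holds (variant names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchType {
    Standard,
    Inverted,
}

/// Hypothetical metadata record; only the idea of storing the sketch type
/// comes from the comment above, the other field is illustrative.
#[derive(Debug)]
struct Metadata {
    sketch_type: SketchType,
    n_samples: usize,
}

impl Metadata {
    /// Extension of the companion data file for this sketch type.
    fn data_extension(&self) -> &'static str {
        match self.sketch_type {
            SketchType::Standard => "skd",
            SketchType::Inverted => "ski",
        }
    }
}
```

Reading the enum from the metadata first would let a loader pick the right deserialiser before touching the data file.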

johnlees · 2024-09-13T08:01:40Z

See also #11 for some earlier thoughts on this.

First, add a new command inverted.
We need to split Sketch::new() into two functions, taking out the first part, which creates the signs vec, so that it can be called separately in the new function.
Then, for the inverted index, we want to go from Vec<Vec<u64>>, which is a list of signs across the samples (each signs is a Vec<u64> for the sketch across the bins).
We then want to make a Vec<HashMap<u64, BitVec>> from these. The outer Vec is across the bins, in the same order as the Vec<u64>. For each bin, the HashMap has the bin value as the u64 key, and the set of samples with that bin value as the value, stored as a BitVec. This BitVec has the same length as the number of samples, and is all zeros except for 1 bits at the indexes of the samples with that bin value. See https://docs.rs/bitvec/latest/bitvec/vec/struct.BitVec.html.
This then needs to be saved instead of the .skd. For now, I think just use serde around a new struct, and save it as an .ski file.
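
A rough sketch of the construction described above, under two assumptions: a plain Vec<bool> stands in for BitVec so the example needs only std, and the function name `build_inverted_index` is hypothetical. Saving with serde is left out.

```rust
use std::collections::HashMap;

/// Stand-in for `bitvec::vec::BitVec`, used so this sketch needs only std.
type SampleBits = Vec<bool>;

/// Build one HashMap per bin from per-sample signs.
/// `signs[s][b]` is the value of bin `b` in sample `s`.
/// Hypothetical helper: not existing code in the repository.
fn build_inverted_index(signs: &[Vec<u64>]) -> Vec<HashMap<u64, SampleBits>> {
    let n_samples = signs.len();
    let n_bins = signs.first().map_or(0, |s| s.len());

    // Outer Vec is across bins, in the same order as each sample's signs.
    let mut index: Vec<HashMap<u64, SampleBits>> = vec![HashMap::new(); n_bins];

    for (sample_idx, sample_signs) in signs.iter().enumerate() {
        assert_eq!(
            sample_signs.len(),
            n_bins,
            "all sketches must have the same number of bins"
        );
        for (bin_idx, &value) in sample_signs.iter().enumerate() {
            // Start from all zeros, then set the bit for this sample.
            let bits = index[bin_idx]
                .entry(value)
                .or_insert_with(|| vec![false; n_samples]);
            bits[sample_idx] = true;
        }
    }
    index
}
```

Each distinct bin value costs one full-length bit list here, which is the memory use that a compressed bitmap would later reduce.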

First use case would be to add a distance function against a new query sample:

Sketch the new sample into signs
Iterate over the inverted index Vec and the new sketch together.
For each bin in this iteration, check if the query's bin value is a key in the HashMap. If it is, add the matching BitVec to a running per-sample count, otherwise do nothing.
The running count will then contain the number of matching bins for every sample in the index.
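
The query steps above might look roughly like this, again assuming Vec<bool> in place of BitVec and a made-up function name `count_matching_bins`. Turning the counts into a distance is not shown.

```rust
use std::collections::HashMap;

/// Count, for each indexed sample, how many bins match the query sketch.
/// `index` has one HashMap per bin; `query_signs` has one value per bin.
/// Hypothetical helper: Vec<bool> stands in for BitVec.
fn count_matching_bins(
    index: &[HashMap<u64, Vec<bool>>],
    query_signs: &[u64],
    n_samples: usize,
) -> Vec<u32> {
    let mut matches = vec![0u32; n_samples];
    // Walk the index and the query sketch bin by bin.
    for (bin_map, query_value) in index.iter().zip(query_signs) {
        // Only bins whose query value was seen in some sample contribute.
        if let Some(bits) = bin_map.get(query_value) {
            for (count, &present) in matches.iter_mut().zip(bits) {
                if present {
                    *count += 1;
                }
            }
        }
    }
    matches
}
```

The work per query is one hash lookup per bin plus one pass over the samples for each bin that hits.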

Then later, some optimisations:

The u64 keys can become u16, just taking the LSBs (similar to bbits).
The BitVec can be replaced with https://docs.rs/roaring/latest/roaring/bitmap/struct.RoaringBitmap.html.
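
For the key-size optimisation, taking the LSBs could be as simple as the following sketch (the function name `truncate_key` is hypothetical). Distinct u64 values can then map to the same u16 key, so some extra false matches are to be expected.

```rust
/// Keep only the 16 least-significant bits of a bin value.
/// Hypothetical helper illustrating the u64 -> u16 key reduction.
fn truncate_key(value: u64) -> u16 {
    (value & 0xFFFF) as u16
}
```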

johnlees · 2024-09-13T08:03:24Z

Also, ignore parallelisation and memory use for now – I will try and add these optimisations in future.

johnlees added the functionality (adding a new feature) label Jul 22, 2024

johnlees assigned johannahelene Jul 22, 2024

johnlees changed the title ~~Rereplication of genomes~~ Dereplication of genomes Aug 21, 2024

johnlees mentioned this issue Sep 13, 2024

Try reverse index for queries/linclust with bins #11

Closed
