Dereplication of genomes #21

Open
johnlees opened this issue Jul 22, 2024 · 3 comments

@johnlees
Member

johnlees commented Jul 22, 2024

Notes:
Find groups of queries that share a k-mer in a bin
Calculate the distances from these to the centre (the longest sequence)
Cluster:
'Briefly, the file with the validated directed edges from center sequences to member sequences is read in and all reverse edges are added. The list of input sequences is sorted by decreasing length. While the list is not yet empty, the top sequence is removed from the list, together with all sequences still in the list that share an edge with it. These sequences form a new cluster with the top sequence as its representative.'
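
The quoted clustering step can be sketched as below. This is only an illustration under assumptions, not the planned implementation: sequences are identified by their index, `lengths` and `edges` are hypothetical inputs (sequence lengths and validated centre-to-member pairs), and the function name `greedy_cluster` is made up for this example.

```rust
use std::collections::{HashMap, HashSet};

/// Greedy clustering as described in the quote above.
/// `lengths[i]` is the length of sequence `i` (hypothetical input).
/// `edges` holds validated (centre, member) pairs; every index must be
/// smaller than `lengths.len()`.
/// Returns one (representative, members) pair per cluster; the
/// representative is always the first member.
fn greedy_cluster(lengths: &[usize], edges: &[(usize, usize)]) -> Vec<(usize, Vec<usize>)> {
    // Add the reverse of every edge so that adjacency is symmetric.
    let mut adj: HashMap<usize, HashSet<usize>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().insert(b);
        adj.entry(b).or_default().insert(a);
    }

    // Sort sequence indices by decreasing length (ties broken by index).
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by(|&a, &b| lengths[b].cmp(&lengths[a]).then(a.cmp(&b)));

    // `assigned[i]` plays the role of "removed from the list".
    let mut assigned = vec![false; lengths.len()];
    let mut clusters = Vec::new();
    for &rep in &order {
        if assigned[rep] {
            continue;
        }
        assigned[rep] = true;
        let mut members = vec![rep];
        if let Some(neighbours) = adj.get(&rep) {
            // Take every neighbour that is still unassigned.
            let mut linked: Vec<usize> = neighbours
                .iter()
                .copied()
                .filter(|&n| !assigned[n])
                .collect();
            linked.sort_unstable();
            for n in linked {
                assigned[n] = true;
                members.push(n);
            }
        }
        clusters.push((rep, members));
    }
    clusters
}
```

If assembly quality were supplied instead of length, the same loop would work with the sort key swapped, so the top of the list is always the preferred representative.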

Use a reverse index
First step: sketch between those which share a bin
Can give assembly quality as input and presort, so the top will always be the best (to find a representative to align against)

@johnlees johnlees added the functionality adding a new feature label Jul 22, 2024
@johnlees johnlees changed the title from "Rereplication of genomes" to "Dereplication of genomes" Aug 21, 2024
@johnlees
Member Author

Rather than converting from current .skd, probably easier to have a dedicated reverse index constructor function, then store an enum with the sketch type in the metadata.

@johnlees
Member Author

See also #11 for some earlier thoughts on this.

  • First, add a new command inverted.
  • We need to make Sketch::new() into two functions, taking the first part out which creates the signs vec, so this can be called separately in the new function.
  • Then, for the inverted index, we want to start from the Vec<Vec<u64>>, which is a list of signs across the samples (each signs is a Vec<u64> holding the sketch across the bins).
  • We then want to make a Vec<HashMap<u64, BitVec>> from these. The outer Vec is across the bins, in the same order as the Vec<u64>. For each bin, the HashMap has the bin value as the u64 key, and the set of samples with that bin value as the value, stored as a BitVec. This BitVec will be all zeros, with the same length as the number of samples, except for 1 bits set at the indexes of the samples with that bin value. See https://docs.rs/bitvec/latest/bitvec/vec/struct.BitVec.html.
  • This then needs to be saved instead of the .skd. For now, I think just use serde around a new struct, and save it as an .ski file.
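
A minimal sketch of the construction described in the list above, using only the standard library. To keep it dependency-free, a `Vec<bool>` stands in for `BitVec` here (an assumption of this example, as are the names `SampleBits` and `build_inverted_index`); saving to an .ski file with serde is left out.

```rust
use std::collections::HashMap;

/// Stand-in for `bitvec::vec::BitVec`, so the example needs only std.
type SampleBits = Vec<bool>;

/// Turn per-sample signs (outer Vec: samples, inner Vec: bins) into a
/// per-bin map from bin value to the samples that carry that value.
fn build_inverted_index(signs: &[Vec<u64>]) -> Vec<HashMap<u64, SampleBits>> {
    let n_samples = signs.len();
    let n_bins = signs.first().map_or(0, |s| s.len());

    // One HashMap per bin, in the same order as the bins of each sketch.
    let mut index: Vec<HashMap<u64, SampleBits>> = vec![HashMap::new(); n_bins];
    for (sample_idx, sketch) in signs.iter().enumerate() {
        assert_eq!(sketch.len(), n_bins, "all sketches must have the same number of bins");
        for (bin_idx, &bin_value) in sketch.iter().enumerate() {
            // Start from all zeros, then set the bit for this sample.
            let bits = index[bin_idx]
                .entry(bin_value)
                .or_insert_with(|| vec![false; n_samples]);
            bits[sample_idx] = true;
        }
    }
    index
}
```

For three samples with signs `[1, 2, 3]`, `[1, 5, 3]` and `[4, 2, 3]`, the first bin maps value 1 to samples 0 and 1, and value 4 to sample 2.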

First use case would be to add a distance function against a new query sample:

  • Sketch the new sample into signs
  • Iterate over the inverted index Vec and the new sketch together.
  • For each bin in this iteration, check if the query's bin value is in the HashMap. If it is, add its BitVec to a running per-sample total (one counter per sample), otherwise do nothing.
  • The totals will then contain the number of matching bins for every sample in the index.
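
The query steps above can be sketched as follows. A bit vector holds one bit per sample, so this example accumulates the hits into a separate vector of per-sample counters. As an assumption of the example, `Vec<bool>` again stands in for `BitVec`, and the function name `count_matching_bins` is hypothetical.

```rust
use std::collections::HashMap;

/// Count, for every indexed sample, how many bins match the query sketch.
/// `index` has one HashMap per bin (bin value -> one flag per sample);
/// `query_signs` is the query sketch, one value per bin.
fn count_matching_bins(
    index: &[HashMap<u64, Vec<bool>>],
    query_signs: &[u64],
    n_samples: usize,
) -> Vec<u32> {
    assert_eq!(index.len(), query_signs.len(), "query and index must have the same number of bins");
    let mut matches = vec![0u32; n_samples];

    // Walk the index and the query sketch together, bin by bin.
    for (bin, &value) in index.iter().zip(query_signs) {
        // Only bin values present in the index contribute anything.
        if let Some(bits) = bin.get(&value) {
            for (count, &hit) in matches.iter_mut().zip(bits) {
                if hit {
                    *count += 1;
                }
            }
        }
    }
    matches
}
```

Dividing each count by the number of bins would give the fraction of matching bins per sample, from which a distance could then be derived.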

Then later, some optimisations:

@johnlees
Member Author

Also, ignore parallelisation and memory use for now – I will try and add these optimisations in future.


2 participants