Initial version of clustering that uses the conv filter activations #44
Builds on the feature added in https://github.com/kundajelab/tfmodisco/releases/tag/v0.5.2.0. Data tracks supplied via the `other_tracks` argument can now be used for calculating the affinity matrix. To use the feature, supply the names of the relevant tracks to the `tracknames_to_use_for_embedding` argument. Those tracks are then used to derive the seqlet embeddings (as opposed to the gapped k-mer embedding, which is what the other workflow uses). The embedding for each seqlet is created by summing the value of each channel in the concatenated data tracks across the length of the seqlet. The cosine similarity between embeddings is used to create the affinity matrix, which (as with the other workflow) is density-adapted and supplied to multiple rounds of Louvain. There is no separate "fine-grained" affinity matrix calculation, because that step was introduced only because the affinity matrix derived from the gapped k-mer embedding was considered too coarse-grained. Downstream post-processing remains the same: in particular, input-level importance scores are still used to align seqlets within a cluster and to split/merge clusters in the post-processing phase.

A notebook demonstrating the feature is at https://github.com/kundajelab/tfmodisco/blob/fef7d28480ee88236dee0cd3d3660b07f566e0f7/test/nb_test/talgata/TF%20MoDISco%20TAL%20GATA%20with%20Activations.ipynb
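
For reviewers who want a concrete picture of the embedding and affinity steps described above, here is a minimal NumPy sketch. It is not the code in this PR: the function names `seqlet_embeddings` and `cosine_affinity` and the array layout `(num_seqlets, seqlet_len, num_channels)` are assumptions made for illustration, and the density adaptation and Louvain steps are omitted.

```python
import numpy as np


def seqlet_embeddings(track_values):
    """Sum each channel across the seqlet length.

    track_values: array of shape (num_seqlets, seqlet_len, num_channels),
    i.e. the chosen data tracks concatenated along the channel axis
    (this layout is an assumption of the sketch).
    Returns an array of shape (num_seqlets, num_channels).
    """
    return np.asarray(track_values, dtype=float).sum(axis=1)


def cosine_affinity(embeddings, eps=1e-12):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero embeddings.
    unit = embeddings / np.maximum(norms, eps)
    return unit @ unit.T


# Toy input: 5 seqlets, 20 positions, 8 conv-filter channels.
rng = np.random.default_rng(0)
tracks = rng.random((5, 20, 8))

emb = seqlet_embeddings(tracks)   # shape (5, 8)
aff = cosine_affinity(emb)        # shape (5, 5), symmetric
```

In the actual workflow the resulting matrix would then be density-adapted and passed to Louvain, as described above.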
@tchiruvolu can you try applying this to the APA dataset?