intersections vs weighted unions or linear combinations for combining sets? #918
Unanswered
jlmelville
asked this question in
Q&A
Replies: 1 comment
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
The motivation for this discussion: I continue to run into datasets which cause the spectral initialization to hang. Usually this seems to be due to poor conditioning of the input sparse matrix, which usually means that the graph is technically connected enough to evade disconnection detection but the edges between clusters in the data are sufficiently weak to give the sparse matrix routines trouble (this also bedevils libraries like Spectra so it's not a sklearn-specific issue).
One strategy that seems simple implementation-wise, and in keeping conceptually with UMAP:
log n
neighbors per item, which has a very high probability of creating a connected graph (see e.g. this lecture PDF for the basic graph theory). This is easy to implement thanks topynndescent.distances
and a basic numba loop.But it doesn't seem ideal to combine these with equal weights. Really, I only want the "global" random set to contribute just enough to the combined set to prevent problems, but not to contribute much else.
general_simplicial_set_intersection
provides aweight
parameter which seems to do what I want, but I have noticed that the usable range ofweight
is quite small: usually between0.3
-0.5
or thereabouts. Outside of that and results are minimally different from "all the normal fuzzy set" or "all the random fuzzy set", which is not very intuitive.At any rate, I have implemented this and it seems to do the trick on the datasets that have caused the spectral initialization grief without it. Based on a comparison with datasets that do fine with the spectral initialization or using PCA initialization, results are not badly perturbed. @lmcinnes if this isn't completely theoretically horrible, if you are interested in a PR for this then I am happy to have a go at it, but is intersection the right way to do the combination?
In contrast to intersection,
general_simplicial_set_union
doesn't provide any weighting. I am curious about this:fuzzy_simplicial_set
carries out a set union, but also provides a
set_op_mix_ratio
parameter that can interpolate between the union and intersection. Could something like this work withgeneral_simplicial_set_union
? It's obvious how to add this, but it doesn't really achieve anything in my experiments. I think this is because ofreset_local_connectivity
which also does a fuzzy set union, so the weighting would also need to be applied there too?Finally, a simple linear combination, e.g.
weight * knn_set + (1-weight)*random_set
(followed byreset_local_connectivity
) withweight=0.9
seems to also be effective. I assume, though, that this is conceptually unwise because it doesn't correspond to any fuzzy set operation?The code to generate an input that can be passed to umap's
init
parameter is pretty short so I am happy to put it here if it is of interest to anyone.Beta Was this translation helpful? Give feedback.
All reactions