[Research / Analysis] Automatic clustering and subclustering of tweets in the embedding space #3

Open
AbrahamSanders opened this issue Apr 11, 2020 · 0 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed), research (needs investigation or trial of one or more approaches)

Comments

@AbrahamSanders
Collaborator

Right now we use the "elbow method" to manually choose the number of top-level k-means clusters, and then we run k-means again on each top-level cluster with a fixed number of sub-clusters.

Open question
We want to evaluate alternatives to the "elbow" method that can automate selecting the optimal number of clusters at both the top level and the sub-level. Some good starting points for investigation are
https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6
and
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
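
To make the question concrete, here is a minimal sketch of one commonly used alternative: picking k by the mean silhouette score instead of reading an elbow plot by eye. It is not the project's code. The helper name `choose_k_by_silhouette`, the candidate range of k, and the synthetic blobs standing in for tweet embeddings are all assumptions for illustration, and it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def choose_k_by_silhouette(embeddings, k_values, random_state=0):
    """Hypothetical helper: fit k-means for each candidate k and
    return (best_k, scores), where scores maps k to its mean silhouette."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(embeddings)
        # Mean silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for tweet embeddings (an assumption, not real data):
# three well-separated blobs in 8 dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [-10.0] * 8])
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(50, 8)) for c in centers])

best_k, scores = choose_k_by_silhouette(embeddings, range(2, 7))
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

The same helper could be called again on the rows of each top-level cluster to choose a per-cluster number of sub-clusters rather than a fixed one. One caveat to investigate: the silhouette score is only defined for k >= 2, so it will always propose a split, even for a top-level cluster that has no real sub-structure.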

Alternatively:
Perhaps we can use hierarchical clustering and then slice the dendrogram at two heights to get our top-level and sub-level clusters. We should also evaluate whether the number of clusters can be chosen automatically when slicing the dendrogram. A good starting point (in R) is here: https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
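
As a rough Python counterpart to the R starting point, the sketch below builds a Ward linkage with SciPy and cuts the same tree twice, once coarsely for top-level clusters and once more finely for sub-clusters. The function names, the "largest gap between merge heights" heuristic for suggesting a cluster count, and the synthetic data are assumptions for illustration, not an established part of this project.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def suggest_n_clusters(Z, max_clusters=10):
    """Hypothetical heuristic: choose the cut that sits in the largest
    gap between consecutive merge heights among the last merges."""
    last = Z[-max_clusters:, 2]   # heights of the final merges, ascending
    gaps = np.diff(last)          # gaps[i] separates a cut into (max_clusters - i) clusters
    return int(max_clusters - np.argmax(gaps))


def cut_two_levels(Z, n_top, n_sub_total):
    """Cut one dendrogram at two heights. Sub-level labels nest inside
    top-level labels because both cuts come from the same tree."""
    top = fcluster(Z, t=n_top, criterion="maxclust")
    sub = fcluster(Z, t=n_sub_total, criterion="maxclust")
    return top, sub


# Synthetic stand-in for tweet embeddings (an assumption, not real data):
# two distant groups, each made of two nearby blobs.
rng = np.random.default_rng(0)
centers = np.array([[-20.0, -2.0], [-20.0, 2.0], [20.0, -2.0], [20.0, 2.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

Z = linkage(points, method="ward")
n_top = suggest_n_clusters(Z)
top, sub = cut_two_levels(Z, n_top=n_top, n_sub_total=4)
print(n_top, len(set(top)), len(set(sub)))
```

Here the total number of sub-clusters is still passed in by hand. Whether a gap heuristic like this one, or a metric such as the silhouette score applied per subtree, can choose it reliably on real tweet embeddings is exactly the feasibility question raised above. Memory is another point to check: `linkage` on raw observations works from all pairwise distances, which grows quadratically with the number of tweets.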

@AbrahamSanders added the research, good first issue, and help wanted labels on Apr 12, 2020