- Classes are not known
- Assign similar documents to same class / cluster
- Show similar documents
- Cluster search results
- Exploratory browsing
- News aggregation
- Unsupervised learning
- No labeled documents available
- Hard: Document belongs to exactly one cluster
- Soft: Document can belong to multiple clusters with varying degrees
- Flat: One level of clusters
- Hierarchical: Sub-clusters
1. Set $K$ random centroids
2. Assign each document to the nearest centroid
3. Move centroids to minimize distance to documents
4. Terminate or go to 2
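The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes documents are already dense numeric vectors, uses squared Euclidean distance, picks the initial centroids by sampling $K$ documents, and stops when assignments no longer change. The names `kmeans`, `squared_distance`, `max_iter`, and `seed` are illustrative choices, not taken from the slides.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(docs, k, max_iter=100, seed=0):
    """Cluster `docs` (a list of numeric vectors) into `k` hard, flat clusters."""
    rng = random.Random(seed)
    # Step 1: use K randomly chosen documents as initial centroids (an assumption;
    # other initializations are possible).
    centroids = [list(d) for d in rng.sample(docs, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(doc, centroids[c]))
            for doc in docs
        ]
        # Step 4: terminate once the assignments stop changing.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its assigned documents.
        for c in range(k):
            members = [docs[i] for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


docs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignment = kmeans(docs, k=2)
```

On this toy input the two nearby pairs end up in separate clusters, although which cluster receives which index depends on the random initialization.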
© 2008 Cambridge University Press
- Centroids do not move
- Assignments do not change
- Sum of distances does not decrease
- Sum of distances is below threshold
- After $n$ iterations
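The termination criteria listed here can be combined into a single check, as in the sketch below. The function name `should_stop`, its parameters, and the default values for `cost_threshold` and `max_iter` are hypothetical. Exact equality on centroids is used for simplicity; a real implementation would more likely compare against a small tolerance.

```python
def should_stop(old_centroids, new_centroids, old_assignment, new_assignment,
                old_cost, new_cost, iteration,
                cost_threshold=1e-3, max_iter=100):
    """Return the reason to stop as a string, or None to keep iterating."""
    if new_centroids == old_centroids:
        return "centroids do not move"
    if new_assignment == old_assignment:
        return "assignments do not change"
    if old_cost is not None and new_cost >= old_cost:
        return "sum of distances does not decrease"
    if new_cost < cost_threshold:
        return "sum of distances is below threshold"
    if iteration >= max_iter:
        return "iteration limit reached"
    return None


# Centroids and assignments changed and the cost went down: keep iterating.
reason = should_stop([[0, 0]], [[1, 0]], [0, 1], [1, 1], 5.0, 4.0, iteration=3)
```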
- Can get stuck in a local minimum
- Can build singleton clusters for outliers
- Can build empty clusters
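Empty and singleton clusters are easy to detect once an assignment is available. The helper below is a small sketch; the name `cluster_size_report` is made up for illustration, and it assumes the assignment is a list of cluster indices in the range `0..k-1`.

```python
from collections import Counter


def cluster_size_report(assignment, k):
    """Return (empty, singletons): cluster indices with zero or one document."""
    sizes = Counter(assignment)
    empty = [c for c in range(k) if sizes[c] == 0]
    singletons = [c for c in range(k) if sizes[c] == 1]
    return empty, singletons


# Four documents, four clusters: clusters 1 and 3 are empty, cluster 2 holds one outlier.
empty, singletons = cluster_size_report([0, 0, 2, 0], k=4)
```

A common response is to re-run the clustering with a different random initialization or a smaller $K$, which also helps with poor local minima.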