Skip to content

Latest commit

 

History

History
77 lines (53 loc) · 1.7 KB

section_clustering.md

File metadata and controls

77 lines (53 loc) · 1.7 KB

Clustering

Notes:

What?

  • Classes are not known
  • Assign similar documents to same class / cluster

Notes:

Why?

  • Show similar documents
  • Cluster search results
  • Exploratory browsing
  • News aggregation

Notes: Why?

How?

  • Unsupervised learning
  • No labeled documents available

Notes:

Types of clustering

  • Hard: Document belongs to exactly one cluster
  • Soft: Document can belong to multiple clusters with varying degrees
  • Flat: One level of clusters
  • Hierarchical: Sub-clusters

Notes:

K-Means

  1. Set $K$ random centroids
  2. Assign each document to nearest centroid
  3. Move centroids to minimize distance to documents
  4. Terminate or goto 2

Notes:

K-Means

© 2008 Cambridge University Press

Notes:

Termination

  • Centroids do not move
  • Assignment do not change
  • Sum of distances does not decrease
  • Sum of distances is below threshold
  • After $n$ iterations

Notes:

K-Means disadvantages

  • Can get stuck on local minimum
  • Can build singleton clusters for outliers
  • Can build empty clusters

Notes: