Commit af0a1a2: Update clustering.html
sasibonu authored Mar 2, 2024
1 parent cc12a0c
Showing 1 changed file (clustering.html) with 3 additions and 19 deletions.
@@ -109,8 +109,6 @@ <h2 class="project-title">Clustering</h2>
<p>Since clustering uses distance metrics to judge how close data points are, let's look into distance metrics first. Three types of distance can be used to determine clusters: Euclidean, Manhattan, and Cosine.</p>
<div class="project-slider col-md-12">
<img src="images/projects/euclidean.png" alt="Slide 1">
</div> <!-- /.project-slider -->
<p> </p>
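For illustration, the Euclidean distance is the square root of the sum of squared coordinate-wise differences. This is a minimal NumPy sketch, not the notebook code behind these plots, and `euclidean` is a hypothetical helper name:

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate-wise differences
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

print(euclidean([0, 0], [3, 4]))  # → 5.0 (the classic 3-4-5 triangle)
```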

@@ -119,17 +117,13 @@ <h3>Euclidean</h3>

<div class="project-slider col-md-12">
<img src="images/projects/manhattan.png" alt="Slide 1">
</div> <!-- /.project-slider -->

<h3>Manhattan</h3>
<p> This distance metric computes the distance between two data points as the sum of the absolute differences between their respective feature values. It is frequently used when features have different units of measurement or when the data is categorical. Because the distance depends only on the coordinate-wise differences, rather than their squares, it is less sensitive to outliers than the Euclidean distance.</p>
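The sum-of-absolute-differences computation, and its milder reaction to a single outlying coordinate, can be sketched as follows (a synthetic example with hypothetical helper names, not the project's data):

```python
import numpy as np

def manhattan(a, b):
    # sum of absolute coordinate-wise differences
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sum(np.abs(diff)))

def euclidean(a, b):
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

# The single outlying coordinate (the 10) is squared under Euclidean
# distance, so it dominates the result; under Manhattan it only adds linearly.
a, b = [0, 0, 0], [1, 1, 10]
print(manhattan(a, b))   # → 12.0
print(euclidean(a, b))   # ~10.1, dominated by the one large difference
```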

<div class="project-slider col-md-12">
<img src="images/projects/cosine.png" alt="Slide 1">
</div> <!-- /.project-slider -->

<h3>Cosine</h3>
@@ -146,8 +140,6 @@ <h3>Cosine</h3>

<div class="project-slider col-md-12">
<img src="images/projects/cluster_2.png" alt="Slide 1">
</div> <!-- /.project-slider -->

<p>The data looks fairly well grouped. To get an idea of how many clusters to begin with, or how many clusters would be a good fit, a method known as the elbow method is used.
@@ -156,8 +148,6 @@ <h3>Cosine</h3>

<div class="project-slider col-md-12">
<img src="images/projects/elbow.png" alt="Slide 1">
</div> <!-- /.project-slider -->
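A hedged sketch of how an elbow curve like the one above can be computed, assuming scikit-learn is available (the blob data here is synthetic, not the project's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated synthetic blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# The "elbow" is where inertia stops dropping sharply; plotting
# range(1, 7) against inertias would show the bend at k = 3 here.
```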

<p>Elbow method suggests that 3 or 4 should be a good value of k to begin with, as the curve starts to flatten out after 4. This signifies that there's no real change after you increase the number of clusters from 4.
@@ -166,28 +156,22 @@ <h3>Cosine</h3>

<div class="project-slider col-md-12">
<img src="images/projects/cluster_4.png" alt="Slide 1">
</div> <!-- /.project-slider -->

<p> Another method that can be used to see which number of clusters gives a good fit is the silhouette method.
It is an additional guideline for deciding how many clusters a dataset should have before applying clustering algorithms like k-means. It gauges an object's cohesion (how similar it is to its own cluster) against its separation (how similar it is to other clusters).
A high silhouette score means that the object is well matched to its own cluster and poorly matched to neighboring clusters. The silhouette score ranges from -1 to 1.
</p>

<div class="project-slider col-md-12">
<img src="images/projects/silhouette.png" alt="Slide 1">
</div> <!-- /.project-slider -->
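Assuming scikit-learn, the silhouette comparison across candidate values of k can be sketched like this (synthetic blob data again, not the project's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 7):  # a silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean score, in [-1, 1]

best_k = max(scores, key=scores.get)  # the k with the highest mean score
```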


<p>It can be seen from the plot above that k = 3 has the highest silhouette score. Now, let's use k = 3 to run k-means clustering.</p>

<div class="project-slider col-md-12">
<img src="images/projects/cluster_3.png" alt="Slide 1">
</div> <!-- /.project-slider -->
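A run with k = 3 can be sketched as follows, again assuming scikit-learn and synthetic blobs in place of the project's data; coloring the points by `labels` is what produces a plot like the one above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each of the 150 points
centers = km.cluster_centers_  # one centroid per cluster
```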

<p> It is a good fit: the cluster borders are clear, and it's visible how well the data is separated.</p>
