-
Notifications
You must be signed in to change notification settings - Fork 2
/
Lab5-Clustering.Rmd
executable file
·252 lines (182 loc) · 6.62 KB
/
Lab5-Clustering.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "Lab 5: Clustering"
subtitle: "High Dimensional Data Analysis practicals"
author: "Milan Malfait"
date: "24 Feb 2022 <br/> (Last updated: 2022-02-22)"
---
```{r setup, include=FALSE, cache=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.show = "hold"
)
options(width = 80)
```
### [Change log](https://github.com/statOmics/HDDA/commits/master/Lab6-Clustering.Rmd) {-}
***
```{r libraries, message=FALSE, warning=FALSE}
## Install necessary packages with:
# install.packages(c("mclust", "gclus", "GGally", "tidyverse"))
# if (!requireNamespace("remotes", quietly = TRUE)) {
# install.packages("remotes")
# }
# remotes::install_github("vqv/ggbiplot")
library(mclust)
library(gclus) # contains the 'wine' data
library(ggbiplot)
library(GGally)
library(tidyverse)
theme_set(theme_minimal())
```
# The wine data
In this lab session, we will explore the [`wine`][wine] data, following the
example analysis from [Scrucca *et al.* (2016)][scrucca2016].
This dataset provides 13 measurements obtained from a chemical analysis of 178
wines grown in the same region in Italy but derived from three different
cultivars (Barolo, Grignolino, Barbera). The original cultivar labels are
provided in the dataset.
We will apply different clustering algorithms and validate them by comparing how
well the clusters capture the original classes.
```{r}
data("wine", package = "gclus")
class <- factor(wine$Class, levels = 1:3, labels = c("Barolo", "Grignolino", "Barbera"))
table(class)
X <- as.matrix(wine[, -1])
summary(X)
```
# Hierarchical clustering
### Tasks {-}
#### 1. Perform hierarhical clustering of the wine data, using a Euclidean distance matrix and the complete-linkage algorithm (see `?hclust`). Plot the clustering *dendrogram*. {-}
<details><summary>Solution</summary>
```{r}
## Calculate distance matrix and perform hierarchical clustering
wine_dist <- dist(X, method = "euclidean")
hc <- hclust(wine_dist, method = "complete")
plot(hc, labels = FALSE)
```
</details>
#### 2. Select an appropriate number of clusters from the hierarchical clustering (see `?cutree`). Visualize the clusters on a PCA biplot and compare with the original labels. {-}
```{r}
hc_clusters <- cutree(hc, k = 3)
table(class, hc_clusters)
```
<details><summary>Solution</summary>
```{r, fig.asp=1}
wine_pca <- prcomp(X, scale. = TRUE)
ggbiplot(wine_pca, groups = class) +
scale_color_brewer(palette = "Set2") +
labs(color = "Original labels") +
theme(aspect.ratio = 0.8, legend.position = "top")
ggbiplot(wine_pca, groups = factor(hc_clusters)) +
scale_color_brewer(palette = "Set2") +
labs(color = "HC clusters") +
theme(aspect.ratio = 0.8, legend.position = "top")
```
</details>
#### Bonus: can you improve the results by using different distance metrics or linkages? {-}
# Model-based clustering
### Tasks {-}
#### 1. Perform model-based clustering on the `wine` data (use [`mclust::Mclust()`][mclust]). Plot the BIC values and interpret the results. Compare the identified clusters with the original (true) labels. {-}
<details><summary>Solution</summary>
```{r}
mod <- Mclust(X)
summary(mod)
summary(mod$BIC)
```
```{r}
table(class, mod$classification)
## Annotate clusters
mc_clusters <- factor(mod$classification)
```
```{r}
plot(mod, what = "BIC", ylim = range(mod$BIC[, -(1:2)], na.rm = TRUE),
legendArgs = list(x = "bottomleft")
)
plot(mod, what = "classification")
```
There is a clear indication of a three-component mixture with covariances having
different shapes and volumes but the same orientation (VVE). See
`?mclustModelNames` for a description of the different `mclust` models.
</details>
#### 2. Visualize the clusters found by `Mclust()` on the PCA biplot. Compare with the original labels. {-}
<details><summary>Solution</summary>
```{r, fig.asp=1}
ggbiplot(wine_pca, groups = class) +
scale_color_brewer(palette = "Set2") +
labs(color = "Original labels") +
theme(aspect.ratio = 0.8, legend.position = "top")
## PCA plot annotated with clusters
ggbiplot(wine_pca, groups = mc_clusters) +
labs(color = "mclust clusters") +
scale_color_brewer(palette = "Set2") +
theme(aspect.ratio = 0.8, legend.position = "top")
```
</details>
#### 3. Perform a dimensionality reduction of the wine data using the PCA. Select an appropriate number of PC's. Redo the clustering on this reduced dimension representation and make the same figures as before. How do the results differ? {-}
<details><summary>Solution</summary>
```{r}
## Calculate total variance by summing the PC variances (sdev's squared)
tot_var <- sum(wine_pca$sdev^2)
## Create data.frame of the proportion of variance explained by each PC
wine_prop_var <- data.frame(
PC = 1:ncol(wine_pca$x),
var = wine_pca$sdev^2
) %>%
## Using `mutate` to calculate prop. var and cum. prop. var
mutate(
prop_var = var / tot_var,
cum_prop_var = cumsum(var / tot_var)
)
wine_prop_var
## Plot the proportion of variance explained by each PC
p1 <- ggplot(wine_prop_var, aes(PC, prop_var)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 6.5, col = "firebrick") +
scale_x_continuous(breaks = 1:ncol(wine_pca$x)) +
labs(y = "Proportion of variance")
## Plot the cumulative proportion of variance explained by each PC
p2 <- ggplot(wine_prop_var, aes(PC, cum_prop_var)) +
geom_point() +
geom_line() +
geom_vline(xintercept = 6.5, col = "firebrick") +
scale_x_continuous(breaks = 1:ncol(wine_pca$x)) +
labs(y = "Cumulative proportion of variance")
gridExtra::grid.arrange(p1, p2, ncol = 2)
```
Selecting first 6 PC's, keeping 85% of the variance.
```{r}
k <- 6
pca_X <- wine_pca$x[, 1:k]
head(pca_X)
```
```{r}
mod2 <- Mclust(pca_X)
summary(mod2)
summary(mod2$BIC)
```
```{r}
table(class, mod2$classification)
## Annotate clusters
mc_pca_clusters <- factor(mod2$classification)
```
```{r, fig.asp=0.8}
plot(mod2, what = "BIC", ylim = range(mod2$BIC[, -(1:2)], na.rm = TRUE),
legendArgs = list(x = "bottomleft")
)
```
```{r, fig.asp=1}
df <- as.data.frame(pca_X)
df$clusters <- mc_pca_clusters
## Using ggscatmat() from GGally package to plot all pairwise PCs
ggscatmat(df, columns = 1:k, color = "clusters") +
theme(legend.position = "bottom", aspect.ratio = 0.6) +
scale_color_brewer(palette = "Set2", name = "mclust-PCA clusters")
```
*Note: you can ignore the upper-right panels of this figure. These give the correlations between each pair of variables (PC's here) for each group, but are not relevant here.*
</details>
```{r, child="_session-info.Rmd"}
```
[wine]: https://rdrr.io/cran/gclus/man/wine.html
[scrucca2016]: https://svn.r-project.org/Rjournal/html/archive/2016/RJ-2016-021/RJ-2016-021.pdf
[mclust]: https://rdrr.io/cran/mclust/man/Mclust.html