-
Notifications
You must be signed in to change notification settings - Fork 1
/
fsd_esr.Rmd
303 lines (231 loc) · 10.1 KB
/
fsd_esr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
---
title: "Feature Space Diagrams in R"
subtitle: "Testing the methods on Epileptic Seizure Recognition Dataset"
author: "John Erickson"
date: "`r Sys.Date()`"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
This is a first effort to implement "feature space diagrams" in R, inspired by https://towardsdatascience.com/escape-the-correlation-matrix-into-feature-space-4d71c51f25e5
```{r,libraries, results=FALSE, message=FALSE, warning=FALSE}
#library(devtools)
#install_github("vqv/ggbiplot")
library(ggbiplot)
library(lessR)
library(Hmisc)
library(corrplot)
library(gplots)
library(igraph)
library(tidyverse)
library(ggrepel)
library(shiny)
library(plotly)
library(ggrepel)
```
## Workflow
Our workflow is generally as described in the original post:
* Generate the correlation matrix
* Take the absolute value of correlation matrix and subtract each value from 1. The result is a distance matrix.
* Use PCA to reduce our NxN matrix to Nx2.
* Plot each feature’s location using the two principal components.
* Use Feature Agglomeration to generate feature clusters.
* Color each feature by its cluster.
* Draw lines to represent relationships of at least r = 0.7 (or user's choosing)
## Data Load
```{r data_load}
# TODO: User uploads data
# Load data
#boston <- read.csv("boston.csv", header = TRUE, fileEncoding="latin1")
#WDBC <- read.csv("WDBC.csv", header = TRUE, fileEncoding="latin1")
#esr <- read.csv("EpilepticSeizureRecognition.csv", header = TRUE, fileEncoding="latin1")
#esr <- read.csv("https://raw.githubusercontent.com/MattJBritton/ExquisiteCorrpse/master/data/epilepsy.csv",header = TRUE, fileEncoding="latin1")
esr <- readRDS("esr.Rds")
# Ensure we have a matrix
#mydata <- na.omit(as.matrix(boston))
mydata <- na.omit(as.matrix(esr[,2:179]))
#mydata <- as.matrix(mtcars[,c(1:7,10,11)])
```
## Initial Correlation Matrix
```{r correlation}
# Simple correlation matrix calculation
# TODO: user chooses method
#mydata.cor <- cor(mydata, method = c("pearson"), use = "complete.obs")
#mydata.cor <- cor(mydata, method = c("kendall"), use = "complete.obs")
#mydata.cor <- cor(mydata, method = c("spearman"), use = "complete.obs")
mydata.cor <- cor(mydata, method = c("spearman"))
mydata.cor <- round(mydata.cor, 2)
# # Optional
# corrplot(mydata.cor)
#
# palette <- colorRampPalette(c("green", "white", "red")) (20)
# heatmap.2(x = mydata.cor, col = palette, symm = TRUE)
# Reorder correlation matrix based on: https://rdrr.io/cran/lessR/man/corReorder.html
mydata.cor.ro <- corReorder(mydata.cor)
```
## Distance Matrix Calculation
Correlation can often be used as a distance metric for hierarchical clustering; see [e.g. this discussion](https://stats.stackexchange.com/questions/165194/using-correlation-as-distance-metric-for-hierarchical-clustering).
We're using the `dist()` function to provide a number of different distance options (default is _euclidean_ distance).
```{r distance}
# # Simplistic distance matrix (Absolute value et.al.)
# mydata.cor.ro.1 <- abs(mydata.cor.ro) - 1
# Distance Matrix via: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/dist.html
mydata.dist <- dist(mydata.cor.ro, upper = TRUE, method = "manhattan")
mydata.dist.sym <- as.matrix(mydata.dist )
```
## Hiearchical/Agglomerative Clustering (NEW)
```{r}
# See eg:
# * https://metudatascience.github.io/datascience/clustering.html
# * https://uc-r.github.io/hc_clustering
# Distance calculated above
mydata.hclust <- hclust(mydata.dist)
plot(mydata.hclust)
# NOTE! These clusters are in numerical order, not feature order!
mydata.hclust.groups <- cutree(mydata.hclust, k=40) # Usually 3 is used for mtcars demos
rect.hclust(mydata.hclust, k=40, border="red")
```
## PCA on Distance Matrix
```{r PCA}
# PCA
#mydata.dist.sym.pca <- prcomp(mydata.dist.sym,scale=TRUE)
mydata.dist.sym.pca <- prcomp(mydata.dist.sym) # Un-scaled gives results closer to source article
# Plottable version:
mydata.dist.sym.pca.plot <- as.data.frame(mydata.dist.sym.pca$x[,1:2])
```
## Finding the Clusters
Write the cluster assignments and the feature names into the PCA data frame.
```{r clusters}
# Pull out our clusters (calculated)
mydata.dist.sym.pca.plot$cluster <- as.factor(mydata.hclust.groups)
mydata.dist.sym.pca.plot$name <- rownames(mydata.cor.ro)
```
## Finding the Graph!
This is the hard part...
```{r network}
# Determine connectivity!
# This selects the nodes (features) we want to connect
threshold <- 5 # was 5.5
selector <- ((abs(mydata.dist.sym) <= threshold ) * 1) # Filter out lines greater than threshold distance
for (i in 1:length(mydata.dist.sym.pca.plot$cluster)) {selector[,i] <- selector[,i] * as.numeric(mydata.dist.sym.pca.plot$cluster)}
diag(selector) <- 0
filtered <- abs(mydata.dist.sym)
filtered[ filtered <= threshold ] <- 0
# Get the function for re-scaling the line weights
# Line thicknesses will automatically range from 1-10
range <- c(1,10)
domain <- c(min(filtered[filtered > 0]),max(filtered))
line.lm <- lm(range~domain)
adjust <- function(x){
# stupid lm trick to scale
line.lm$coefficients[[2]] * x + line.lm$coefficients[[1]]
}
# Repeat selector matrix, this time for strength of relationships
# Select the nodes (features) we want to connect
thickness <- ((abs(mydata.dist.sym) <= threshold ) * 1) # Filter out lines greater than threshold distance
diag(thickness) <- 0 # remove diagonal
# Use distances as thicknesses for the arcs we're keeping
for (i in 1:length(mydata.dist.sym.pca.plot$cluster)) {
for (j in 1:length(thickness[,i])) {
if (thickness[j,i] != 0 ) {
# thickness[j,i] <- adjust(thickness[j,i] * abs(mydata.dist.sym[j,i]))
thickness[j,i] <- thickness[j,i] * abs(mydata.dist.sym[j,i])
} else { thickness[j,i] <- 0}
}
}
diag(thickness) <- 0
# Switching to networks
# Create igraph structure
network <- graph_from_adjacency_matrix(selector, weighted = TRUE)
network.t <- graph_from_adjacency_matrix(thickness, weighted = TRUE)
# Gets us our edges!
mydata.edges <- get.data.frame(network) %>% # this is where "weight" is introduced
dplyr::rename(Cluster = weight)
mydata.edges.t <- get.data.frame(network.t) %>% # this is where "weight" is introduced
dplyr::rename(thickness = weight) %>%
# mutate(thickness = as.integer(thickness))
mutate(thickness = thickness)
# replace `from` with X1 and Y1, and `to` with X2 and Y2
# These will be our line segments!
mydata.segments <- mydata.edges %>%
left_join(mydata.dist.sym.pca.plot, by=c("from"="name")) %>%
dplyr::select(-cluster) %>%
mutate(X1=PC1, Y1=PC2) %>%
dplyr::select(-PC1, -PC2)
mydata.segments <- mydata.segments %>%
left_join(mydata.dist.sym.pca.plot, by=c("to"="name")) %>%
dplyr::select(-cluster) %>%
mutate(X2=PC1, Y2=PC2) %>%
dplyr::select(-PC1, -PC2)
mydata.segments <- mydata.segments %>%
dplyr::rename(cluster = Cluster)
mydata.dist.sym.pca.plot$cluster <- as.factor(mydata.dist.sym.pca.plot$cluster)
# NEW: cluster used for coloring
mydata.segments$cluster <- as.factor(mydata.segments$cluster)
#mydata.segments$thickness <- as.integer(mydata.edges.t$thickness)
mydata.segments$thickness <- mydata.edges.t$thickness
```
## Building the Plot
```{r,plotting,fig.width=10,fig.height=10}
# Adding line segments to PCA plot
p <- ggplot(mydata.dist.sym.pca.plot, aes(x=PC1, y=PC2)) +
geom_point(aes(color=cluster, size=6), show.legend = FALSE) +
xlim(-60,60) +
ylim(-55,60) +
geom_curve(aes(x=X1, y=Y1, xend=X2, yend=Y2, color=cluster, size=thickness, alpha=0.4), curvature=0.2, data=mydata.segments) +
guides(alpha = FALSE) +
scale_size(range = c(0.1, 2), guide = guide_none()) +
geom_text_repel(aes(label=name),hjust=1, vjust=0, size=3, max.overlaps=30) +
# labs(title="Feature Space Diagram for the Boston Housing data set") +
# labs(title="Feature Space Diagram for the Wisconsin Breast Cancer data set") +
labs(title="Feature Space Diagram for the Epileptic Seizure Recognition Dataset") +
# labs(title="Feature Space Diagram for the classic `mtcars` data set") +
labs(subtitle="Using Spearman correlation") +
labs(caption = "See: https://github.com/TheRensselaerIDEA/FeatureSpaceDiagrams")
p
ggsave(filename="fsd_esr.png", plot = p)
```
## OPTIONAL: Biplot on the original dataset
Usually we review the feature space using a PCA biplot. Let's see how it compares!
```{r ,fig.width=10,fig.height=10, eval=FALSE, echo=FALSE}
# Calculate the PCA on the full dataset
mydata.pca <- prcomp(na.omit(mydata), center = TRUE, scale. = TRUE)
ggbiplot(mydata.pca) +
xlim(-2.5,2.5) +
ylim(-2.5,2.5)
# For completeness, create the scree plot...
var_explained.df <- data.frame(PC = paste0("PC",1:length(mydata.pca$sdev)),
var_explained=(mydata.pca$sdev)^2/sum((mydata.pca$sdev)^2))
# Force the x-axis order to be what we want
var_explained.df$PC <- factor(var_explained.df$PC, levels=unique(var_explained.df$PC))
# Create the plot
var_explained.df %>%
ggplot(aes(x=PC,y=var_explained, group=1))+
geom_point(size=4)+
geom_line()+
labs(title="Scree plot: PCA on scaled data")
```
## OPTIONAL: Interactive Feature Space Diagram!
Make the feature space diagram interactive using `ggplotly()`...
```{r ,fig.width=10,fig.height=10, eval=FALSE, echo=FALSE}
# We need to re-build the plot, since several geometries aren't yet implemented in ggrepl or ggplotly :(
p2 <- ggplot(mydata.dist.sym.pca.plot, aes(x=PC1, y=PC2)) +
geom_point(aes(color=cluster, size=6), show.legend = FALSE) +
xlim(-60,60) +
ylim(-55,60) +
geom_segment(aes(x=X1, y=Y1, xend=X2, yend=Y2, color=cluster, size=thickness, alpha=0.4), data=mydata.segments) +
guides(alpha = FALSE) +
scale_size(range = c(0.1, 2), guide = guide_none()) +
geom_text(aes(label=name),hjust=1, vjust=0, size=3) +
# labs(title="Feature Space Diagram for the Boston Housing data set") +
# labs(title="Feature Space Diagram for the Wisconsin Breast Cancer data set") +
labs(title="Feature Space Diagram for the Epileptic Seizure Recognition Dataset") +
# labs(title="Feature Space Diagram for the classic `mtcars` data set") +
labs(subtitle="Using Spearman correlation") +
labs(caption = "See: https://github.com/TheRensselaerIDEA/FeatureSpaceDiagrams")
ggplotly(p2)
```