-
Notifications
You must be signed in to change notification settings - Fork 11
/
22-GPA_blood_top_LVs.Rmd
358 lines (285 loc) · 10.8 KB
/
22-GPA_blood_top_LVs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: "Top differentially expressed LVs in GPA blood"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
Explore the top differentially expressed LVs in the GPA blood dataset (PBMCs).
Are they pathways we could have captured with a PLIER model fit to this dataset?
## Functions and directory setup
```{r}
`%>%` <- dplyr::`%>%`
source(file.path("util", "plier_util.R"))
```
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "22")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "22")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Load DLVE results
```{r}
limma.file <- file.path("results", "19",
"GPA_blood_recount2_model_LV_limma_results.tsv")
limma.df <- readr::read_tsv(limma.file)
```
Let's take a look at the top differentially expressed latent variables.
We'll sort by FDR.
```{r}
limma.df %>%
dplyr::filter(adj.P.Val < 0.05) %>%
dplyr::arrange(adj.P.Val)
```
Let's explore `LV 599` a little further.
```{r}
recount.b.file <- file.path("results", "19",
"GPA_blood_recount2_model_B_long_sample_info.tsv")
recount.b.df <- readr::read_tsv(recount.b.file) %>%
dplyr::mutate(GPA_signature =
dplyr::case_when(
GPA_signature == "GPApos" ~ "GPA-positive",
GPA_signature == "GPAneg" ~ "GPA-negative",
TRUE ~ "Control"
))
```
```{r}
recount.b.df %>%
dplyr::filter(LV == "599,DMAP_ERY2") %>%
ggplot2::ggplot(ggplot2::aes(x = GPA_signature, y = Value)) +
ggplot2::geom_boxplot() +
ggplot2::geom_jitter(position = ggplot2::position_jitter(0.2)) +
ggplot2::theme_bw() +
ggplot2::labs(x = "GPA signature", y = "LV 599",
title = "Cheadle, et al. top LV") +
ggplot2::theme(legend.position = "none",
text = ggplot2::element_text(size = 15),
plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
```
```{r}
plot.file <- file.path(plot.dir, "recount2_LV599_boxplot.pdf")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
## Load recount2 PLIER model
```{r}
recount.plier <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
```
### What are the pathways associated with `LV 599`?
```{r}
recount.plier$summary %>%
dplyr::filter(`LV index` == 599)
```
### What genes contribute to this LV (e.g., have the highest loadings)?
Let's look at the top 50.
```{r}
head(sort(recount.plier$Z[, 599], decreasing = TRUE), 50)
```
Some of these are major or minor autoantigens in ANCA-associated vasculitis.
This is consistent with the original publication
([Cheadle, et al. _A&R._ 2010.](https://dx.doi.org/10.1002/art.27398)):
the authors found that "neutrophil granule constituent genes" were highly
expressed in the PBMC fraction from patients with GPA that could not be
attributed to contamination with mature granulocytes.
#### Plot highlighting autoantigens
We'll use the list of autoantigens in Cheadle, et al.
```{r}
autoantigens <- c("MPO", "ELANE", "BPI", "CTSG", "LCN2", "AZU1", "PRTN3")
```
```{r}
# get into appropriate data.frame for bar plot
top.z.df <- as.data.frame(sort(recount.plier$Z[, 599],
decreasing = TRUE)[1:50])
top.z.df <- tibble::rownames_to_column(top.z.df, var = "Gene")
colnames(top.z.df)[2] <- "Z"
# add in autoantigen information
top.z.df <- top.z.df %>%
dplyr::mutate(Autoantigen =
dplyr::case_when(
Gene %in% autoantigens ~ "Yes",
TRUE ~ "No"
))
# reorder for plotting
top.z.df$Gene <- factor(top.z.df$Gene,
levels = top.z.df$Gene[50:1])
```
```{r}
p <- ggplot2::ggplot(top.z.df,
ggplot2::aes(x = Gene, y = Z, fill = Autoantigen)) +
ggplot2::geom_bar(stat = "identity") +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("black", "red")) +
ggplot2::coord_flip() +
ggplot2::ggtitle("recount2 LV 599 Loadings") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold"))
p + ggplot2::theme(axis.text = ggplot2::element_text(size = 6))
```
```{r}
plot.file <- file.path(plot.dir, "recount2_LV599_top_Z_bar.pdf")
ggplot2::ggsave(plot.file,
plot = p +
ggplot2::theme(text = ggplot2::element_text(size = 15)),
width = 6, height = 8)
```
## GPA Blood-specific PLIER model
Would we have detected this "autoantigen signature" using a PLIER model
specifically trained on this GPA blood data?
```{r}
# remove objects we won't need for this next set of analyses
rm(limma.df, recount.b.df, limma.file, plot.file, recount.b.file,
top.z.df)
```
### Read in expression data
```{r}
exprs.file <- file.path("data", "expression_data",
"GSE18885_annotated_mean.pcl")
agg.ma.df <- readr::read_tsv(exprs.file)
# get expression matrix
exprs.mat <- as.matrix(dplyr::select(agg.ma.df, -Gene))
rownames(exprs.mat) <- agg.ma.df$Gene
```
### Train PLIER model
```{r}
gpa.plier <- PLIERNewData(exprs.mat)
```
```{r}
plier.file <- file.path(results.dir, "GPA_blood_PLIER_model.RDS")
saveRDS(gpa.plier, plier.file)
```
### Explore `U` matrix
Taking a look at the `U` (prior information coefficient) matrix will tell us
what kind of pathways were identified with this model.
```{r}
PLIER::plotU(gpa.plier, fontsize_row = 5, fontsize_col = 7)
```
```{r}
pdf(file.path(plot.dir, "GPA_blood_U_plot.pdf"))
PLIER::plotU(gpa.plier, fontsize_row = 6, fontsize_col = 7)
dev.off()
```
`LV13` and `LV20` look like the most likely candidates, but we'll look across
all 30 LVs.
### Does an "autoantigen LV" exist in the GPA blood PLIER model?
```{r}
GetTopNGenes <- function(z.vector, num.genes = 50){
names(head(sort(z.vector, decreasing = TRUE), num.genes))
}
top.genes <- apply(gpa.plier$Z, 2, GetTopNGenes)
ag.logical <- apply(top.genes, 2, function(x) any(x %in% autoantigens))
which(ag.logical)
```
```{r}
head(sort(gpa.plier$Z[, 8], decreasing = TRUE), 50)
```
```{r}
# get into appropriate data.frame for bar plot
top.z.df <- as.data.frame(sort(gpa.plier$Z[, 8],
decreasing = TRUE)[1:50])
top.z.df <- tibble::rownames_to_column(top.z.df, var = "Gene")
colnames(top.z.df)[2] <- "Z"
# add in autoantigen information
top.z.df <- top.z.df %>%
dplyr::mutate(Autoantigen =
dplyr::case_when(
Gene %in% autoantigens ~ "Yes",
TRUE ~ "No"
))
# reorder for plotting
top.z.df$Gene <- factor(top.z.df$Gene,
levels = top.z.df$Gene[50:1])
```
```{r}
p <- ggplot2::ggplot(top.z.df,
ggplot2::aes(x = Gene, y = Z, fill = Autoantigen)) +
ggplot2::geom_bar(stat = "identity") +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("black", "red")) +
ggplot2::coord_flip() +
ggplot2::ggtitle("GPA Blood LV 8 Loadings") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
p + ggplot2::theme(axis.text = ggplot2::element_text(size = 6))
```
```{r}
plot.file <- file.path(plot.dir, "GPA_blood_LV8_top_Z_bar.pdf")
ggplot2::ggsave(plot.file,
plot = p +
ggplot2::theme(text = ggplot2::element_text(size = 15)),
width = 6, height = 8)
```
`GPA blood LV 8` may be _most_ similar to another recount2 LV like
`recount2 LV 68`, which is also differentially expressed in this dataset.
Let's take a quick look at the top 50 genes for `recount2 LV 68`.
```{r}
head(sort(recount.plier$Z[, 68], decreasing = TRUE), 50)
```
_By eye_, this looks more similar to `GPA blood LV 8`, but let's do a more
formal analysis.
### Compare to recount2 LVs
```{r}
# get Z matrices from both models
recount.z <- as.data.frame(recount.plier$Z)
gpa.z <- as.data.frame(gpa.plier$Z)
# we'll need to add the gene identifiers (symbols in this case) to a column
# rather than as rownames -- this will facilitate joining
recount.z <- tibble::rownames_to_column(recount.z, var = "Gene")
colnames(recount.z)[2:ncol(recount.z)] <- paste0("recountLV",
1:(ncol(recount.z) - 1))
gpa.z <- tibble::rownames_to_column(gpa.z, var = "Gene")
colnames(gpa.z)[2:ncol(gpa.z)] <- paste0("gpaLV", 1:(ncol(gpa.z) - 1))
# join -- only genes present in both models
z.df <- dplyr::inner_join(recount.z, gpa.z, by = "Gene")
# need matrix to calculate correlation
z.matrix <- as.matrix(z.df[, 2:ncol(z.df)])
rownames(z.matrix) <- z.df$Gene
```
```{r}
# calculate pearson correlation between LVs -- can we map between models using
# this distance metric?
cor.z.mat <- cor(z.matrix, method = "pearson")
# set diagonal to 0 to help find max correlation between LVs
diag(cor.z.mat) <- 0
# indices for each model
gpa.indx <- grep("gpa", rownames(cor.z.mat))
recount.indx <- grep("recount", rownames(cor.z.mat))
# pertinent indices
impt.cor.z.mat <- cor.z.mat[recount.indx, gpa.indx]
# for each GPA blood model LV, want the highest correlated LV from recount
mapping.df <- reshape2::melt(impt.cor.z.mat,
varnames = c("recount_LV", "GPA_blood_LV"),
value.name = "pearson_Z") %>%
dplyr::group_by(GPA_blood_LV) %>%
dplyr::top_n(1, pearson_Z)
mapping.df
```
It looks like `GPA blood LV 8` does in fact map to `recount2 LV 68` as
determined by taking the Pearson correlation between loadings.
#### Heatmap of mapping
We'll select the top mapping LVs from the recount2 model as well as
`recount2 LV 599` from `impt.cor.z.mat` for plotting as a heatmap.
```{r}
plot.index <- which(rownames(impt.cor.z.mat) %in%
c(as.character(mapping.df$recount_LV), "recountLV599"))
heatmap.mat <- impt.cor.z.mat[plot.index, ]
pheatmap::pheatmap(heatmap.mat, main = "Z matrix mapping")
```
```{r}
plot.file <- file.path(plot.dir, "selected_cor_z_mat_heatmap.pdf")
pdf(plot.file)
pheatmap::pheatmap(heatmap.mat, main = "Z matrix mapping")
dev.off()
```
## Summary
Despite the lack of AAV expression data in the training compendium, the
recount2 PLIER model learns a latent variable highly relevant to AAV.
Specifically, `recount2 LV 599` captures a granulocyte progenitor signature.
The top genes (as determined by the loadings, `Z`) are major (_PRTN3_, _MPO_)
and minor (e.g., _AZU1_, _BPI_) autoantigens in ANCA-associated vasculitis.
We find an LV in the model specifically trained on the GPA blood dataset that
also captures the activity autoantigens (`GPA blood LV 8`).
However, the major autoantigens are not as highly ranked and this maps to
another LV in the recount2 model, `recount2 LV 68`.
This suggests that the recount2 model (e.g., the multiPLIER approach) can
identify additional, highly relevant LVs.