-
Notifications
You must be signed in to change notification settings - Fork 11
/
19-GPA_blood_differential_expression.Rmd
191 lines (154 loc) · 5.54 KB
/
19-GPA_blood_differential_expression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
---
title: "GPA blood differential expression"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
Here, we'll perform differential expression (of LV) analyses for the PBMC data
from Cheadle C, Berger AE, Andrade F, et al.
[Transcription of PR3 and Related Myelopoiesis Genes in Peripheral Blood Mononuclear Cells in Active Wegener’s Granulomatosis](https://dx.doi.org/10.1002/art.27398).
_Arthritis & Rheumatism_, 2010. doi: 10.1002/art.27398.
We'll be comparing the healthy controls, the patients with a GPA signature
(GPA-positive) and the patients without a GPA signature (GPA-negative).
We'll first need to project this data into the recount2 PLIER model latent space.
## Functions and directory set up
```{r}
`%>%` <- dplyr::`%>%`
source(file.path("util", "test_LV_differences.R"))
source(file.path("util", "plier_util.R"))
```
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "19")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "19")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Read in data
The GPA blood file is in the form of a GEO series matrix, which means the
sample metadata is in what are essentially comment fields at the beginning of
the file.
Thus, some ugliness/wrangling will be necessary.
### Expression data
```{r}
series.mat.file <- file.path("data", "expression_data",
"GSE18885_series_matrix.txt")
# expression matrix
gpa.ma.data <-
readr::read_delim(series.mat.file,
delim = "\t",
comment = "!",
col_names = TRUE,
skip = 1)
# this information about GPL6140 was downloaded on 15 Sept 2017, and reportedly
# last updated on 18 Jan 2013
gpl.from.geo <- readr::read_tsv(file.path("data", "expression_data",
"GPL6104-11576.txt"), comment = "#")
# use the gene symbols from the GPL file -- PLIER uses gene symbols
gpa.ma.annot <- gpl.from.geo %>%
dplyr::select(c(ID, ILMN_Gene)) %>%
dplyr::left_join(y = gpa.ma.data,
by = c("ID" = "ID_REF"))
colnames(gpa.ma.annot)[2] <- "Gene"
```
```{r}
# are there any duplicates in the gene symbol column?
sum(duplicated(gpa.ma.annot$Gene))
```
```{r}
# aggregate & write to file
agg.ma.df <- PrepExpressionDF(dplyr::select(gpa.ma.annot, -ID))
readr::write_tsv(agg.ma.df,
path = file.path("data", "expression_data",
"GSE18885_annotated_mean.pcl"))
# get expression matrix
exprs.mat <- as.matrix(dplyr::select(agg.ma.df, -Gene))
rownames(exprs.mat) <- agg.ma.df$Gene
```
### Metadata
As mentioned above, metadata is also extracted from the series matrix file.
```{r}
# get more information from the series matrix file
# line 41 has the sample names
conn <- file(series.mat.file)
open(conn)
smpl.name <- read.table(conn, skip = 41, nrow = 1)
close(conn)
# line 51 contains the WG signature status -- I'll call this GPA signature
# GPA-positive, GPA-negative to be in line with the current disease name
conn <- file(series.mat.file)
open(conn)
gpa.sig.status <- read.table(conn, skip = 51, nrow = 1)
close(conn)
# this is the GEO sample accession information
conn <- file(series.mat.file)
open(conn)
gsm.info <- read.table(conn, skip = 77, nrow = 1)
close(conn)
# get those lines into data.frame format
smpl.info.df <- as.data.frame(t(dplyr::bind_rows(gsm.info,
smpl.name,
gpa.sig.status))[-1, ])
colnames(smpl.info.df) <- c("Sample", "Name", "GPA_signature")
# remove 1 row data.frames read in from the series matrix
rm(conn, gsm.info, smpl.name, gpa.sig.status)
```
```{r}
# extract fraction ("cell type(s)") from sample names
smpl.info.df <- smpl.info.df %>%
dplyr::mutate(Cell_type = sub(".*\\ ", "", Name))
# recode the GPA signature bit
smpl.info.df <-
smpl.info.df %>%
dplyr::mutate(GPA_signature = dplyr::case_when(
grepl("control", GPA_signature) ~ "Control",
grepl("WG Sig -", GPA_signature) ~ "GPAneg",
grepl("WG Sig +", GPA_signature) ~ "GPApos"
))
# write to file
readr::write_tsv(smpl.info.df,
path = file.path("data", "sample_info",
"GSE18885_sample_info.tsv"))
```
```{r}
# remove data.frames that won't be needed
rm(agg.ma.df, gpa.ma.annot, gpa.ma.data, gpl.from.geo)
```
## Apply recount2 model
```{r}
# load model itself
recount.plier <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
# project
recount.b <- GetNewDataB(exprs.mat = as.matrix(exprs.mat),
plier.model = recount.plier)
```
```{r}
# save B matrix to file
b.file <- file.path(results.dir, "GPA_blood_recount2_B.RDS")
saveRDS(recount.b, file = b.file)
```
## Differential expression
### PBMC only
We're going to restrict our analyses to the peripheral blood mononuclear cell
(PBMC) fraction.
```{r}
# sample information
pbmc.df <- smpl.info.df %>%
dplyr::filter(Cell_type == "PBMC")
```
```{r}
# b matrix (latent variables)
pbmc.b <- recount.b[, pbmc.df$Sample]
```
### Differential expression analysis itself
```{r}
LVTestWrapper(b.matrix = pbmc.b,
sample.info.df = pbmc.df,
phenotype.col = "GPA_signature",
file.lead = "GPA_blood_recount2_model",
plot.dir = plot.dir,
results.dir = results.dir)
```