Microarray Expression Data Analysis: Effect of Methylglyoxal
- - - - - - - - -Introduction
---Methylglyoxal (MG), a highly reactive dicarbonyl compound, is inevitably formed as a by-product of glycolysis. Methylglyoxal is a major cell-permeant precursor of advanced glycation end-products (AGEs), which are associated with several pathologies including diabetes, aging and neurodegenerative diseases. [Source]
-
--Excessive dicarbonyl generation leads to AGE production, activates inflammatory processes, increases oxidative stress and impairs glucose tolerance, all of which cause metabolic and vascular changes in different tissues. MG is recognized as a trigger for the development and progression of diabetic complications. MG accumulation in adipocytes after exposure to MG causes structural and functional changes in adipose tissue independently of obesity; these alterations may precede the onset of metabolic syndrome and type 2 diabetes (T2D). Indirectly, MG can disturb and impair different signaling pathways and also trigger epigenetic changes . However, the exact mechanism underlying the pathophysiological role of MG and glyoxalase-1 in adipose tissue is not yet fully understood. [Source]
-
Introduction to experiment
-Microarray Affymetrix data for the experiment were obtained here: [Source]
-The main purpose of this experiment was to investigate the effect of methylglyoxal on transcriptome and metabolic changes in visceral adipose tissue using a prediabetic rat. All experimental procedures were carried out using 13 rats (6 control rats and 7 MG-treated rats). In the MG-treated group of rats, MG was administered intragastrically three times a week at a dose of 0.5 mg/kg bodyweight for four weeks. In the control group, water was administered intragastrically over the same period. [Source]
-Libraries
-library(here)
-library(here)
-library(conflicted)
-library(emo)
-library(tidyverse)
-library(glue)
-library(oligo)
-library(limma)
-library(ReportingTools)
-library(lattice)
-library(sva)
-library(ragene21sttranscriptcluster.db)
-source(here("age_library.R"))
-library(DESeq2)
-conflict_prefer("rowRanges", "MatrixGenerics")
-library(magrittr)
-library(stringr)
-library(glue)
-library(ggplot2)
-library(goseq)
-library(clusterProfiler)
-library("biomaRt")
Input file
-<- here("madata")
- DATA_DIR <- here(DATA_DIR, "E-MTAB-9013.sdrf.txt") EXPERIMENT_DATA_DIR
Data preprocessing
-Now, we’ll read the data:
-<- readr::read_delim(EXPERIMENT_DATA_DIR, delim = "\t", progress = FALSE) %>%dplyr::select(1, 27, 25) %>%magrittr::set_colnames(c("sample_name", "sample_type", "cel_file"))
- data == "methyl glyoxal"] <- "methylglyoxal"
- data[data == "none"] <- "control"
- data[data data
sample_name <chr> | sample_type <chr> | cel_file <chr> | ||
---|---|---|---|---|
CTL1 | control | CTL1.cel | ||
CTL2 | control | CTL2.cel | ||
CTL3 | control | CTL3.cel | ||
CTL4 | control | CTL4.cel | ||
CTL5 | control | CTL5.cel | ||
CTL6 | control | CTL6.cel | ||
EXP1 | methylglyoxal | EXP1.cel | ||
EXP2 | methylglyoxal | EXP2.cel | ||
EXP3 | methylglyoxal | EXP3.cel | ||
EXP4 | methylglyoxal | EXP4.cel |
We’ll remodel our dataframe:
-<- as.data.frame(data) %>%magrittr::set_rownames(.$sample_name)
- data data
sample_name <chr> | sample_type <chr> | cel_file <chr> | ||
---|---|---|---|---|
CTL1 | CTL1 | control | CTL1.cel | |
CTL2 | CTL2 | control | CTL2.cel | |
CTL3 | CTL3 | control | CTL3.cel | |
CTL4 | CTL4 | control | CTL4.cel | |
CTL5 | CTL5 | control | CTL5.cel | |
CTL6 | CTL6 | control | CTL6.cel | |
EXP1 | EXP1 | methylglyoxal | EXP1.cel | |
EXP2 | EXP2 | methylglyoxal | EXP2.cel | |
EXP3 | EXP3 | methylglyoxal | EXP3.cel | |
EXP4 | EXP4 | methylglyoxal | EXP4.cel |
Load the data from the madata file:
-<- glue("madata/{data$cel_file}")
- cel_files <- read.celfiles(cel_files) raw_data
## Reading in : madata/CTL1.cel
-## Reading in : madata/CTL2.cel
-## Reading in : madata/CTL3.cel
-## Reading in : madata/CTL4.cel
-## Reading in : madata/CTL5.cel
-## Reading in : madata/CTL6.cel
-## Reading in : madata/EXP1.cel
-## Reading in : madata/EXP2.cel
-## Reading in : madata/EXP3.cel
-## Reading in : madata/EXP4.cel
-## Reading in : madata/EXP5.cel
-## Reading in : madata/EXP6.cel
-## Reading in : madata/EXP7.cel
-Let’s prepare metadata for AnnotatedDataFrame:
-sampleNames(raw_data) <- data$sample_name
-<- data.frame(labelName = colnames(data),labelDescription = c("", "Used drug.", "")
- metadata
- ) metadata
labelName <chr> | labelDescription <chr> | |||
---|---|---|---|---|
sample_name | ||||
sample_type | Used drug. | |||
cel_file |
<- AnnotatedDataFrame(data = data, varMetadata = metadata)) (data
## An object of class 'AnnotatedDataFrame'
-## rowNames: CTL1 CTL2 ... EXP7 (13 total)
-## varLabels: sample_name sample_type cel_file
-## varMetadata: labelName labelDescription
-We’ve created AnnotatedDataFrame:
-phenoData(raw_data) <- Biobase::combine(phenoData(raw_data), data)
-phenoData(raw_data)
## An object of class 'AnnotatedDataFrame'
-## rowNames: CTL1 CTL2 ... EXP7 (13 total)
-## varLabels: index sample_name sample_type cel_file
-## varMetadata: labelDescription channel labelName
-Technical quality control
-We’ll now look at the technical quality control and decide if the experiment was carried out properly.
-Pseudo-image
-We check if the image contains something like “random noise”.
-image(raw_data,which = 10, transfo = log2)
Image seems to be fine.
-MA plot
-Now, we’ll create MA plot for three samples:
-MAplot(raw_data[, 1:3], pairs = TRUE)
Boxplots
-We’ll plot the boxplots to see whether they are similiar or not:
-boxplot(raw_data, target = "core")
These boxplots seem to be very similiar.
-Probe level model
-Another technical quality control step consists of fitting probe level model using a robust regression.
-<- fitProbeLevelModel(raw_data) fit_plm
## Background correcting... OK
-## Normalizing... OK
-## Summarizing... OK
-## Extracting...
-## Estimates... OK
-## StdErrors... OK
-## Weights..... OK
-## Residuals... OK
-## Scale....... OK
-image(fit_plm, which = 10)
image(fit_plm, which = 10, type = "sign.residuals")
Fortunately, we see a random noise.
-Relative log expression (RLE)
-Boxplots should be located around 0.
-RLE(fit_plm)
Luckily, we see that boxplots are located around value 0.
-Normalized unscaled standard errors (NUSE)
-Boxplots should be located around value 1.
-NUSE(fit_plm)
We can see that boxplots are located around value 1.
-Normalization and probe annotation
-We normalize the data:
-<- rma(raw_data) norm_data
## Background correcting
-## Normalizing
-## Calculating Expression
-Next step is to annotate the probe IDs to gene IDs. We’ll look at the featureNames of our norm_data as we use them as the keys for the annotation.
-head(featureNames(norm_data))
## [1] "17600001" "17600003" "17600005" "17600007" "17600009" "17600011"
-We’ll look at the database corresponding to our chip: RaGene-2_1-st, which is containing the right rat geneIDs to our probe IDs.
-::head(keys(ragene21sttranscriptcluster.db)) AnnotationDbi
## [1] "17600001" "17600003" "17600005" "17600007" "17600009" "17600011"
-Now, we’ll load the database and annotate the probeIDs to geneIDs.
-<- AnnotationDbi::select(ragene21sttranscriptcluster.db, columns = c("PROBEID", "ENSEMBL", "SYMBOL", "GENENAME", "ENTREZID"), keys = featureNames(norm_data), keytype = "PROBEID")
- feature_data feature_data
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | GENENAME <chr> | ENTREZID <chr> |
---|---|---|---|---|
17600001 | NA | NA | NA | NA |
17600003 | NA | NA | NA | NA |
17600005 | NA | NA | NA | NA |
17600007 | NA | NA | NA | NA |
17600009 | NA | NA | NA | NA |
17600011 | NA | NA | NA | NA |
17600013 | NA | NA | NA | NA |
17600015 | NA | NA | NA | NA |
17600017 | NA | NA | NA | NA |
17600019 | NA | NA | NA | NA |
table(is.na(feature_data$ENSEMBL))
##
-## FALSE TRUE
-## 20562 17563
-Unfortunately, as you can see we managed to annotate something around 50% of our rat genes.
-So, we’ll look at another solution and try database from Biomart. Here, we annotate the probes from Biomart:
-= useMart(biomart= "ensembl",dataset= "rnorvegicus_gene_ensembl")
- ensembl = c("affy_ragene_2_1_st_v1", "ensembl_gene_id", "uniprot_gn_symbol", "external_gene_name", "entrezgene_trans_name")
- affy_ensembl<- getBM(attributes= affy_ensembl, mart= ensembl, values = "*", uniqueRows=T)
- ragenedf colnames(ragenedf) <- c('PROBEID','ENSEMBL','SYMBOL', 'GENENAME','ENTREZID')
-%>% arrange(PROBEID)%>% head() ragenedf
PROBEID <int> | ENSEMBL <chr> | SYMBOL <chr> | GENENAME <chr> | ENTREZID <chr> | |
---|---|---|---|---|---|
1 | 17600199 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 | |
2 | 17600223 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 | |
3 | 17600225 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 | |
4 | 17600285 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 | |
5 | 17600287 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 | |
6 | 17600289 | ENSRNOG00000013741 | Ube2d3 | Ube2d3 |
Now, we merge the results from database: ragene21sttranscriptcluster.db and Biomart.
-conflict_prefer("select", "dplyr")
## [conflicted] Removing existing preference
-## [conflicted] Will prefer dplyr::select over any other package
-$ENSEMBL[is.na(feature_data$ENSEMBL)] <- ragenedf$ENSEMBL[match(feature_data$PROBEID,ragenedf$PROBEID)][which(is.na(feature_data$ENSEMBL))]
- feature_data$SYMBOL[is.na(feature_data$SYMBOL)] <- ragenedf$SYMBOL[match(feature_data$PROBEID,ragenedf$PROBEID)][which(is.na(feature_data$SYMBOL))]
- feature_data feature_data
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | GENENAME <chr> | ENTREZID <chr> |
---|---|---|---|---|
17600001 | NA | NA | NA | NA |
17600003 | NA | NA | NA | NA |
17600005 | NA | NA | NA | NA |
17600007 | NA | NA | NA | NA |
17600009 | NA | NA | NA | NA |
17600011 | NA | NA | NA | NA |
17600013 | NA | NA | NA | NA |
17600015 | NA | NA | NA | NA |
17600017 | NA | NA | NA | NA |
17600019 | NA | NA | NA | NA |
table(is.na(feature_data$ENSEMBL))
##
-## FALSE TRUE
-## 27182 10943
-Eventually, we managed to annotate something about 2/3 of our probes. 🙂
-Let’s look at the duplicates of PROBEID in our dataframe:
-::get_dupes(feature_data, PROBEID) %>% head() janitor
PROBEID <chr> | dupe_count <int> | ENSEMBL <chr> | SYMBOL <chr> | ||
---|---|---|---|---|---|
1 | 17610368 | 5 | ENSRNOG00000046319 | Vom2r2 | |
2 | 17610368 | 5 | ENSRNOG00000047388 | Vom2r4 | |
3 | 17610368 | 5 | ENSRNOG00000050370 | Vom2r6 | |
4 | 17610368 | 5 | ENSRNOG00000048575 | Vom2r3 | |
5 | 17610368 | 5 | ENSRNOG00000048575 | Vom2r1 | |
6 | 17612506 | 2 | ENSRNOG00000042826 | Zfp52 |
We’ll get rid PROBEID duplicates:
-<- dplyr::distinct(feature_data, PROBEID, .keep_all = TRUE) feature_data
And lastly we filter the norm_data by feature_data:
-<- norm_data[feature_data$PROBEID, ]
- norm_data
-if (any(feature_data$PROBEID != featureNames(norm_data)))
-stop("Feature data mismatch.")
-
-fData(norm_data) <- feature_data
-annotation(norm_data) <- "ragene21sttranscriptcluster.db"
Let’s look at the final number of our annotated genes.
-table(is.na(fData(norm_data)$ENSEMBL))
##
-## FALSE TRUE
-## 25813 10872
-Exploratory analysis - biological quality control
-Finally, we’ll look at our data.
-For plotting the following graphs we’ll use age_library.R script:
-<- pData(norm_data)$sample_type
- groups plot_hc(exprs(norm_data), color_by = groups, color_by_lab = "Sample type")
plot_pca(exprs(norm_data), sample_data = pData(norm_data), n_top_features = 1000, color_by = "sample_type", plot_type = "multi")$plot
plot_heatmap(
-exprs(norm_data),
- n_top_features = 1000,
- z_score = TRUE,
- column_annotation = dplyr::select(pData(norm_data), sample_type),
- title = "Affymetrix",
- legend_title = "z-score",
- show_row_names = FALSE
- )
From the plotted graphs we can say that our data are not affected by batch effect.
-Differential expression analysis (DEA)
-Let’s decide which genes are differentially expressed and which not. For that we will use linear models and package limma.
-We’ll create a matrix:
-<- pData(norm_data)$sample_type %>% factor()
- group <- relevel(group, "control")
- group <- model.matrix(~ group)
- dea_model colnames(dea_model)[1] <- "Intercept"
- dea_model
## Intercept groupmethylglyoxal
-## 1 1 0
-## 2 1 0
-## 3 1 0
-## 4 1 0
-## 5 1 0
-## 6 1 0
-## 7 1 1
-## 8 1 1
-## 9 1 1
-## 10 1 1
-## 11 1 1
-## 12 1 1
-## 13 1 1
-## attr(,"assign")
-## [1] 0 1
-## attr(,"contrasts")
-## attr(,"contrasts")$group
-## [1] "contr.treatment"
-And fit our data to it:
-<- lmFit(norm_data, dea_model) %>% eBayes()
- fit colnames(fit)
## [1] "Intercept" "groupmethylglyoxal"
-We can use the topTable function to look at the top differentially expressed genes (DEGs).
-topTable(fit, coef = "groupmethylglyoxal")
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | ||
---|---|---|---|---|
17660843 | 17660843 | ENSRNOG00000009329 | Nr1d1 | |
17847383 | 17847383 | ENSRNOG00000029543 | Cish | |
17666565 | 17666565 | ENSRNOG00000001843 | Bcl6 | |
17708216 | 17708216 | ENSRNOG00000053706 | Lonrf1 | |
17606863 | 17606863 | NA | NA | |
17797247 | 17797247 | ENSRNOG00000007964 | Tp53inp1 | |
17865965 | 17865965 | ENSRNOG00000014597 | Irs1 | |
17603643 | 17603643 | NA | NA | |
17602335 | 17602335 | NA | NA | |
17670721 | 17670721 | ENSRNOG00000001876 | Ccdc74a |
<- topTable(fit, coef = "groupmethylglyoxal")
- df_top <- dplyr::filter(as.data.frame(df_top), abs(logFC) > 1 & adj.P.Val < 0.1)
- df_top df_top
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | ||
---|---|---|---|---|
17660843 | 17660843 | ENSRNOG00000009329 | Nr1d1 | |
17847383 | 17847383 | ENSRNOG00000029543 | Cish |
However, if we look at the topTable and use threshold FDR <= 0.1 and |LFC| > 1 we obtained only 2 DEG.
-Visualisation of DEG
-Now, we’ll look closer at our DEG, which are genes: “Nr1d1” and “Cish”.
-Data preparation
-We’ll prepare dataframe containing all the samples expression from the exprs(norm_data).
-<- exprs(norm_data)%>%
- data_long as.data.frame() %>%
- ::rownames_to_column("PROBEID") %>%
- tibble::pivot_longer(-PROBEID, names_to = "sample_name", values_to = "E") %>%
- tidyr::left_join(fData(norm_data), by = "PROBEID") %>%
- dplyr::left_join(pData(norm_data), by = "sample_name")
- dplyr
-head(data_long)
PROBEID <chr> | sample_name <chr> | E <dbl> | ENSEMBL <chr> | SYMBOL <chr> | GENENAME <chr> | |
---|---|---|---|---|---|---|
17600001 | CTL1 | 2.827579 | NA | NA | NA | |
17600001 | CTL2 | 2.353455 | NA | NA | NA | |
17600001 | CTL3 | 2.674497 | NA | NA | NA | |
17600001 | CTL4 | 2.471231 | NA | NA | NA | |
17600001 | CTL5 | 2.270251 | NA | NA | NA | |
17600001 | CTL6 | 2.685155 | NA | NA | NA |
We’ll select from this dataframe only our DEG:
-which(grepl("Nr1d1", data_long$SYMBOL))
## [1] 138477 138478 138479 138480 138481 138482 138483 138484 138485 138486 138487 138488 138489
-which(grepl("Cish", data_long$SYMBOL))
## [1] 407486 407487 407488 407489 407490 407491 407492 407493 407494 407495 407496 407497 407498
-<- rbind(data_long[407486:407498,], data_long[138477:138489,]) data_long1
Boxplots
-Now, we’ll look at the boxplots of our DEG.
-plot_boxplots(
-
- data_long1,x = "sample_type",
- y = "E",
- facet_by = "SYMBOL",
- color_by = "sample_type",
- main = "Affymetrix",
- x_lab = "Sample_Group",
- y_lab = "log2(expression intensity)",
- do_t_test = FALSE
- +
- ) theme(
- axis.text.x = element_blank(),
- axis.ticks.x = element_blank()
- )
So, as we can see Cish gene is downregulated gene and Nr1d1 is a upregulated gene be methylglyoxal.
-Volcano plot
-Let’s look at volcano plot to see statistically significant genes.
-<- topTable(fit, coef = "groupmethylglyoxal",number= 10000)
- deg<- deg$adj.P.Val < 0.1
- threshold_OE $threshold <- threshold_OE
- degggplot(deg) +
-geom_point(aes(x=logFC, y=-log10(adj.P.Val), colour=threshold)) +
- ggtitle("Volcano plot of DEG") +
- xlab("log2 fold change") +
- ylab("-log10 adjusted p-value") +
- #scale_y_continuous(limits = c(0,50)) +
- theme(legend.position = "none",
- plot.title = element_text(size = rel(1.5), hjust = 0.5),
- axis.title = element_text(size = rel(1.25)))
Unfortunately, we have only 2 of DEG.
-GGplot
-Let’s look at their expressions:
-ggplot(data_long1) +
-geom_point(aes(x = SYMBOL, y = E, color = sample_type), position=position_jitter(w=0.1,h=0)) +
- scale_y_log10() +
- xlab("Genes") +
- ylab("Expression") +
- ggtitle("Significant DEGs") +
- theme_bw() +
- theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
- theme(plot.title=element_text(hjust=0.5))
Heatmap
-Gene Set Enrichment Analysis (GSEA)
-GSEA s a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes.[Source]
-However, from our data we obtained only 2 DEG, which seems to be low to run GSEA. Well, at least we’ll try and see what the results will be.
-Let’s prepare our dataframe with genes statistics from topTable.
-<- topTable(fit, coef = "groupmethylglyoxal",number = 36685)
- res_dex_shrink res_dex_shrink
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | ||
---|---|---|---|---|
17660843 | 17660843 | ENSRNOG00000009329 | Nr1d1 | |
17847383 | 17847383 | ENSRNOG00000029543 | Cish | |
17666565 | 17666565 | ENSRNOG00000001843 | Bcl6 | |
17708216 | 17708216 | ENSRNOG00000053706 | Lonrf1 | |
17606863 | 17606863 | NA | NA | |
17797247 | 17797247 | ENSRNOG00000007964 | Tp53inp1 | |
17865965 | 17865965 | ENSRNOG00000014597 | Irs1 | |
17603643 | 17603643 | NA | NA | |
17602335 | 17602335 | NA | NA | |
17670721 | 17670721 | ENSRNOG00000001876 | Ccdc74a |
Let’s filter the dataframe and get rid of duplicates
-<- res_dex_shrink %>% drop_na()
- res_dex_shrink conflict_prefer("select", "dplyr")
-#get rid of duplicates
-<- dplyr::distinct(res_dex_shrink, ENSEMBL, .keep_all = TRUE)
- res_dex_shrink<- dplyr::distinct(res_dex_shrink, PROBEID, .keep_all = TRUE)
- res_dex_shrink<- dplyr::distinct(res_dex_shrink, ENTREZID, .keep_all = TRUE)
- res_dex_shrink res_dex_shrink
PROBEID <chr> | ENSEMBL <chr> | SYMBOL <chr> | ||
---|---|---|---|---|
17660843 | 17660843 | ENSRNOG00000009329 | Nr1d1 | |
17847383 | 17847383 | ENSRNOG00000029543 | Cish | |
17666565 | 17666565 | ENSRNOG00000001843 | Bcl6 | |
17797247 | 17797247 | ENSRNOG00000007964 | Tp53inp1 | |
17865965 | 17865965 | ENSRNOG00000014597 | Irs1 | |
17670721 | 17670721 | ENSRNOG00000001876 | Ccdc74a | |
17811287 | 17811287 | ENSRNOG00000015403 | Cd52 | |
17635606 | 17635606 | ENSRNOG00000058105 | Hbb | |
17716283 | 17716283 | ENSRNOG00000061031 | Fzd8 | |
17820043 | 17820043 | ENSRNOG00000015325 | Pigf |
Prepare the vector of LFC.
-<- res_dex_shrink$logFC
- entrez_lfc names(entrez_lfc) <- res_dex_shrink$ENTREZID
-<- entrez_lfc[order(entrez_lfc, decreasing = TRUE)] entrez_lfc
Let’s create a named vector ranked based on the t-statistic values. We need to sort the t-statistic values in descending order here.
-<- res_dex_shrink$t
- t_vector names(t_vector) <- res_dex_shrink$ENTREZID
-<- sort(t_vector, decreasing = TRUE)
- t_vector head(t_vector)
## 252917 303836 297822 117054 24440 681086
-## 8.708741 5.819923 5.148367 4.843744 4.790586 4.574141
-Run GSEA on KEGG pathways:
-<- gseKEGG(
- gsea_kegg_results geneList = t_vector,
- # KEGG organism ID
- organism = "rno",
- # Key type is ENTREZ ID.
- keyType = "ncbi-geneid",
- # Correct p-values for FDR.
- pAdjustMethod = "fdr",
- # FDR adjusted p-value threshold.
- # We are OK with 10% of false positives among all pathways called significant.
- pvalueCutoff = 0.1,
- # Set a constant seed so you will get reproducible results using the same data.
- seed = 1,
- verbose = TRUE
- )
See the results below:
-as.data.frame(gsea_kegg_results)
ID <chr> | Description <chr> | ||
---|---|---|---|
rno05168 | rno05168 | Herpes simplex virus 1 infection | |
rno04141 | rno04141 | Protein processing in endoplasmic reticulum | |
rno03450 | rno03450 | Non-homologous end-joining | |
rno05144 | rno05144 | Malaria | |
rno03060 | rno03060 | Protein export | |
rno05225 | rno05225 | Hepatocellular carcinoma | |
rno05143 | rno05143 | African trypanosomiasis | |
rno00980 | rno00980 | Metabolism of xenobiotics by cytochrome P450 | |
rno00020 | rno00020 | Citrate cycle (TCA cycle) |
We will convert ENTREZIDs to gene symbols using our rat gene database:
-<- setReadable(gsea_kegg_results, "ragene21sttranscriptcluster.db", keyType = "ENTREZID")
- gsea_kegg_results <- dplyr::filter(gsea_kegg_results, p.adjust < 0.05) gsea_kegg_results_filtered
Finally, we’ll plot the results:
-GSEA plot
-library(enrichplot)
-gseaplot2(gsea_kegg_results, geneSetID = 1:3, pvalue_table = TRUE, ES_geom = "dot")
Dotplot
-conflict_prefer("dotplot", "enrichplot")
-dotplot(gsea_kegg_results, showCategory = 15, x = "GeneRatio", font.size = 10)
Gene-Concept Network
-<- cnetplot(gsea_kegg_results, showCategory = 3, foldChange = entrez_lfc, colorEdge = TRUE)
- p <- here("gsea_cnetplot.png")
- p_file ggsave(p_file, p, device = "png", width = 10, height = 10)
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider increasing max.overlaps
-Gene-Concept Network in heatmap
-<- heatplot(gsea_kegg_results, foldChange = entrez_lfc, showCategory = 3) +
- p theme(axis.text.x = element_text(angle = 90, vjust = 2, size = 7))
- <- here("gsea_heatmap.png")
- p_file ggsave(p_file, p, device = "png", width = 15, height = 5)
Enrichment map
-<- pairwise_termsim(gsea_kegg_results)
- gsea_kegg_results emapplot(gsea_kegg_results, color = "NES", showCategory = 20)
PubMed Central plot
-<- gsea_kegg_results$Description[1:3]
- terms <- pmcplot(terms, 2010:2017)
- p <- pmcplot(terms, 2010:2017, proportion = FALSE)
- p2 ::wrap_plots(p, p2, ncol = 2, guides = "collect") patchwork
Signaling pathway impact analysis (SPIA)
-Now we can use our results from GSEA to run SPIA. SPIA calculates: Over-representation (ORA) of DEGs in the pathway and perturbation (PERT): the influence of observed expression changes on pathway output.
-So, let’s try i out.
-<- here("kegg_data")
- KEGG_DATA_DIR dir.create(KEGG_DATA_DIR)
## Warning in dir.create(KEGG_DATA_DIR): '/data/persistent/liskovaf/bio-class-deb10/AGE2021/Exercises/E04-
-## microarrays/kegg_data' already exists
-<- gsea_kegg_results@result$ID[1:5]
- kegg_ids ::map(kegg_ids, ~ download.file(glue("http://rest.kegg.jp/get/{.}/kgml"), glue("{KEGG_DATA_DIR}/{.}.xml"))) purrr
## [[1]]
-## [1] 0
-##
-## [[2]]
-## [1] 0
-##
-## [[3]]
-## [1] 0
-##
-## [[4]]
-## [1] 0
-##
-## [[5]]
-## [1] 0
-library(SPIA)
-
-makeSPIAdata(
-kgml.path = KEGG_DATA_DIR,
- organism = "rno",
- out.path = KEGG_DATA_DIR
- )
## [1] TRUE
-Let’s set our thresholds and prepare gene statistics for SPIA.
-<- 0.1
- PADJ_THRESHOLD <- 1
- LFC_THRESHOLD
-<- topTable(fit, coef = "groupmethylglyoxal", number= 1000)
- res_dex_shrink_ann <- res_dex_shrink_ann %>% drop_na()
- res_dex_shrink_ann <- dplyr::distinct(res_dex_shrink_ann, ENSEMBL, .keep_all = TRUE)
- res_dex_shrink_ann
-<- (!is.na(res_dex_shrink_ann$adj.P.Val)) & (res_dex_shrink_ann$adj.P.Val <= PADJ_THRESHOLD) & (abs(res_dex_shrink_ann$logFC) >= LFC_THRESHOLD)
- deg_filter
-<- res_dex_shrink[deg_filter, "ENTREZID"] deg_entrezids
Vector of LFCs of DEGs. Names are ENTREZIDs.
-<- entrez_lfc[deg_entrezids]
- entrez_lfc_deg <- entrez_lfc_deg[order(entrez_lfc_deg, decreasing = TRUE)] entrez_lfc_deg
<- spia(
- spia_results de = entrez_lfc_deg,
- all = res_dex_shrink$ENTREZID,
- organism = "rno",
- data.dir = paste0(KEGG_DATA_DIR, "/")
- )
##
-## Done pathway 1 : cGMP-PKG signaling pathway..
-## Done pathway 2 : Protein processing in endoplas..
-## Done pathway 3 : Prolactin signaling pathway..
-## Done pathway 4 : Malaria..
-## Done pathway 5 : Herpes simplex virus 1 infecti..
-## Done pathway 6 : Epstein-Barr virus infection..
-## Done pathway 7 : Transcriptional misregulation ..
-## Done pathway 8 : Viral myocarditis..
- spia_results
Name <chr> | ID <chr> | pSize <int> | NDE <int> | pNDE <dbl> | |
---|---|---|---|---|---|
Prolactin signaling pathway | 04917 | 74 | 2 | 0.04313748 | |
Herpes simplex virus 1 infection | 05168 | 308 | 2 | 0.40019323 | |
Epstein-Barr virus infection | 05169 | 206 | 1 | 0.60365976 |
plotP(spia_results, threshold = 0.3)
Report
-Now, we’ll create the HTML tables and signpost by running script: report.R
-##
-##
-## DEA for group 'groupmethylglyoxal':
-##
-##
-## processing file: dea_table_template.Rmd
-##
- |
- | | 0%
- |
- |................................................ | 50%
-## inline R code fragments
-##
-##
- |
- |.................................................................................................| 100%
-## label: unnamed-chunk-93 (with options)
-## List of 1
-## $ echo: logi FALSE
-## output file: dea_table_template.knit.md
-## /usr/lib/rstudio-server/bin/pandoc/pandoc +RTS -K512m -RTS dea_table_template.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output groupmethylglyoxal.html --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --standalone --section-divs --template /usr/local/lib/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --include-in-header /tmp/Rtmp1E0Z6U/rmarkdown-str4c0ceda1077.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
-##
-## Output created: groupmethylglyoxal.html
-##
-##
-## processing file: dea_signpost.Rmd
-##
- |
- | | 0%
- |
- |................................ | 33%
-## ordinary text without R code
-##
-##
- |
- |................................................................. | 67%
-## label: unnamed-chunk-1 (with options)
-## List of 1
-## $ echo: logi FALSE
-##
-##
- |
- |.................................................................................................| 100%
-## ordinary text without R code
-## output file: dea_signpost.knit.md
-## /usr/lib/rstudio-server/bin/pandoc/pandoc +RTS -K512m -RTS dea_signpost.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output dea_signpost.html --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --standalone --section-divs --template /usr/local/lib/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --include-in-header /tmp/Rtmp1E0Z6U/rmarkdown-str4c0cd323440.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
-##
-## Output created: dea_signpost.html
-