This page holds an overview and instructions for how GeneLab processes Agilent 1-channel microarray datasets*. Exact processing commands and GL-DPPD-7112 version used for specific datasets are provided with their processed data in the Open Science Data Repository (OSDR).
* The pipeline detailed below is currently used for animal studies only; it will be updated soon to support processing of plant and microbe microarray data.
Date: February 13, 2023
Revision: -
Document Number: GL-DPPD-7112
Submitted by:
Jonathan Oribello (GeneLab Data Processing Team)
Approved by:
Sylvain Costes (GeneLab Project Manager)
Samrawit Gebre (GeneLab Deputy Project Manager)
Amanda Saravia-Butler (GeneLab Data Processing Lead)
Lauren Sanders (acting GeneLab Project Scientist)
- Software used
- General processing overview with example commands
|Program|Version|Relevant Links|
|:------|:------|:-------------|
|R|4.1.3|https://www.r-project.org|
|DT|0.26|https://github.com/rstudio/DT|
|dplyr|1.0.10|https://dplyr.tidyverse.org|
|stringr|1.5.0|https://stringr.tidyverse.org|
|R.utils|2.12.2|https://github.com/HenrikBengtsson/R.utils|
|limma|3.50.3|https://bioconductor.org/packages/3.14/bioc/html/limma.html|
|glue|1.6.2|https://glue.tidyverse.org|
|ggplot2|3.4.0|https://ggplot2.tidyverse.org|
|biomaRt|2.50.0|https://bioconductor.org/packages/3.14/bioc/html/biomaRt.html|
|matrixStats|0.63.0|https://github.com/HenrikBengtsson/matrixStats|
|statmod|1.5.0|https://github.com/cran/statmod|
|dp_tools|1.3.1|https://github.com/J-81/dp_tools|
|singularity|3.9|https://sylabs.io|
|Quarto|1.2.313|https://quarto.org|
Exact processing commands and output files listed in bold below are included with each Microarray processed dataset in the Open Science Data Repository (OSDR).
Notes:
Rather than running the commands below to create the runsheet needed for processing, the runsheet may also be created manually by following the file specification.
These command line tools are part of the dp_tools program.
### Download the *ISA.zip file from the GeneLab Repository ###
dpt-get-isa-archive \
--accession OSD-###
### Parse the metadata from the *ISA.zip file to create a sample runsheet ###
dpt-isa-to-runsheet --accession OSD-### \
--config-type microarray \
--config-version Latest \
--isa-archive *ISA.zip
Parameter Definitions:
--accession OSD-###
- OSD accession ID (replace ### with the OSD number being processed), used to retrieve the URLs for the ISA archive and raw expression files hosted on the GeneLab Repository

--config-type
- Instructs the script to extract the metadata required for microarray processing from the ISA archive

--config-version
- Specifies the dp_tools configuration version to use; a value of Latest will specify the most recent version

--isa-archive
- Specifies the *ISA.zip file for the respective GLDS dataset, downloaded in the dpt-get-isa-archive command
Input Data:

- No input data required but the OSD accession ID needs to be indicated, which is used to download the respective ISA archive

Output Data:

- *ISA.zip (compressed ISA directory containing Investigation, Study, and Assay (ISA) metadata files for the respective OSD dataset, used to define sample groups; the *ISA.zip file is located in the OSD repository under 'Study Files' -> 'metadata')
- {OSD-Accession-ID}_microarray_v{version}_runsheet.csv (table containing metadata required for processing; version denotes the dp_tools schema used to specify the metadata to extract from the ISA archive)
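For reference, the sketch below builds a minimal runsheet containing only the columns this pipeline reads downstream; all sample names, file names, URLs, and factor values are hypothetical placeholders, and the authoritative column set is defined by the runsheet file specification.

# Illustrative sketch only: a minimal runsheet with hypothetical values
minimal_runsheet <- data.frame(
  check.names = FALSE, # preserve spaces and brackets in column names
  `Sample Name` = c("Sample1", "Sample2"),
  organism = "Mus musculus",
  `Array Data File Name` = c("sample1_raw.txt", "sample2_raw.txt"),
  `Array Data File Path` = c( # 'https' paths are downloaded in Step 2; '.gz' files are uncompressed
    "https://example.osdr.url/sample1_raw.txt.gz",
    "https://example.osdr.url/sample2_raw.txt.gz"
  ),
  biomart_attribute = "Agilent WholeGenome 4x44k v1",
  `Factor Value[Spaceflight]` = c("Space Flight", "Ground Control")
)
write.csv(minimal_runsheet, "{OSD-Accession-ID}_microarray_v{version}_runsheet.csv", row.names = FALSE)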
Note: Steps 2 - 7 are done in R
### Install R packages if not already installed ###
install.packages("tidyverse")
install.packages("R.utils")
install.packages("glue")
install.packages("matrixStats")
install.packages("statmod")
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.14")
BiocManager::install("limma")
BiocManager::install("biomaRt")
### Import libraries ###
library(tidyverse)
library(dplyr)
library(stringr)
library(R.utils)
library(glue)
library(matrixStats)
library(limma)
library(biomaRt)
library(statmod)
# Define path to runsheet
runsheet <- "/path/to/runsheet/{OSD-Accession-ID}_microarray_v{version}_runsheet.csv"
## Set up output structure
# Output Constants
DIR_RAW_DATA <- "00-RawData"
DIR_NORMALIZED_EXPRESSION <- "01-limma_NormExp"
DIR_DGE <- "02-limma_DGE"
dir.create(DIR_RAW_DATA)
dir.create(DIR_NORMALIZED_EXPRESSION)
dir.create(DIR_DGE)
# fileEncoding = 'UTF-8-BOM' strips the byte order mark so it does not corrupt the first column name
df_rs <- read.csv(runsheet, check.names = FALSE, fileEncoding = 'UTF-8-BOM')
## Determines the organism specific annotation file to use based on the organism in the runsheet
fetch_organism_specific_annotation_file_path <- function(organism) {
# Uses the GeneLab GL-DPPD-7110_annotations.csv file to find the organism specific annotation file path
# Raises an exception if the organism does not have an associated annotation file yet
all_organism_table <- read.csv("https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv")
annotation_file_path <- all_organism_table %>% dplyr::filter(species == organism) %>% dplyr::pull(genelab_annots_link)
# Guard clause: Ensure annotation_file_path populated
# Else: raise exception for unsupported organism
if (length(annotation_file_path) == 0) {
stop(glue::glue("Organism supplied '{organism}' is not supported. See the following url for supported organisms: https://github.com/nasa/GeneLab_Data_Processing/blob/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv. Supported organisms will correspond to a row based on the 'species' column and include a url in the 'genelab_annots_link' column of that row"))
}
return(annotation_file_path)
}
annotation_file_path <- fetch_organism_specific_annotation_file_path(unique(df_rs$organism))
allTrue <- function(i_vector) {
if ( length(i_vector) == 0 ) {
stop(paste("Input vector is length zero"))
}
all(i_vector)
}
# Define paths to raw data files
runsheetPathsAreURIs <- function(df_runsheet) {
allTrue(stringr::str_starts(df_runsheet$`Array Data File Path`, "https"))
}
# Download raw data files
downloadFilesFromRunsheet <- function(df_runsheet) {
urls <- df_runsheet$`Array Data File Path`
destinationFiles <- df_runsheet$`Array Data File Name`
mapply(function(url, destinationFile) {
print(paste0("Downloading from '", url, "' TO '", destinationFile, "'"))
if ( file.exists(destinationFile ) ) {
warning(paste( "Using Existing File:", destinationFile ))
} else {
download.file(url, destinationFile)
}
}, urls, destinationFiles)
destinationFiles # Return these paths
}
if ( runsheetPathsAreURIs(df_rs) ) {
print("Determined Raw Data Locations are URIS")
local_paths <- downloadFilesFromRunsheet(df_rs)
} else {
print("Or Determined Raw Data Locations are local paths")
local_paths <- df_rs$`Array Data File Path`
}
# uncompress files if needed
if ( allTrue(stringr::str_ends(local_paths, ".gz")) ) {
print("Determined these files are gzip compressed... uncompressing now")
# This does the uncompression
lapply(local_paths, R.utils::gunzip, remove = FALSE, overwrite = TRUE)
# This removes the .gz extension to get the uncompressed filenames
local_paths <- vapply(local_paths,
stringr::str_replace, # Run this function against each item in 'local_paths'
FUN.VALUE = character(1), # Expect a character vector as the return
USE.NAMES = FALSE, # Don't use the input to assign names for the returned list
pattern = ".gz$", # first argument for applied function
replacement = "" # second argument for applied function
)
}
df_local_paths <- data.frame(`Sample Name` = df_rs$`Sample Name`, `Local Paths` = local_paths, check.names = FALSE)
# Load raw data into R object
raw_data <- limma::read.maimages(df_local_paths$`Local Paths`,
source = "agilent", # Specify platform
green.only = TRUE, # Specify one-channel design
names = df_local_paths$`Sample Name` # Map column names as Sample Names (instead of default filenames)
)
# Handle raw data which lacks certain replaceable column data
## This likely arises because Agilent Feature Extraction (the process that generates the raw data files on OSDR)
## gives users some flexibility in which probe columns to output
## Missing ProbeUID "Unique integer for each unique probe in a design"
### Source: https://www.agilent.com/cs/library/usermanuals/public/GEN-MAN-G4460-90057.pdf Page 178
### Remedy: Assign unique integers for each probe
if ( !("ProbeUID" %in% colnames(raw_data$genes)) ) {
# Assign unique integers for each probe
print("Assigning `ProbeUID` as original files did not include them")
raw_data$genes$ProbeUID <- seq_len(nrow(raw_data$genes))
}
# Summarize raw data
print(paste0("Number of Arrays: ", dim(raw_data)[2]))
print(paste0("Number of Probes: ", dim(raw_data)[1]))
Input Data:

- runsheet (Path to runsheet, output from Step 1)

Output Data:

- df_rs (R dataframe containing information from the runsheet)
- raw_data (R object containing raw microarray data)

Note: The raw data R object will be used to generate quality assessment (QA) plots in the next step.
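As an optional sanity check (not part of the pipeline), the EListRaw object returned by limma::read.maimages can be inspected directly; the component names below are standard limma slots.

# Optional: inspect the EListRaw components
head(raw_data$E)     # foreground intensities, one column per array
head(raw_data$Eb)    # background intensities, used during background correction
head(raw_data$genes) # probe metadata (e.g. ProbeUID, ProbeName, ControlType)
raw_data$targets     # sample-to-file mapping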
# Plot settings
par(
xpd = TRUE # Ensure legend can extend past plot area
)
number_of_sets = ceiling(dim(raw_data)[2] / 30) # Number of 30-sample sets, used to scale the plot legend
limma::plotDensities(raw_data,
log = TRUE,
legend = FALSE)
legend("topright", legend = colnames(raw_data),
lty = 1, # Solid line
col = 1:ncol(raw_data), # Ensure legend color is in sync with plot
ncol = number_of_sets, # Set number of columns by number of sets
cex = max(0.5, 1 + 0.2 - (number_of_sets*0.2)) # Reduce scale by 20% for each column beyond 1, minimum of 0.5
)
Input Data:

- raw_data (raw data R object created in Step 2 above)
Output Data:
- Plot containing the density of raw intensities for each array (raw intensity values with background intensity values subtracted; lack of overlap indicates a need for normalization)
agilentImagePlot <- function(eListRaw, transform_func = identity) {
# Adapted from this discussion: https://support.bioconductor.org/p/15523/
copy_raw_data <- eListRaw
copy_raw_data$genes$Block <- 1 # Agilent arrays only have one block
names(copy_raw_data$genes)[2] <- "Column"
copy_raw_data$printer <- limma::getLayout(copy_raw_data$genes)
r <- copy_raw_data$genes$Row
c <- copy_raw_data$genes$Column
nr <- max(r)
nc <- max(c)
y <- rep(NA,nr*nc)
i <- (r-1)*nc+c
for ( array_i in seq(colnames(copy_raw_data$E)) ) {
y[i] <- transform_func(copy_raw_data$E[,array_i])
limma::imageplot(y,copy_raw_data$printer, main = rownames(copy_raw_data$targets)[array_i])
}
}
agilentImagePlot(raw_data, transform_func = function(expression_matrix) log2(expression_matrix + 1))
Input Data:

- raw_data (raw data R object created in Step 2 above)
Output Data:
- Pseudo images of each array before background correction and normalization
for ( array_i in seq(colnames(raw_data$E)) ) {
sample_name <- rownames(raw_data$targets)[array_i]
limma::plotMA(raw_data,array=array_i,xlab="Average log-expression",ylab="Expression log-ratio (this sample vs. others)", main = sample_name, status=raw_data$genes$ControlType)
}
Input Data:

- raw_data (raw data R object created in Step 2 above)
Output Data:
- M (log ratio of the subject array vs a pseudo-reference, the mean of all other arrays) vs. A (average log expression) plot for each array before background correction and normalization (negative and positive control probes are in green and red, respectively)
for ( array_i in seq(colnames(raw_data$E)) ) {
sample_name <- rownames(raw_data$targets)[array_i]
limma::plotFB(raw_data, array = array_i, xlab = "log2 Background", ylab = "log2 Foreground", main = sample_name)
}
Input Data:

- raw_data (raw data R object created in Step 2 above)
Output Data:
- Foreground vs. background expression plot for each array before background correction and normalization
boxplotExpressionSafeMargin <- function(data, transform_func = identity, xlab = "Log2 Intensity") {
# Basic box plot
df_data <- as.data.frame(transform_func(data$E))
ggplot2::ggplot(stack(df_data), ggplot2::aes(x=values, y=ind)) +
ggplot2::geom_boxplot() +
ggplot2::scale_y_discrete(limits=rev) +
ggplot2::labs(y= "Sample Name", x = xlab)
}
boxplotExpressionSafeMargin(raw_data, transform_func = log2)
Input Data:

- raw_data (raw data R object created in Step 2 above)
Output Data:
- Boxplot of raw expression data for each array before background correction and normalization
background_corrected_data <- limma::backgroundCorrect(raw_data, method = "normexp")
Input Data:

- raw_data (raw data R object created in Step 2 above)

Output Data:

- background_corrected_data (R object containing background-corrected microarray data)

Note: Background correction was performed using the normexp method as recommended by Ritchie, M.E., et al. This method uses the normexp.fit.control function to estimate the parameters required by the normal+exponential (normexp) convolution model with the help of the negative control probes, followed by the normexp.signal function to perform the background correction.
# Normalize background-corrected data using the quantile method
norm_data <- limma::normalizeBetweenArrays(background_corrected_data, method = "quantile")
# Summarize background-corrected and normalized data
print(paste0("Number of Arrays: ", dim(norm_data)[2]))
print(paste0("Number of Probes: ", dim(norm_data)[1]))
Input Data:

- background_corrected_data (R object containing background-corrected microarray data created in Step 4 above)

Output Data:

- norm_data (R object containing background-corrected and normalized microarray data)

Note: Normalization was performed using the quantile method, which forces the entire empirical distribution of all arrays to be identical.
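To illustrate what the quantile method does, the toy example below (not part of the pipeline) normalizes a small two-array matrix; after normalization, every column contains the same set of values, assigned back in each column's original rank order.

# Toy illustration of quantile normalization (not part of the pipeline)
toy <- matrix(c(1, 4, 9,
                2, 8, 18), nrow = 3)
limma::normalizeBetweenArrays(toy, method = "quantile")
# Both columns now contain the row means of the sorted columns (1.5, 6, 13.5)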
# Plot settings
par(
xpd = TRUE # Ensure legend can extend past plot area
)
number_of_sets = ceiling(dim(norm_data)[2] / 30) # Number of 30-sample sets, used to scale the plot legend
limma::plotDensities(norm_data,
log = TRUE,
legend = FALSE)
legend("topright", legend = colnames(norm_data),
lty = 1, # Solid line
col = 1:ncol(norm_data), # Ensure legend color is in sync with plot
ncol = number_of_sets, # Set number of columns by number of sets
cex = max(0.5, 1 + 0.2 - (number_of_sets*0.2)) # Reduce scale by 20% for each column beyond 1, minimum of 0.5
)
Input Data:

- norm_data (R object containing background-corrected and normalized microarray data created in Step 5 above)
Output Data:
- Plot containing the density of background-corrected and normalized intensities for each array (near complete overlap is expected after normalization)
agilentImagePlot(norm_data,
transform_func = function(expression_matrix) log2(2**expression_matrix + 1) # Convert log2 values back to linear scale, add a +1 offset, then re-apply log2 to prevent negative values in the pseudoimage
)
Input Data:

- norm_data (R object containing background-corrected and normalized microarray data created in Step 5 above)
Output Data:
- Pseudo images of each array after background correction and normalization
for ( array_i in seq(colnames(norm_data$E)) ) {
sample_name <- rownames(norm_data$targets)[array_i]
limma::plotMA(norm_data,array=array_i,xlab="Average log-expression",ylab="Expression log-ratio (this sample vs. others)", main = sample_name, status=norm_data$genes$ControlType)
}
Input Data:

- norm_data (R object containing background-corrected and normalized microarray data created in Step 5 above)
Output Data:
- M (log ratio of the subject array vs a pseudo-reference, the mean of all other arrays) vs. A (average log expression) plot for each array after background correction and normalization (negative and positive control probes are in green and red, respectively)
boxplotExpressionSafeMargin(norm_data)
Input Data:

- norm_data (R object containing background-corrected and normalized microarray data created in Step 5 above)
Output Data:
- Boxplot of expression data for each array after background correction and normalization
shortenedOrganismName <- function(long_name) {
#' Convert organism names like 'Homo Sapiens' into 'hsapiens'
tokens <- long_name %>% stringr::str_split(" ", simplify = TRUE)
genus_name <- tokens[1]
species_name <- tokens[2]
short_name <- stringr::str_to_lower(paste0(substr(genus_name, start = 1, stop = 1), species_name))
return(short_name)
}
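Brief usage examples of the helper above:

shortenedOrganismName("Mus musculus")         # returns "mmusculus"
shortenedOrganismName("Arabidopsis thaliana") # returns "athaliana"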
getBioMartAttribute <- function(df_rs) {
#' Returns resolved biomart attribute source from runsheet
# check if runsheet has 'biomart_attribute' column
if ( !is.null(df_rs$`biomart_attribute`) ) {
print("Using attribute name sourced from runsheet")
# Format according to biomart needs
formatted_value <- unique(df_rs$`biomart_attribute`) %>%
stringr::str_replace_all(" ","_") %>% # Replace all spaces with underscore
stringr::str_to_lower() # Lower casing only
return(formatted_value)
} else {
stop("ERROR: Could not find 'biomart_attribute' in runsheet")
}
}
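As a concrete illustration (hypothetical runsheet value), the formatting above replaces spaces with underscores and lower-cases the result:

# e.g. a runsheet 'biomart_attribute' value of "Agilent WholeGenome 4x44k v1"
# is formatted to "agilent_wholegenome_4x44k_v1" for use as a biomart attribute name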
get_ensembl_genomes_mappings_from_ftp <- function(organism, ensembl_genomes_portal, ensembl_genomes_version, biomart_attribute) {
#' Obtain mapping table directly from ftp. Useful when biomart live service no longer exists for desired version
request_url <- glue::glue("https://ftp.ebi.ac.uk/ensemblgenomes/pub/{ensembl_genomes_portal}/release-{ensembl_genomes_version}/mysql/{ensembl_genomes_portal}_mart_{ensembl_genomes_version}/{organism}_eg_gene__efg_{biomart_attribute}__dm.txt.gz")
print(glue::glue("Mappings file URL: {request_url}"))
# Create a temporary file name
temp_file <- tempfile(fileext = ".gz")
# Download the gzipped table file using the download.file function
download.file(url = request_url, destfile = temp_file, method = "libcurl") # Use 'libcurl' to support ftps
# Uncompress the file
uncompressed_temp_file <- tempfile()
gzcon <- gzfile(temp_file, "rt")
content <- readLines(gzcon)
writeLines(content, uncompressed_temp_file)
close(gzcon)
# Load the data into a dataframe
mapping <- read.table(uncompressed_temp_file, # Read the uncompressed file
# Assign column names: MAPID, ensembl_gene_id, and the biomart attribute (probe ID)
col.names = c("MAPID", "ensembl_gene_id", biomart_attribute),
header = FALSE, # No header in original table
sep = "\t") # Tab separated
# Clean up temporary files
unlink(temp_file)
unlink(uncompressed_temp_file)
return(mapping)
}
organism <- shortenedOrganismName(unique(df_rs$organism))
if (organism %in% c("athaliana")) {
ensembl_genomes_version = "54"
ensembl_genomes_portal = "plants"
print(glue::glue("Using ensembl genomes ftp to get specific version of probe id mapping table. Ensembl genomes portal: {ensembl_genomes_portal}, version: {ensembl_genomes_version}"))
expected_attribute_name <- getBioMartAttribute(df_rs)
df_mapping <- get_ensembl_genomes_mappings_from_ftp(
organism = organism,
ensembl_genomes_portal = ensembl_genomes_portal,
ensembl_genomes_version = ensembl_genomes_version,
biomart_attribute = expected_attribute_name
)
# TAIR from the mapping tables tend to be in the format 'AT1G01010.1' but the raw data has 'AT1G01010'
# So here we remove the '.NNN' from the mapping table where .NNN is any number
df_mapping$ensembl_gene_id <- stringr::str_replace_all(df_mapping$ensembl_gene_id, "\\.\\d+$", "")
} else {
# Use biomart from the main Ensembl website, whose archives keep each release available on the live service
# locate dataset
expected_dataset_name <- shortenedOrganismName(unique(df_rs$organism)) %>% stringr::str_c("_gene_ensembl")
print(paste0("Expected dataset name: '", expected_dataset_name, "'"))
# Specify Ensembl version used in current GeneLab reference annotations
ENSEMBL_VERSION <- '107'
print(glue::glue("Using Ensembl biomart to get specific version of mapping table. Ensembl version: {ENSEMBL_VERSION}"))
ensembl <- biomaRt::useEnsembl(biomart = "genes",
dataset = expected_dataset_name,
version = ENSEMBL_VERSION)
print(ensembl)
expected_attribute_name <- getBioMartAttribute(df_rs)
print(paste0("Expected attribute name: '", expected_attribute_name, "'"))
probe_ids <- unique(norm_data$genes$ProbeName)
# Create probe map
# Run Biomart Queries in chunks to prevent request timeouts
# Note: If timeouts occur (possibly due to larger load on biomart), reduce the chunk size
CHUNK_SIZE <- 1500
probe_id_chunks <- split(probe_ids, ceiling(seq_along(probe_ids) / CHUNK_SIZE))
df_mapping <- data.frame()
for (i in seq_along(probe_id_chunks)) {
probe_id_chunk <- probe_id_chunks[[i]]
print(glue::glue("Running biomart query chunk {i} of {length(probe_id_chunks)}. Total probes IDS in query ({length(probe_id_chunk)})"))
chunk_results <- biomaRt::getBM(
attributes = c(
expected_attribute_name,
"ensembl_gene_id"
),
filters = expected_attribute_name,
values = probe_id_chunk,
mart = ensembl)
df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
Sys.sleep(10) # Slight break between requests to prevent back-to-back requests
}
}
# At this point, we have df_mapping from either the biomart live service or the ensembl genomes ftp archive depending on the organism
# Convert list of multi-mapped genes to string
listToUniquePipedString <- function(str_list) {
#' Convert lists into strings denoting unique elements separated by '|' characters
#' e.g. c("GO1","GO2","GO2","GO3") -> "GO1|GO2|GO3"
return(toString(unique(str_list)) %>% stringr::str_replace_all(pattern = stringr::fixed(", "), replacement = "|"))
}
unique_probe_ids <- df_mapping %>%
dplyr::group_by(!!sym(expected_attribute_name)) %>%
dplyr::summarise(
ENSEMBL = listToUniquePipedString(ensembl_gene_id)
) %>%
# Count number of ensembl IDS mapped
dplyr::mutate(
count_ENSEMBL_mappings = 1 + stringr::str_count(ENSEMBL, stringr::fixed("|"))
)
norm_data$genes <- norm_data$genes %>%
dplyr::left_join(unique_probe_ids, by = c("ProbeName" = expected_attribute_name ) ) %>%
dplyr::mutate( count_ENSEMBL_mappings = ifelse(is.na(ENSEMBL), 0, count_ENSEMBL_mappings) )
Input Data:

- df_rs$organism (organism specified in the runsheet created in Step 1)
- df_rs$`Array Design REF` (array design reference specified in the runsheet created in Step 1)
- ENSEMBL_VERSION (reference organism Ensembl version indicated in the ensemblVersion column of the GL-DPPD-7110_annotations.csv GeneLab Annotations file)
- norm_data$genes (Manufacturer's probe metadata, including probe IDs and sequence position gene annotations, associated with the norm_data R object containing background-corrected and normalized microarray data created in Step 5)

Output Data:

- norm_data$genes (Probe metadata, updated to include gene annotations specified by Biomart)
# Pie Chart with Percentages
slices <- c(
  'Control probes' = nrow(norm_data$genes %>% dplyr::filter(ControlType != 0) %>% dplyr::distinct(ProbeName)),
  'Unique Mapping' = nrow(norm_data$genes %>% dplyr::filter(ControlType == 0) %>% dplyr::filter(count_ENSEMBL_mappings == 1) %>% dplyr::distinct(ProbeName)),
  'Multi Mapping' = nrow(norm_data$genes %>% dplyr::filter(ControlType == 0) %>% dplyr::filter(count_ENSEMBL_mappings > 1) %>% dplyr::distinct(ProbeName)),
  'No Mapping' = nrow(norm_data$genes %>% dplyr::filter(ControlType == 0) %>% dplyr::filter(count_ENSEMBL_mappings == 0) %>% dplyr::distinct(ProbeName))
)
pct <- round(slices/sum(slices)*100)
chart_names <- glue::glue("{names(slices)} ({slices})") # add counts to labels
chart_names <- paste(chart_names, pct) # add percents to labels
chart_names <- paste(chart_names, "%", sep="") # add % sign to labels
pie(slices, labels = chart_names, col = rainbow(length(slices)),
    main = glue::glue("Biomart Mapping to Ensembl Primary Keytype\n {nrow(norm_data$genes %>% dplyr::distinct(ProbeName))} Total Unique Probes")
)
original_mapping_count <- nrow(norm_data$genes %>% dplyr::filter(ControlType == 0) %>% dplyr::filter(ProbeName != SystematicName) %>% dplyr::distinct(ProbeName))
print(glue::glue("Original Manufacturer Reported Mapping Count: {original_mapping_count}"))
print(glue::glue("Biomart Unique Mapping Count: {slices[['Unique Mapping']]}"))
Input Data:

- norm_data$genes (Probe metadata, updated to include gene annotations specified by Biomart, output from Step 7a above)
Output Data:
- A pie chart denoting the biomart mapping rates for each unique probe ID
- A printout denoting the count of unique mappings for both the original manufacturer mapping and the biomart mapping
# Pull all factors for each sample in the study from the runsheet created in Step 1
runsheetToDesignMatrix <- function(runsheet_path) {
df = read.csv(runsheet_path)
# get only Factor Value columns
factors = as.data.frame(df[,grep("Factor.Value", colnames(df), ignore.case=TRUE)])
colnames(factors) = paste("factor",1:dim(factors)[2], sep= "_")
# Load metadata from runsheet csv file
compare_csv = data.frame(sample_id = df[,c("Sample.Name")], factors)
# Create data frame containing all samples and respective factors
study <- as.data.frame(compare_csv[,2:dim(compare_csv)[2]])
colnames(study) <- colnames(compare_csv)[2:dim(compare_csv)[2]]
rownames(study) <- compare_csv[,1]
# Format groups and indicate the group that each sample belongs to
if (dim(study)[2] >= 2){
group<-apply(study,1,paste,collapse = " & ") # concatenate multiple factors into one condition per sample
} else{
group<-study[,1]
}
group_names <- paste0("(",group,")",sep = "") # human readable group names
group <- sub("^BLOCKER_", "", make.names(paste0("BLOCKER_", group))) # group naming compatible with R models; this maintains the default behaviour of make.names except that 'X' is never prepended to group names
names(group) <- group_names
# Format contrasts table, defining pairwise comparisons for all groups
contrast.names <- combn(levels(factor(names(group))),2) # generate matrix of pairwise group combinations for comparison
contrasts <- apply(contrast.names, MARGIN=2, function(col) sub("^BLOCKER_", "", make.names(paste0("BLOCKER_", stringr::str_sub(col, 2, -2)))))
contrast.names <- c(paste(contrast.names[1,],contrast.names[2,],sep = "v"),paste(contrast.names[2,],contrast.names[1,],sep = "v")) # format combinations for output table files names
contrasts <- cbind(contrasts,contrasts[c(2,1),])
colnames(contrasts) <- contrast.names
sampleTable <- data.frame(condition=factor(group))
rownames(sampleTable) <- df[,c("Sample.Name")]
condition <- sampleTable[,'condition']
names_mapping <- as.data.frame(cbind(safe_name = as.character(condition), original_name = group_names))
design <- model.matrix(~ 0 + condition)
design_data <- list( matrix = design, mapping = names_mapping, groups = as.data.frame( cbind(sample = df[,c("Sample.Name")], group = group_names) ), contrasts = contrasts )
return(design_data)
}
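The BLOCKER_ prefix trick above merits a quick illustration (group name hypothetical): make.names alone prepends 'X' to any sanitized name that does not start with a letter (or a dot not followed by a number), whereas prefixing first and stripping afterwards keeps the rest of make.names' sanitization without the leading 'X'.

# Illustration with a hypothetical group name starting with a digit
group_example <- "(1g & Flight)"
make.names(group_example)                                            # "X.1g...Flight."
sub("^BLOCKER_", "", make.names(paste0("BLOCKER_", group_example)))  # ".1g...Flight."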
# Loading metadata from runsheet csv file
design_data <- runsheetToDesignMatrix(runsheet)
design <- design_data$matrix
# Write SampleTable.csv and contrasts.csv files
write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable_GLmicroarray.csv"), row.names = FALSE)
write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts_GLmicroarray.csv"))
Input Data:

- runsheet (Path to runsheet, output from Step 1)

Output Data:

- design (R object containing the limma study design matrix, indicating the group that each sample belongs to)
- SampleTable_GLmicroarray.csv (table containing samples and their respective groups)
- contrasts_GLmicroarray.csv (table containing all pairwise comparisons)
lmFitPairwise <- function(norm_data, design) {
#' Perform all pairwise comparisons
#' Approach based on limma manual section 17.4 (version 3.52.4)
fit <- limma::lmFit(norm_data, design)
# Create Contrast Model
fit.groups <- colnames(fit$design)[which(fit$assign == 1)]
combos <- combn(fit.groups,2)
contrasts<-c(paste(combos[1,],combos[2,],sep = "-"),paste(combos[2,],combos[1,],sep = "-")) # format combinations for limma:makeContrasts
cont.matrix <- limma::makeContrasts(contrasts=contrasts,levels=design)
contrast.fit <- limma::contrasts.fit(fit, cont.matrix)
contrast.fit <- limma::eBayes(contrast.fit,trend=TRUE,robust=TRUE)
return(contrast.fit)
}
# Calculate results
res <- lmFitPairwise(norm_data, design)
# Print DE table, without filtering
limma::write.fit(res, adjust = 'BH',
file = "INTERIM.csv",
row.names = FALSE,
quote = TRUE,
sep = ",")
Input Data:

- norm_data (R object containing background-corrected and normalized microarray data created in Step 5)
- design (R object containing the limma study design matrix, indicating the group that each sample belongs to, created in Step 7c above)
Output Data:
- INTERIM.csv (Statistical values from individual probe level DE analysis, including:
- Log2fc between all pairwise comparisons
- T statistic for all pairwise comparison tests
- P value for all pairwise comparison tests)
## Reformat Table for consistency across DE analyses tables within GeneLab ##
# Read in DE table
df_interim <- read.csv("INTERIM.csv")
# Reformat column names
reformat_names <- function(colname, group_name_mapping) {
new_colname <- colname %>%
stringr::str_replace(pattern = "^P.value.adj.condition", replacement = "Adj.p.value_") %>%
stringr::str_replace(pattern = "^P.value.condition", replacement = "P.value_") %>%
stringr::str_replace(pattern = "^Coef.condition", replacement = "Log2fc_") %>% # This is the Log2FC as per: https://rdrr.io/bioc/limma/man/writefit.html
stringr::str_replace(pattern = "^t.condition", replacement = "T.stat_") %>%
stringr::str_replace(pattern = stringr::fixed("Genes.ProbeName"), replacement = "ProbeName") %>%
stringr::str_replace(pattern = stringr::fixed("Genes.count_ENSEMBL_mappings"), replacement = "count_ENSEMBL_mappings") %>%
stringr::str_replace(pattern = stringr::fixed("Genes.ProbeUID"), replacement = "ProbeUID") %>%
stringr::str_replace(pattern = stringr::fixed("Genes.ENSEMBL"), replacement = "ENSEMBL") %>%
stringr::str_replace(pattern = ".condition", replacement = "v")
# remap to group names before make.names was applied
for ( i in seq(nrow(group_name_mapping)) ) {
safe_name <- group_name_mapping[i,]$safe_name
original_name <- group_name_mapping[i,]$original_name
new_colname <- new_colname %>% stringr::str_replace(pattern = stringr::fixed(safe_name), replacement = original_name)
}
return(new_colname)
}
df_interim <- df_interim %>% dplyr::rename_with( reformat_names, group_name_mapping = design_data$mapping )
# Concatenate expression values for each sample
df_interim <- df_interim %>% dplyr::bind_cols(norm_data$E)
## Add Group Wise Statistics ##
# Group mean and standard deviations for normalized expression values are computed and added to the table
unique_groups <- unique(design_data$group$group)
for ( i in seq_along(unique_groups) ) {
current_group <- unique_groups[i]
current_samples <- design_data$group %>%
dplyr::group_by(group) %>%
dplyr::summarize(
samples = sort(unique(sample))
) %>%
dplyr::filter(
group == current_group
) %>%
dplyr::pull()
print(glue::glue("Computing mean and standard deviation for Group {i} of {length(unique_groups)}"))
print(glue::glue("Group: {current_group}"))
print(glue::glue("Samples in Group: '{toString(current_samples)}'"))
df_interim <- df_interim %>%
dplyr::mutate(
"Group.Mean_{current_group}" := rowMeans(dplyr::select(., all_of(current_samples))),
"Group.Stdev_{current_group}" := matrixStats::rowSds(as.matrix(dplyr::select(., all_of(current_samples)))),
) %>%
dplyr::ungroup() %>%
as.data.frame()
}
all_samples <- design_data$group %>% dplyr::pull(sample)
df_interim <- df_interim %>%
dplyr::mutate(
"All.mean" := rowMeans(dplyr::select(., all_of(all_samples))),
"All.stdev" := matrixStats::rowSds(as.matrix(dplyr::select(., all_of(all_samples)))),
) %>%
dplyr::ungroup() %>%
as.data.frame()
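A note on the mutate calls above: the := operator (from rlang, used throughout the tidyverse) supports glue-style interpolation on the left-hand side, which is how the group name is embedded in each new column name.

# e.g. with current_group = "(1g & Flight)" (hypothetical), the call
#   dplyr::mutate("Group.Mean_{current_group}" := ...)
# creates a column literally named "Group.Mean_(1g & Flight)"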
print("Remove extra columns from final table")
# These columns hold manufacturer probe metadata keyed to the probe ID and can be re-linked as needed
colnames_to_remove = c(
"Genes.Row",
"Genes.Col",
"Genes.Start",
"Genes.Sequence",
"Genes.ControlType",
"Genes.ProbeName",
"Genes.GeneName",
"Genes.SystematicName",
"Genes.Description",
"AveExpr" # Replaced by 'All.mean' column
# "Genes.count_ENSEMBL_mappings", Keep this
)
df_interim <- df_interim %>% dplyr::select(-any_of(colnames_to_remove))
## Concatenate annotations for genes (for uniquely mapped probes) ##
### Read in annotation table for the appropriate organism ###
annot <- read.table(
annotation_file_path,
sep = "\t",
header = TRUE,
quote = "",
comment.char = "",
)
# Join annotation table and uniquely mapped data
# Determine appropriate keytype
map_primary_keytypes <- c(
'Caenorhabditis elegans' = 'ENSEMBL',
'Danio rerio' = 'ENSEMBL',
'Drosophila melanogaster' = 'ENSEMBL',
'Rattus norvegicus' = 'ENSEMBL',
'Saccharomyces cerevisiae' = 'ENSEMBL',
'Homo sapiens' = 'ENSEMBL',
'Mus musculus' = 'ENSEMBL',
'Arabidopsis thaliana' = 'TAIR'
)
df_interim <- merge(
annot,
df_interim,
by.x = map_primary_keytypes[[unique(df_rs$organism)]],
by.y = "ENSEMBL",
# ensure all original dge rows are kept.
# If unmatched in the annotation database, then fill missing with NA
all.y = TRUE
)
## Reorder columns before saving to file
ANNOTATIONS_COLUMN_ORDER = c(
map_primary_keytypes[[unique(df_rs$organism)]],
"SYMBOL",
"GENENAME",
"REFSEQ",
"ENTREZID",
"STRING_id",
"GOSLIM_IDS"
)
PROBE_INFO_COLUMN_ORDER = c(
"ProbeUID",
"ProbeName",
"count_ENSEMBL_mappings"
)
SAMPLE_COLUMN_ORDER <- all_samples
generate_prefixed_column_order <- function(subjects, prefixes) {
#' Return a vector of columns based on subject and given prefixes
#' Used for both contrasts and groups column name generation
# Track order of columns
final_order = c()
# For each contrast
for (subject in subjects) {
# Generate column names for each prefix and append to final_order
for (prefix in prefixes) {
final_order <- append(final_order, glue::glue("{prefix}{subject}"))
}
}
return(final_order)
}
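For example (hypothetical contrast names), the helper interleaves the prefixes within each subject:

# generate_prefixed_column_order(subjects = c("AvB", "BvA"),
#                                prefixes = c("Log2fc_", "P.value_"))
# returns: "Log2fc_AvB" "P.value_AvB" "Log2fc_BvA" "P.value_BvA"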
STAT_COLUMNS_ORDER <- generate_prefixed_column_order(
subjects = colnames(design_data$contrasts),
prefixes = c(
"Log2fc_",
"T.stat_",
"P.value_",
"Adj.p.value_"
)
)
ALL_SAMPLE_STATS_COLUMNS_ORDER <- c(
"All.mean",
"All.stdev",
"F",
"F.p.value"
)
GROUP_MEAN_COLUMNS_ORDER <- generate_prefixed_column_order(
subjects = unique(design_data$groups$group),
prefixes = c(
"Group.Mean_"
)
)
GROUP_STDEV_COLUMNS_ORDER <- generate_prefixed_column_order(
subjects = unique(design_data$groups$group),
prefixes = c(
"Group.Stdev_"
)
)
FINAL_COLUMN_ORDER <- c(
ANNOTATIONS_COLUMN_ORDER,
PROBE_INFO_COLUMN_ORDER,
SAMPLE_COLUMN_ORDER,
STAT_COLUMNS_ORDER,
ALL_SAMPLE_STATS_COLUMNS_ORDER,
GROUP_MEAN_COLUMNS_ORDER,
GROUP_STDEV_COLUMNS_ORDER
)
## Assert final column order includes all columns from original table
if (!setequal(FINAL_COLUMN_ORDER, colnames(df_interim))) {
write.csv(FINAL_COLUMN_ORDER, "FINAL_COLUMN_ORDER.csv")
NOT_IN_DF_INTERIM <- paste(setdiff(FINAL_COLUMN_ORDER, colnames(df_interim)), collapse = ":::")
NOT_IN_FINAL_COLUMN_ORDER <- paste(setdiff(colnames(df_interim), FINAL_COLUMN_ORDER), collapse = ":::")
stop(glue::glue("Column reordering attempt resulted in different sets of columns than original. Names unique to 'df_interim': {NOT_IN_FINAL_COLUMN_ORDER}. Names unique to 'FINAL_COLUMN_ORDER': {NOT_IN_DF_INTERIM}."))
}
## Perform reordering
df_interim <- df_interim %>% dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
# Save to file
write.csv(df_interim, file.path(DIR_DGE, "differential_expression_GLmicroarray.csv"), row.names = FALSE)
### Generate and export PCA table for GeneLab visualization plots
## Only use positive expression values; negative values can make up a small portion ( < 0.5% ) of normalized expression values and cannot be log transformed
exp_raw <- log2(norm_data$E) # non-positive values become NaN
exp_raw <- na.omit(exp_raw) # remove probes with NaN values introduced by the log transform
PCA_raw <- prcomp(t(exp_raw), scale = FALSE)
write.csv(PCA_raw$x,
file.path(DIR_DGE, "visualization_PCA_table_GLmicroarray.csv")
)
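If desired, the proportion of variance explained by each principal component can be derived from the same prcomp object; this is an optional check and is not part of the exported table.

# Optional: percent of variance explained per principal component
percentVar <- round(100 * PCA_raw$sdev^2 / sum(PCA_raw$sdev^2), 1)
names(percentVar) <- colnames(PCA_raw$x)
print(percentVar)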
## Generate raw intensity matrix that includes annotations
raw_data_matrix <- background_corrected_data$genes %>%
dplyr::select(ProbeUID, ProbeName) %>%
dplyr::bind_cols(background_corrected_data$E) %>%
dplyr::left_join(unique_probe_ids, by = c("ProbeName" = expected_attribute_name ) ) %>%
dplyr::mutate( count_ENSEMBL_mappings = ifelse(is.na(ENSEMBL), 0, count_ENSEMBL_mappings) )
raw_data_matrix_annotated <- merge(
annot,
raw_data_matrix,
by.x = map_primary_keytypes[[unique(df_rs$organism)]],
by.y = "ENSEMBL",
# ensure all original dge rows are kept.
# If unmatched in the annotation database, then fill missing with NA
all.y = TRUE
)
## Reorder columns to match DGE output
FINAL_COLUMN_ORDER <- c(
ANNOTATIONS_COLUMN_ORDER,
PROBE_INFO_COLUMN_ORDER,
SAMPLE_COLUMN_ORDER
)
raw_data_matrix_annotated <- raw_data_matrix_annotated %>%
dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
write.csv(raw_data_matrix_annotated, file.path(DIR_RAW_DATA, "raw_intensities_GLmicroarray.csv"), row.names = FALSE)
## Generate normalized expression matrix that includes annotations
norm_data_matrix <- norm_data$genes %>%
dplyr::select(ProbeUID, ProbeName) %>%
dplyr::bind_cols(norm_data$E) %>%
dplyr::left_join(unique_probe_ids, by = c("ProbeName" = expected_attribute_name ) ) %>%
dplyr::mutate( count_ENSEMBL_mappings = ifelse(is.na(ENSEMBL), 0, count_ENSEMBL_mappings) )
norm_data_matrix_annotated <- merge(
annot,
norm_data_matrix,
by.x = map_primary_keytypes[[unique(df_rs$organism)]],
by.y = "ENSEMBL",
# ensure all original dge rows are kept.
# If unmatched in the annotation database, then fill missing with NA
all.y = TRUE
)
norm_data_matrix_annotated <- norm_data_matrix_annotated %>%
dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_GLmicroarray.csv"), row.names = FALSE)
Input Data:

- INTERIM.csv (Statistical values from individual probe level DE analysis, output from Step 7d above)
- annotation_file_path (Annotation file URL from the 'genelab_annots_link' column of GL-DPPD-7110_annotations.csv corresponding to the subject organism)
- primary_keytype (Keytype used to join the annotation table and microarray probes, dependent on organism, e.g. Mus musculus uses 'ENSEMBL')
- background_corrected_data (R object containing background-corrected microarray data created in Step 4)
- norm_data (R object containing background-corrected and normalized microarray data created in Step 5)
Output Data:
- differential_expression_GLmicroarray.csv (table containing normalized expression for each sample, group statistics, Limma probe DE results for each pairwise comparison, and gene annotations)
- visualization_PCA_table_GLmicroarray.csv (file used to generate GeneLab PCA plots)
- raw_intensities_GLmicroarray.csv (table containing the background-corrected, unnormalized intensity values for each sample, including gene annotations)
- normalized_expression_GLmicroarray.csv (table containing the background-corrected, normalized intensity values for each sample, including gene annotations)
All steps of the Microarray pipeline are performed using R Markdown, and the completed R Markdown is rendered (via Quarto) as an HTML file (NF_MAAgilent1ch_v*_GLmicroarray.html) that is published in the Open Science Data Repository (OSDR) for the respective dataset.