Skip to content

Latest commit

 

History

History
182 lines (132 loc) · 12.1 KB

Tutorial_1.md

File metadata and controls

182 lines (132 loc) · 12.1 KB

Tutorial I, how to create your own coExpNetwork

With tutorial I we saw how to use co-expression networks. We know they are created from gene expression profiles with a decent number of samples (ideally more than 100). They are compound of modules. Each gene belongs to a module and we can get a module membership for the gene to that module (the correlation of its gene expression with the 1st principal component, the eigengene). Finally, we know modules are annotated with curated databases, including GO, KEGG and REACTOME. Besides, and specifically for brain tissues, we can annotate modules with signal enrichment from specific cell type markers (microglia, astrocyte, oligodendrocyte and obviously neuron).

With tutorial II we got to know that gene expression usually suffer from technical biases (batch effects) we have to alleviate (hopefully remove). We used ComBat for that (as implemented in the Bioconductor's SVA package). We checked as well whether the dat may contain outliers. We learned how to model hidden biases in the data (whether technical or biological) and a way to detect those we may want to remove from data. Finally, we saw how to employ simple linear regression to model what is left when (1) biological covariates and (2) hidden biases contribution to gene expression is removed. What is left is precissely what we want to use to model a co-expression network, assuming none of the biological covariates (e.g. time of death or gender) is of interest.

Step 1: Building your network

In order to build a new co-expression network we need a gene expression profiling for a given condition, or more than one condition. We may build networks of single conditions, e.g. control tissue like GTEx Whole Blood or Frontal Cortex. All samples are from healthy donors and they belong to a single tissue. But we may build networks from cases and controls. In this case we would have healthy and disease samples together and we build networks to study differences between cases and controls at the network level, as in ROS/MAP.

expr.data  = readRDS("~/tmp/rosmap_residuals.rds")
#We assume there is a /home/juanbot/tmp/results/ folder to
#store all results generated from there
setwd("~/tmp/")
cat("For ROS/MAP samples that we generated from tutorial II we have ",nrow(expr.data),
    " samples and ",ncol(expr.data)," genes\n")

This would be how we call the method to create the network

#We create a network with WGCNA and then process it with k-means using 20 iterations
#The network will be signed and the beta parameter will be automatically generated
#Everything will be stored at ~/tmp/results
net = CoExpNets::getDownstreamNetwork(tissue="MyRosMap",
		n.iterations=20,
		net.type = "signed",
		debug=F,
		expr.data=expr.data,
		job.path="results/")

Yes, this is a lot of debug information to keep you on track of what happend. Let us see which files do we have now available. First of all, we have the beta study for the ROSMAP residuals we generated, stored in the results folder under the name betastudyMyRosMap.signed.pdf and it is represented by figure below.

Smoothing parameter study

In that figure you see the beta study generated by WGCNA. The plot on the left shows how the different beta values (x-axis) approximate the linear regression model to a scale free topology (SFT) network with R-squared values. The horizontal red line marks the values of beta over which the network satisfies SFT property. We choose beta=4. The plot on the right shows the average connectivity of genes. You see as the beta increases, the genes are less connected which is good.

Then we have MyRosMap.9_mod_size.pdf which is the size of each module in number of genes before applying k-means and appears depicted in figure.

Figure for module size before applying k-means

you can see that turquoise has a remarkably high size in comparison with the other which is not good at all. Also you can see how the module eigengenes relate to each other, again before applying k-means, at MyRosMap.4_Eigengenes_clustering.pdf in figure below.

Figure for eigengenes clustering before applying k-means

You see that turquoise is far apart from the rest, possibly because of its size but also that, for example, tan and blue are similar and also black and midnightblue, and so on. We can again see, after applying k-means, what is the module size in terms of genes at the file netMyRosMap.4.it.20.rds_mod_size.pdf and figure below.

Figure for module size after applying k-means

Now we see that the relative size of the turquoise module seems much more reasonable than before. In consequence, the dendrogram for the module eigengenes has changed too, in file netFCortex.9.it.20.rds_eigengenes_clustering and figure below.

Figure for eigengenes clustering after applying k-means

Step 2, annotating the network

Now that we are happy with our network, we need to generate the corresponding annotation. As we already know, there are two sources of annotation we consider: functional annotation DBs (FADB) and cell type marker gene set DBs (CTDB). In order to generate a FADB annotation we do the following, assuming we are going to work with the network we just crated in the results folder

net.name="results/netMyRosMap.4.it.20.rds"
net = readRDS(net.name)
names(net$moduleColors) =  gsub("\\.[0-9]+","",names(net$moduleColors))
saveRDS(net,net.name)
go = CoExpNets::getGProfilerOnNet(net.file=net.name,
                             exclude.iea=F,
			out.file=paste0(net.name,"_gprof.csv"))

This makes use of the gProfileR package, cited above, to get the enrichment of each network's module. The parameter exclude.iea refers to whether we should use IEA (Inferred Electronic Annotations) which are computationally predicted annotations and may be not as reliable as manually curated one. Set it to FALSE if you prefer accurate rathern than abundant annotation. There is another parameter, correction.method="gSCS", useful to indicate how gProfileR should correct for multiple testing, we use the ad hoc test defined by the authors for such purpose. Finally, the parameter filter=c("GO","KEGG","REAC") indicated we are using these three annotation databases.

As we see from the results, we get 3681 different enrichment terms in total. So it is rather complex and rich enrichment. Also, a positive sign about the biological soundness of our network modules. If we have a look at how the enrichment is spread scross the network modules we have the turquoise module as the top most enriched.

e = read.csv("results/netMyRosMap.4.it.20.rds_gprof.csv",stringsAsFactors=F)
sort(table(e$query.number),decreasing=T)

So what is the turquoise module about? Where is the action happening of the processes in which genes at turquoise are involved?

turquoise = e[e$query.number == "turquoise" & e$domain == "CC",]
turquoise[order(turquoise$p.value)[1:10],c("term.name","domain","p.value"),]

This module is focused at the mitochondria probably. And now, what are the KEGG pathways for which genes in turquoise are enriched?

turquoise = e[e$query.number == "turquoise" & e$domain == "keg" ,]
turquoise[order(turquoise$p.value)[1:10],c("term.name","domain","p.value"),]

That is interesting. We are getting three brain related diseases, one of them the disease which is the target of ROS/MAP study, Alzheimer's.

And now is the turn for CTDB annotations. We generate them, for the newly created ROS/MAP network, as follows

net.name="netMyRosMap.4.it.20.rds"
library(CoExpNets)
library(WGCNA)
CoExpNets::initDb()
write.csv(CoExpNets::cellTypeByModule(return.processed=F,
          tissue="MyRosMap",
					which.one="new",
					plot.file=paste0("results/",net.name,".celltype.pdf"),
					net.in=paste0("results/",net.name),
					legend=paste0("ROS/MAP cell type signals")),
					paste0(net.name,".celltype.csv"))

This generates three different files. The file resuts/netMyRosMap.4.it.20.rds.USER_terms.csv is a list of enrichments generated by WGCNA alone which is useful for assigning cell types and brain areas to each module. Then the file results/netMyRosMap.4.it.20.rds.celltype.csv is the proper CTDB enrichment we generate. Its heat-map based pdf representation we find at the file results/netMyRosMap.4.it.20.rds.celltype.csv.pdf. And it looks like this heat-map above

Cell type signals for our ROS/MAP Cortex network

In which we clearly see the following interesting things

  • The grey60 module is enriched for multiple oligodendrocyte marker gene sets

    • More signals from different sources suggest more certainty in that cell type
  • Turquoise and royalblue modules are clearly enriched for neuron. As they are relatively different in terms of neuron signals this suggest they gather genes referring to different neuron types.

  • The lightgreen module is astrocytical

  • Black and salmon are clearly microglial and black recapitulates endothelial cell type possibly gathering genes working at blood vessels within the brain.

Step 3, correlation of covariates with network modules

Besides functional annotation and cell type specificity we know from tutorial I that we can correlate covariates with modules to check whether there is any relationship with the module's gene expression and biological covariates, e.g. age. So we do

#We whole network has all samples and some modules will capture differences between
#genes in cases and controls, lets see the whole picture
library(WGCNA)
CoExpROSMAP::initDb()
covs = CoExpROSMAP::getCovariates(tissue="allsamples")
expr.data = getExprDataFromTissue(tissue="~/tmp/rosmap_residuals.rds",which.one="new")
net = getNetworkFromTissue(tissue="results/netMyRosMap.4.it.20.rds",which.one="new")
#common = intersect(rownames(covs),rownames(expr.data))
covs = covs[match(rownames(expr.data),rownames(covs)),]
stopifnot(identical(rownames(expr.data),rownames(covs)))
CoExpNets::corWithCatTraits(which.one="new",tissue="results/netMyRosMap.4.it.20.rds",covs=covs)

As we expected, there is no module's eigengene correlated with those biological covariates we corrected. However, it is of interest that purple, red and gree modules correlated with cognitive decline, a manually assessed score per sample that was not used in creating the network. If we check whether these modules are enriched for any cell type, the answer is now. Do it by yourself. But what is the functional annotation for these modules?

e = read.csv("results/netMyRosMap.4.it.20.rds_gprof.csv",stringsAsFactors=F)
sort(table(e$query.number),decreasing=T)
green = e[e$query.number == "green" & e$domain == "keg" ,]
green[order(green$p.value),c("term.name","domain","p.value"),]
green = e[e$query.number == "green" & e$domain == "BP" ,]
green[order(green$p.value),c("term.name","domain","p.value"),]
green = e[e$query.number == "green" & e$domain == "CC" ,]
green[order(green$p.value),c("term.name","domain","p.value"),]

So for green, if we focus on KEGG, there are a number of signals that are related with mainly disease/cancer related pathways. Is this suggesting that cognitive decline is related in some way with cancer of some subtypes of it? By looking at BP and CC, the genes implied are those implied in transcription from DNA to mRNA.

What is the purple module saying to us in relation to cognitive decline?

purple = e[e$query.number == "purple" & e$domain == "keg" ,]
purple[order(purple$p.value),c("term.name","domain","p.value"),]
purple = e[e$query.number == "purple" & e$domain == "BP" ,]
purple[order(purple$p.value),c("term.name","domain","p.value"),]
purple = e[e$query.number == "purple" & e$domain == "CC" ,]
purple[order(purple$p.value),c("term.name","domain","p.value"),]

Not surprisingly, it is telling us the same story.