---
output: rmarkdown::github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include=FALSE, echo=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(fig.path="./man/figures/", message=FALSE, collapse = TRUE, comment="")
# Load SummarizedExperiment
library(SummarizedExperiment)
# Load CaDrA
library(devtools)
load_all()
```
# CaDrA
![build](https://github.com/montilab/cadra/workflows/rcmdcheck/badge.svg)
![Gitter](https://img.shields.io/gitter/room/montilab/cadra)
![GitHub issues](https://img.shields.io/github/issues/montilab/cadra)
![GitHub last commit](https://img.shields.io/github/last-commit/montilab/cadra)
**Ca**ndidate **Dr**ivers **A**nalysis: Multi-Omic Search for Candidate Drivers of Functional Signatures
**CaDrA** is an R package that supports a heuristic search framework aimed at identifying candidate drivers of a molecular phenotype of interest.
The main function takes two inputs:
i) A binary multi-omics dataset, which can be represented as a matrix of binary features or a **SummarizedExperiment** class object where the rows are 1/0 vectors indicating the presence/absence of ‘omics’ features (e.g. somatic mutations, copy number alterations, epigenetic marks, etc.), and the columns are the samples.
ii) A molecular phenotype of interest, which can be represented as a vector of continuous scores (e.g. protein expression, pathway activity, etc.).
Based on these two inputs, **CaDrA** implements a forward and/or backward search algorithm to find a set of features that together are maximally associated with the observed input scores, as measured by one of several scoring functions (*Kolmogorov-Smirnov*, *Conditional Mutual Information*, *Wilcoxon*, or a *custom-defined scoring function*). This makes it useful for finding complementary omics features likely to drive the input molecular phenotype.
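To make the search idea concrete, here is a minimal base-R sketch of a greedy forward search on toy data. It is not CaDrA's implementation: the helpers `ks_score()` and `forward_search()`, the toy feature matrix, and the use of a one-sided `stats::ks.test()` p-value as the score are all assumptions made for illustration.

```r
# Toy illustration only; this is NOT CaDrA's algorithm or API.
n_samples <- 60
input_score <- seq(3, -3, length.out = n_samples)  # sorted high to low, no ties
names(input_score) <- paste0("S", seq_len(n_samples))

# Binary feature matrix: rows = features, columns = samples
FS <- matrix(0, nrow = 5, ncol = n_samples,
             dimnames = list(paste0("F", 1:5), names(input_score)))
FS["F1", 1:8] <- 1                   # hits in the top-scoring samples
FS["F2", 9:16] <- 1                  # complementary hits just below F1
FS["F3", seq(3, 60, by = 9)] <- 1    # scattered hits
FS["F4", seq(5, 60, by = 11)] <- 1   # scattered hits
FS["F5", seq(2, 60, by = 13)] <- 1   # scattered hits

# Hypothetical scoring helper: one-sided KS p-value (smaller = stronger).
# alternative = "less" tests whether the hits have larger scores than the rest.
ks_score <- function(feature, score) {
  hits <- score[feature == 1]
  rest <- score[feature == 0]
  if (length(hits) == 0 || length(rest) == 0) return(1)
  suppressWarnings(stats::ks.test(hits, rest, alternative = "less")$p.value)
}

# Hypothetical greedy forward search: start from the best single feature,
# then keep adding the feature whose union (logical OR) improves the score.
forward_search <- function(FS, score, max_size = 3) {
  p_single <- apply(FS, 1, ks_score, score = score)
  selected <- names(which.min(p_single))
  meta <- FS[selected, ]
  best_p <- min(p_single)
  while (length(selected) < max_size) {
    candidates <- setdiff(rownames(FS), selected)
    p_new <- vapply(candidates,
                    function(f) ks_score(pmax(meta, FS[f, ]), score),
                    numeric(1))
    if (min(p_new) >= best_p) break  # no candidate improves the score
    add <- candidates[which.min(p_new)]
    selected <- c(selected, add)
    meta <- pmax(meta, FS[add, ])
    best_p <- min(p_new)
  }
  list(features = selected, p_value = best_p)
}

res <- forward_search(FS, input_score)
print(res$features)
```

In this toy setup the two planted, complementary features are picked up together, and the search stops once adding a scattered feature would weaken the association.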
Please see our [documentation](https://montilab.github.io/CaDrA/) for additional examples.
# Web Interface
We developed an R Shiny dashboard that allows users to interact with **CaDrA** directly, without needing to install or maintain the package.
See our web portal at [https://cadra.bu.edu/](https://cadra.bu.edu/)
# Installation
- Using `devtools` package
```r
library(devtools)
devtools::install_github("montilab/CaDrA")
```
- Using `BiocManager` package
```r
# Install BiocManager
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Install CaDrA
BiocManager::install("CaDrA")
# Install SummarizedExperiment
BiocManager::install("SummarizedExperiment")
```
# Usage
Here, we use a dataset of somatic mutations and copy number alterations (CNAs) extracted from the TCGA Breast Cancer Dataset. We will query this Feature Set based on an Input Score that measures the per-sample activity of YAP/TAZ (two important regulators of the Hippo pathway). This score represents the projection on the TCGA BrCa dataset of a gene expression signature of YAP/TAZ knockdown derived in breast cancer cell lines. Our question of interest: which combination of genetic features (mutations and copy number alterations) best “explains” the YAP/TAZ activity?
## (i) Load R packages
```r
library(CaDrA)
library(SummarizedExperiment)
```
## (ii) Format and filter data inputs
```{r load.data}
## Read in BRCA GISTIC+Mutation object
utils::data(BRCA_GISTIC_MUT_SIG)
eset_mut_scna <- BRCA_GISTIC_MUT_SIG
## Read in input score
utils::data(TAZYAP_BRCA_ACTIVITY)
input_score <- TAZYAP_BRCA_ACTIVITY
## Samples to keep based on the overlap between the two inputs
overlap <- base::intersect(base::names(input_score), base::colnames(eset_mut_scna))
eset_mut_scna <- eset_mut_scna[, overlap]
input_score <- input_score[overlap]
## Binarize FS to only have 0's and 1's
SummarizedExperiment::assay(eset_mut_scna)[SummarizedExperiment::assay(eset_mut_scna) > 1] <- 1.0
## Pre-filter FS based on occurrence frequency
eset_mut_scna_flt <- CaDrA::prefilter_data(
  FS = eset_mut_scna,
  max_cutoff = 0.6, # max event frequency (60%)
  min_cutoff = 0.03 # min event frequency (3%)
)
```
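For intuition, the binarization and frequency filter above can be sketched on a toy matrix in base R. The helper `filter_by_frequency()` is hypothetical, the toy cutoffs (20% and 60%) are chosen only to suit ten samples, and treating both cutoffs as inclusive is an assumption; `prefilter_data()` may handle the boundaries differently.

```r
# Toy sketch only; `filter_by_frequency` is a hypothetical helper, not part of CaDrA.
mat <- rbind(
  too_rare   = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),  # event in 10% of samples
  kept       = c(2, 1, 0, 0, 1, 0, 0, 0, 1, 0),  # event in 40% of samples
  too_common = c(1, 1, 1, 2, 1, 1, 1, 0, 1, 0)   # event in 80% of samples
)
colnames(mat) <- paste0("S", 1:10)

# Binarize: any value above 1 becomes 1
mat[mat > 1] <- 1

# Keep features whose event frequency lies within [min_cutoff, max_cutoff]
# (inclusive bounds are an assumption of this sketch)
filter_by_frequency <- function(m, min_cutoff, max_cutoff) {
  freq <- rowMeans(m)  # fraction of samples carrying the event
  m[freq >= min_cutoff & freq <= max_cutoff, , drop = FALSE]
}

flt <- filter_by_frequency(mat, min_cutoff = 0.2, max_cutoff = 0.6)
print(rownames(flt))
```

Filtering by frequency removes features that occur in too few samples to be informative, as well as features so common that they carry little discriminating signal.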
## (iii) Run CaDrA
Here, we repeat the candidate search starting from each of the top 'N' features and report the combined results as a heatmap (to summarize the number of times each feature is selected across repeated runs).
**IMPORTANT NOTE**: The legacy function `topn_eval()` is equivalent to the new recommended `candidate_search()` function.
```{r cadra}
topn_res <- CaDrA::candidate_search(
  FS = eset_mut_scna_flt,
  input_score = input_score,
  method = "ks_pval",          # Use Kolmogorov-Smirnov scoring function
  method_alternative = "less", # Use one-sided hypothesis testing
  weights = NULL,              # If weights are provided, perform a weighted-KS test
  search_method = "both",      # Apply both forward and backward search
  top_N = 7,                   # Evaluate top 7 starting points for each search
  max_size = 7,                # Maximum size a meta-feature matrix can extend to
  do_plot = FALSE,             # Plot after finding the best features
  best_score_only = FALSE      # Return all results from the search
)
```
## (iv) Visualize the results
### Meta-feature plot
This plot produces 3 graphics stacked on top of each other:
1. A density diagram of observed input scores sorted from highest to lowest
2. A tile plot of the top meta-features that are associated with a molecular phenotype of interest (e.g. input_score)
3. A KS enrichment plot of the meta-feature set (this corresponds to the logical OR of the features)
```{r visualize.best}
## Fetch the meta-feature set with the best score across the top N feature searches
topn_best_meta <- CaDrA::topn_best(topn_res)
# Visualize the best results with the meta-feature plot
CaDrA::meta_plot(topn_best_list = topn_best_meta, input_score_label = "YAP/TAZ Activity")
```
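As a rough illustration of what "the logical OR of the features" means, the following base-R sketch builds a meta-feature from three toy features and computes a running difference between the cumulative fractions of hits and non-hits, which is the kind of quantity a KS statistic is based on. The data and variable names are hypothetical, and this does not reproduce how `meta_plot()` draws its panels.

```r
# Toy sketch only; data and names are hypothetical.
input_score <- c(S1 = 2.1, S2 = 1.4, S3 = 0.9, S4 = 0.2, S5 = -0.5, S6 = -1.3)
features <- rbind(
  F1 = c(1, 0, 0, 0, 0, 0),
  F2 = c(0, 1, 0, 0, 0, 0),
  F3 = c(0, 1, 1, 0, 0, 0)
)
colnames(features) <- names(input_score)

# Order samples from highest to lowest input score
ord <- order(input_score, decreasing = TRUE)
features <- features[, ord, drop = FALSE]

# Meta-feature: a sample is a hit if ANY member feature is present (logical OR)
meta_feature <- as.integer(colSums(features) > 0)
names(meta_feature) <- colnames(features)

# Running difference between the cumulative fraction of hits and of non-hits,
# walking down the ranked sample list
n_hit <- sum(meta_feature == 1)
n_miss <- sum(meta_feature == 0)
running <- cumsum(ifelse(meta_feature == 1, 1 / n_hit, -1 / n_miss))

print(meta_feature)
print(round(running, 2))
```

Here all hits fall among the highest-scoring samples, so the running difference climbs to its maximum before any non-hit is encountered.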
### Top-N plot
This plot is a heatmap showing how the meta-features overlap when `candidate_search` is repeated from each of the top N starting features.
```{r summarize}
# Evaluate results across top N features you started from
CaDrA::topn_plot(topn_res)
```
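The overlap being summarized can be thought of as a feature-by-run membership table. The sketch below tallies, in base R, how often each feature is selected across repeated runs; the feature sets in `runs` are made up for illustration and do not come from `candidate_search()` output, whose structure is not assumed here.

```r
# Toy sketch only; the feature sets below are hypothetical.
runs <- list(
  start_A = c("A", "C", "D"),
  start_B = c("B", "C"),
  start_C = c("C", "A", "D")
)

all_features <- sort(unique(unlist(runs)))

# Membership matrix: rows = features, columns = runs, 1 = feature selected
membership <- vapply(
  runs,
  function(fs) as.integer(all_features %in% fs),
  integer(length(all_features))
)
rownames(membership) <- all_features

# Number of runs in which each feature was selected
selection_count <- rowSums(membership)

print(membership)
print(selection_count)
```

Features that recur across many starting points (here `C`) are the ones a heatmap of this table would highlight.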
# Additional Guides
- [How to run CaDrA within a Docker environment](https://montilab.github.io/CaDrA/articles/docker.html)
# Acknowledgements
This project is funded in part by the [NIH/NIDCR](https://www.nidcr.nih.gov/) (3R01DE030350-01A1S1, R01DE031831), [Find the Cause Breast Cancer Foundation](https://findthecausebcf.org), and [NIH/NIA](https://www.nia.nih.gov/) (UH3 AG064704).