forked from Bioconductor/BioC2016Introduction
-
Notifications
You must be signed in to change notification settings - Fork 0
/
L4-bioc-annotation.Rmd
412 lines (337 loc) · 17.1 KB
/
L4-bioc-annotation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
---
title: "Lab 1.4: Annotation Resources"
output:
BiocStyle::html_document:
toc: true
vignette: >
% \VignetteIndexEntry{Lab 4: Annotation Resources}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
```{r setup, echo=FALSE, warning=FALSE}
options(max.print=1000, width=100)
knitr::opts_chunk$set(cache=TRUE)
suppressPackageStartupMessages({
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicRanges)
library(biomaRt)
library(rtracklayer)
library(Gviz)
library(AnnotationHub)
})
```
Authors: Valerie Obenchain (<a
href="mailto:[email protected]">[email protected]</a>),
Lori Shepherd (<a
href="mailto:[email protected]">[email protected]</a>),
Martin Morgan (<a
href="mailto:[email protected]">[email protected]</a>)
<br />
Date: 25 June, 2016<br />
# Gene annotation
## Data packages
Organism-level ('org') packages contain mappings between a central
identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank
or Uniprot accession number, RefSeq id, etc.). The name of an org
package is always of the form `org.<Sp>.<id>.db`
(e.g. [org.Sc.sgd.db][]) where `<Sp>` is a 2-letter abbreviation of
the organism (e.g. `Sc` for *Saccharomyces cerevisiae*) and `<id>` is
an abbreviation (in lower-case) describing the type of central
identifier (e.g. `sgd` for gene identifiers assigned by the
*Saccharomyces* Genome Database, or `eg` for Entrez gene ids). The
"How to use the '.db' annotation packages" vignette in the
[AnnotationDbi][] package (org packages are only one type of ".db"
annotation packages) is a key reference. The '.db' and most other
Bioconductor annotation packages are updated every 6 months.
Annotation packages usually contain an object named after the package
itself. These objects are collectively called `AnnotationDb` objects,
with more specific classes named `OrgDb`, `ChipDb` or `TranscriptDb`
objects. Methods that can be applied to these objects include
`cols()`, `keys()`, `keytypes()` and `select()`. Common operations
for retrieving annotations are summarized in the table.
| Category | Function | Description |
|------------|---------------------------------------|------------------------------------------------------------------|
| Discover | `columns()` | List the kinds of columns that can be returned |
| | `keytypes()` | List columns that can be used as keys |
| | `keys()` | List values that can be expected for a given keytype |
| | `select()` | Retrieve annotations matching `keys`, `keytype` and `columns` |
| Manipulate | `setdiff()`, `union()`, `intersect()` | Operations on sets |
| | `duplicated()`, `unique()` | Mark or remove duplicates |
| | `%in%`, `match()` | Find matches |
| | `any()`, `all()` | Are any `TRUE`? Are all? |
| | `merge()` | Combine two different \Robject{data.frames} based on shared keys |
| `GRanges*` | `transcripts()`, `exons()`, `cds()` | Features (transcripts, exons, coding sequence) as `GRanges`. |
| | `transcriptsBy()` , `exonsBy()` | Features group by gene, transcript, etc., as `GRangesList`. |
| | `cdsBy()` | |
## Internet resources
A short summary of select Bioconductor packages enabling web-based
queries is in following Table.
| Package | Description |
|-----------------------------------------------------|-------------------------------------------|
| [AnnotationHub][] | Ensembl, Encode, dbSNP, UCSC data objects |
| [biomaRt](http://biomart.org) | Ensembl and other annotations |
| [PSICQUIC](https://code.google.com/p/psicquic) | Protein interactions |
| [uniprot.ws](http://uniprot.org) | Protein annotations |
| [KEGGREST](http://www.genome.jp/kegg) | KEGG pathways |
| [SRAdb](http://www.ncbi.nlm.nih.gov/sra) | Sequencing experiments. |
| [rtracklayer](http://genome.ucsc.edu) | genome tracks. |
| [GEOquery](http://www.ncbi.nlm.nih.gov/geo/) | Array and other data |
| [ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) | Array and other data |
*Using biomaRt*
The [biomaRt][] package offers access to the online
[biomart](http://www.biomart.org) resource. this consists of several
data base resources, referred to as 'marts'. Each mart allows access
to multiple data sets; the [biomaRt][] package provides methods for
mart and data set discovery, and a standard method `getBM()` to
retrieve data.
## Exercises
**Exercise**: This exercise illustrates basic use of the `select'
interface to annotation packages.
1. What is the name of the org package for *Homo sapiens*? Load it.
Display the `OrgDb` object for the [org.Hs.eg.db][] package. Use
the `columns()` method to discover which sorts of annotations can
be extracted from it.
2. Use the `keys()` method to extract ENSEMBL identifiers and then
pass those keys in to the `select()` method in such a way that you
extract the SYMBOL (gene symbol) and GENENAME information for
each. Use the following ENSEMBL ids.
```{r select-setup}
ensids <- c("ENSG00000130720", "ENSG00000103257", "ENSG00000156414",
"ENSG00000144644", "ENSG00000159307", "ENSG00000144485")
```
**Solution** The `OrgDb` object is named `org.Hs.eg.db`.
```{r select}
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
columns(org.Hs.eg.db)
cols <- c("SYMBOL", "GENENAME")
select(org.Hs.eg.db, keys=ensids, columns=cols, keytype="ENSEMBL")
```
**Exercise**
<font color="red">Internet access required for this exercise</font>
1. Load the [biomaRt][] package and list the available marts. Choose
the *ensembl* mart and list the datasets for that mart. Set up a
mart to use the *ensembl* mart and the *hsapiens gene ensembl*
dataset.
2. A [biomaRt][] dataset can be accessed via `getBM()`. In addition to
the mart to be accessed, this function takes filters and attributes
as arguments. Use `filterOptions()` and `listAttributes()` to
discover values for these arguments. Call `getBM()` using filters
and attributes of your choosing.
**Solution**
```{r biomaRt1, eval=FALSE, results="hide"}
## NEEDS INTERNET ACCESS !!
library(biomaRt)
head(listMarts(), 3) ## list the marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <- ## fully specified mart
useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listFilters(ensembl), 3) ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3) ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")
## assemble and query the mart
res <- getBM(attributes = myAttributes, filters = myFilter,
values = myValues, mart = ensembl)
```
**Exercise**
As an optional exercise to be completed after Tuesday's lab, annotate
the genes that are differentially expressed in the DESeq2 laboratory,
e.g., find the *GENENAME* associated with the five most differentially
expressed genes. Do these make biological sense? Can you `merge()` the
annotation results with the `top table' results to provide a
statistically and biologically informative summary?
# Genome annotation
There are a diversity of packages and classes available for
representing large genomes. Several include:
- 'TxDb.*' For transcript and other genome / coordinate annotation.
- [BSgenome][] For whole-genome representation. See
`available.genomes()` for pre-packaged genomes, and the vignette
'How to forge a BSgenome data package' in the
- [Homo.sapiens][] For integrating 'TxDb*' and 'org.*' packages.
- 'SNPlocs.*' For model organism SNP locations derived from dbSNP.
- `FaFile()` ([Rsamtools][]) for accessing indexed FASTA files.
- [ensemblVEP][] Variant effect scores.
## Transcript annotation packages
Genome-centric packages are very useful for annotations involving
genomic coordinates. It is straight-forward, for instance, to discover
the coordinates of coding sequences in regions of interest, and from
these retrieve corresponding DNA or protein coding sequences. Other
examples of the types of operations that are easy to perform with
genome-centric annotations include defining regions of interest for
counting aligned reads in RNA-seq experiments and retrieving DNA
sequences underlying regions of interest in ChIP-seq analysis, e.g.,
for motif characterization.
## _rtracklayer_
The [rtracklayer][] package allows us to query the UCSC genome
browser, as well as providing `import()` and `export()` functions for
common annotation file formats like GFF, GTF, and BED. The exercise
below illustrates some of the functionality of [rtracklayer][].
## Exercises
**Exercise**
This exercise uses annotation resources to go from a gene symbol
'BRCA1' through to the genomic coordinates of each transcript
associated with the gene, and finally to the DNA sequences of the
transcripts.
1. Use the [org.Hs.eg.db][] package to map from the gene symbol
'BRCA1' to its Entrez identifier. Do this using the `select`
command.
2. Use the [TxDb.Hsapiens.UCSC.hg19.knownGene][] package to retrieve
the transcript names (`TXNAME`) corresponding to the BRCA1 Entrez
identifier. (The 'org\*' packages are based on information from
NCBI, where Entrez identifiers are labeled ENTREZID; the 'TxDb*'
package we are using is from UCSC, where Entrez identifiers are
labelled GENEID).
3. Use the `cdsBy()` function to retrieve the genomic coordinates of
all coding sequences grouped by transcript, and select the
transcripts corresponding to the identifiers we're interested
in. The coding sequences are returned as an `GRangesList`, where
each element of the list is a `GRanges` object representing the
exons in the coding sequence. As a sanity check, ensure that the
sum of the widths of the exons in each coding sequence is evenly
divisble by 3 (the R 'modulus' operator `%%` returns the remainder
of the division of one number by another, and might be helpful in
this case).
4. Visualize the transcripts in genomic coordinates using the [Gviz][]
package to construct a `GeneRegionTrack`, and plotting it using
`plotTracks()`.
5. Use the [Bsgenome.Hsapiens.UCSC.hg19][] package and
`extractTranscriptSeqs()` function to extract the DNA sequence of
each transcript.
**Solution**
Retrieve the Entrez identifier corresponding to the BRCA1 gene symbol
```{r symbol-to-entrez}
library(org.Hs.eg.db)
eid <- select(org.Hs.eg.db, "BRCA1", "ENTREZID", "SYMBOL")[["ENTREZID"]]
```
Map from Entrez gene identifier to transcript name
```{r entrez-to-tx, message=FALSE}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txid <- select(txdb, eid, "TXNAME", "GENEID")[["TXNAME"]]
```
Retrieve all coding sequences grouped by transcript, and select those
matching the transcript ids of interest, verifying that each coding
sequence width is a multiple of 3
```{r tx-to-cds-coords}
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
brca1cds <- cds[names(cds) %in% txid]
class(brca1cds)
length(brca1cds)
brca1cds[[1]] # exons in cds
cdswidth <- width(brca1cds) # width of each exon
all((sum(cdswidth) %% 3) == 0) # sum within cds, modulus 3
```
Visualize the BRCA1 transcirpts using [Gviz][] (this package has an
excellent vignette, `vignette("Gviz")`)
```{r Gviz, message=FALSE}
library(Gviz)
grt <- GeneRegionTrack(txdb)
plotTracks(list(GenomeAxisTrack(), grt), chromosome="chr17",
from=min(start(unlist(brca1cds))),
to=max(end(unlist(brca1cds))))
```
Extract the coding sequences of each transcript
```{r cds-to-seq}
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
tx_seq <- extractTranscriptSeqs(genome, brca1cds)
tx_seq
```
Intron coordinates can be identified by first calculating the range of
the genome (from the start of the first exon to the end of the last
exon) covered by each transcript, and then taking the (algebraic) set
difference between this and the genomic coordinates covered by each
exon
```{r introns}
introns <- psetdiff(unlist(range(brca1cds)), brca1cds)
```
Retrieve the intronic sequences with `getSeq()` (these are *not*
assembled, the way that `extractTranscriptSeqs()` assembles exon
sequences into mature transcripts); note that introns start and end
with the appropriate acceptor and donor site sequences.
```{r intron-seqs}
seq <- getSeq(genome, introns)
names(seq)
seq[["uc010whl.2"]] # 21 introns
```
**Exercise**
<font color="red">Internet access required for this exercise</font>
Here we use [rtracklayer][] to retrieve estrogen receptor binding
sites identified across cell lines in the ENCODE project. We focus on
binding sites in the vicinity of a particularly interesting region of
interest.
1. Define our region of interest by creating a `GRanges` instance with
appropriate genomic coordinates. Our region corresponds to 10Mb up-
and down-stream of a particular gene.
2. Create a session for the UCSC genome browser
3. Query the UCSC genome browser for ENCODE estrogen receptor
ERalpha<sub>a</sub> transcription marks; identifying the
appropriate track, table, and transcription factor requires
biological knowledge and detective work.
4. Visualize the location of the binding sites and their scores;
annotate the mid-point of the region of interest.
**Solution**
Define the region of interest
```{r rtracklayer-roi}
library(GenomicRanges)
roi <- GRanges("chr10", IRanges(92106877, 112106876, names="ENSG00000099194"))
```
Create a session
```{r rtracklayer-session, eval=FALSE}
library(rtracklayer)
session <- browserSession()
```
Query the UCSC for a particular track, table, and transcription
factor, in our region of interest
```{r rtracklayer-marks, eval=FALSE}
trackName <- "wgEncodeRegTfbsClusteredV2"
tableName <- "wgEncodeRegTfbsClusteredV2"
trFactor <- "ERalpha_a"
ucscTable <- getTable(ucscTableQuery(session, track=trackName,
range=roi, table=tableName, name=trFactor))
```
Visualize the result
```{r rtracklayer-plot, fig.height=3, eval=FALSE}
plot(score ~ chromStart, ucscTable, pch="+")
abline(v=start(roi) + (end(roi) - start(roi) + 1) / 2, col="blue")
```
# AnnotationHub
[AnnotationHub][] is a data base of large-scale whole-genome
resources, e.g., regulatory elements from the Roadmap Epigenomics
project, Ensembl GTF and FASTA files for model and other organisms,
and the NHLBI [grasp2db][] data base of GWAS results. There are many interesting ways in which these resources can be used. Examples include
- Easily access and import Roadmap Epigenomics files.
- 'liftOver' genomic range-based annotations from one coordinate
system (e.g, hg19) to another (e.g., GRCh 38);
- Create TranscriptDb and BSgenome-style annotation resources 'on the
fly' for a diverse set of organisms.
- Programmatically access the genomic coordiantes of clinically
relevant variants cataloged in dbSNP.
Unfortunately, [AnnotationHub][] makes extensive use of internet
resources and so we will not pursue it in this course; see the
vignettes that come with the pacakge, for instance
[_AnnotationHub_ HOW-TOs](http://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/AnnotationHub-HOWTO.html).
# Resources
Acknowledgements
The material for this lab was taken from a presentation given by Martin
Morgan at CSAMA 2015.
[AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi
[AnnotationHub]: http://bioconductor.org/packages/AnnotationHub
[BSgenome]: http://bioconductor.org/packages/release/BSgenome
[Bsgenome.Hsapiens.UCSC.hg19]: http://bioconductor.org/packages/Bsgenome.Hsapiens.UCSC.hg19
[grasp2db]: http://bioconductor.org/packages/release/grasp2db
[Gviz]: http://bioconductor.org/packages/release/Gviz
[Homo.sapiens]: http://bioconductor.org/packages/release/Homo.sapiens
[Rsamtools]: http://bioconductor.org/packages/Rsamtools
[TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[biomaRt]: http://bioconductor.org/packages/biomaRt
[org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db
[org.Sc.sgd.db]: http://bioconductor.org/packages/org.Sc.sgd.db
[rtracklayer]: http://bioconductor.org/packages/release/rtracklayer