---
title: Barcode visualizations using R
subtitle: Coloring ATCG-sequences in knitr/slidify reports
author: Markus Skyttner
job:
framework: io2012
highlighter: highlight.js
hitheme: tomorrow
widgets:
mode: selfcontained
---

Reference samples

The samples come from two specimens kept at the Swedish Museum of Natural History: a European Roller (Cat. id. NRM 20106015, shown in the figure to the left) and a [Eurasian Woodcock](http://naturarv.se/?param=dnakey&catalogNumber=20046331) (Cat. id. NRM 20046331, shown in the figure to the right). Some DNA data has been sequenced from both specimens.

| European Roller | Eurasian Woodcock |
| --- | --- |
| [photo of the specimen] | [photo of the specimen] |
| This European Roller flew astray; it is from Ramsberg, north of Lindesberg. | This Eurasian Woodcock originates from the Fiby lake outside Uppsala. |

The data behind a DNA barcode visualization

DNA sequence data from a European Roller can be expressed like this in text format:

CTAATTTTTGGGGCCTGAGCGGGCATGGTTGGAACCGCCCTCAGCCTGCTCATTCGCGCAGAA
CTCGGTCAACCAGGAACCCTACTAGGAGACGACCAGATCTACAACGTAATCGTCACTGCCCAT
GCCTTCGTAATAATCTTCTTTATAGTCATACCAATCATAATCGGGGGCTTTGGAAACTGACTA
GTCCCCCTTATAATCGGCGCCCCAGACATAGCGTTCCCCCGTATAAATAACATAAGCTTCTGA
CTACTCCCCCCATCCTTCCTTCTCCTACTAGCCTCCTCCACCGTAGAAGCTGGTGCTGGTACA
GGGTGAACAGTCTACCCCCCTCTAGCTGGTAATCTGGCCCACGCCGGAGCTTCTGTAGACCTA
GCCATCTTCTCCCTACACCTCGCTGGAGTCTCATCAATCCTAGGTGCAATCAACTTCATCACT
ACTGCCATTAACATAAAGCCCCCGGCCCTATCTCAATACCAAACCCCCCTATTCGTATGATCC
GTACTAATCACAGCCGTCCTACTATTACTTTCACTGCCCGTCCTCGCTGCCGGCATTACAATG
CTCCTCACAGACCGAAACCTAAACACCACATTCTTTGACCCAGCCGGAGGAGGAGACCCAGTC
CTATACCAACACCTATTC

The problem with this presentation format is that humans process it very slowly. Reading the symbols one by one is sequential processing, which heavily taxes our working memory. A visual encoding could instead engage pre-attentive processing and speed up our understanding of this abstract data.


Traditional barcode visualization

DNA sequence data is therefore traditionally displayed as a row of thin bars in four different colors, one for each of the symbols A, C, T and G. This gives the display an illusory resemblance to a product barcode.

For the two sample sequences, the traditional barcode depiction looks like this:

[Figure: traditional barcode plots of the two sample sequences]
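A display of this kind can be sketched in base R roughly as follows. This is a minimal sketch and not necessarily the code behind the figures here: the function names `symbol_colors` and `plot_barcode` are made up for illustration, the colors follow the legend used in this document, and the grey used for `N` is an assumption.

```r
# Colors follow the legend in this document; grey for N is an assumption.
barcode_colors <- c(A = "red", C = "blue", T = "green", G = "yellow", N = "grey")

# Translate a sequence string into one color per symbol.
# Any symbol that is not A, C, T or G is treated as N.
symbol_colors <- function(dna, colors = barcode_colors) {
  symbols <- strsplit(toupper(dna), "", fixed = TRUE)[[1]]
  symbols[!symbols %in% names(colors)] <- "N"
  unname(colors[symbols])
}

# Draw one thin vertical bar per symbol, from left to right.
plot_barcode <- function(dna, colors = barcode_colors) {
  bar_cols <- symbol_colors(dna, colors)
  n <- length(bar_cols)
  plot.new()
  plot.window(xlim = c(0, n), ylim = c(0, 1), xaxs = "i", yaxs = "i")
  rect(xleft = seq_len(n) - 1, ybottom = 0,
       xright = seq_len(n), ytop = 1,
       col = bar_cols, border = NA)
  invisible(bar_cols)
}

# The first 20 symbols of the European Roller sequence shown earlier.
plot_barcode("CTAATTTTTGGGGCCTGAGC")
```

Each bar gets a width of one unit, so the plotting device decides how many pixels a symbol receives.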

This presentation format can compress a lot of data into one line, provided there are enough pixels available. However, it sacrifices the clear display of individual symbols, because the bars are so thin that they can barely be distinguished. And what happens when the sequence is longer than the available pixel width?

Can you think of alternative ways to display the same data that fix some of the problems above?












(red) = A
(blue) = C
(green) = T
(yellow) = G
(unknown) = N


Looking at 3rd position symbols only

This is a classic barcode illustration of the symbols in the 3rd position of each triplet only. It so happens that many of the differences between sequences occur in this 3rd position.
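Splitting a sequence into its 3rd-position symbols and the remaining symbols can be sketched as below. The function names are hypothetical, and the sketch assumes that counting starts at the first symbol of the sequence, so the 3rd positions are 3, 6, 9 and so on.

```r
# Split a sequence string into single symbols.
split_symbols <- function(dna) {
  strsplit(dna, "", fixed = TRUE)[[1]]
}

# Keep the symbols at positions 3, 6, 9, ...
# (assumes counting starts at the first symbol).
third_positions <- function(dna) {
  symbols <- split_symbols(dna)
  paste(symbols[seq_along(symbols) %% 3 == 0], collapse = "")
}

# Keep the symbols at all other positions (1, 2, 4, 5, ...).
other_positions <- function(dna) {
  symbols <- split_symbols(dna)
  paste(symbols[seq_along(symbols) %% 3 != 0], collapse = "")
}

# The first 12 symbols of the European Roller sequence shown earlier.
roller_start <- "CTAATTTTTGGG"
third_positions(roller_start)  # "ATTG"
other_positions(roller_start)  # "CTATTTGG"
```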

The illustration below emphasizes the big-picture overview, but it makes it hard to spot exactly where individual differences occur:

[Figure: barcode plots of the 3rd-position symbols of the two sample sequences]


European Roller:




Eurasian Woodcock:




(red) = A
(blue) = C
(green) = T
(yellow) = G

As you can see, differences are hard to spot in a multi-line display, because the positions are still not easily aligned. Comparisons therefore become slow and cognitively demanding. How can we support that task in a better way?


Using a pairwise multi-line row-wrapped display

Light gray markings are used to accentuate pairwise differences pre-attentively:















With this technique, spotting where the differences occur takes little cognitive effort.
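The idea behind a pairwise, row-wrapped comparison can be sketched in plain text. This is an illustration under stated assumptions rather than the rendering used in these slides: it assumes the two sequences are already aligned and equally long, it marks each mismatch with a `^` instead of a gray marking, and the function names and the two short example strings are made up.

```r
# Positions where two aligned, equally long sequences differ.
diff_positions <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  stopifnot(length(x) == length(y), length(x) > 0)
  which(x != y)
}

# Print both sequences in rows of `width` symbols, one pair per row block,
# with a marker line that puts "^" under every mismatch.
print_pairwise <- function(a, b, width = 60) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  marks <- rep(" ", length(x))
  marks[diff_positions(a, b)] <- "^"
  for (start in seq(1, length(x), by = width)) {
    idx <- start:min(start + width - 1, length(x))
    cat(paste(x[idx], collapse = ""), "\n",
        paste(y[idx], collapse = ""), "\n",
        paste(marks[idx], collapse = ""), "\n", sep = "")
  }
  invisible(diff_positions(a, b))
}

# Two made-up sequences that differ at positions 4 and 7.
print_pairwise("ACGTACGT", "ACGAACTT", width = 4)
```

Keeping each pair of rows directly above one another is what lets the eye compare positions without counting.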


Non-position-3-symbols

Shown in the traditional way:

[Figure: traditional barcode plots of the non-3rd-position symbols of the two sample sequences]

In this display we see that the barcodes are quite similar.

Maybe we even wonder if there are any differences at all there? But we cannot say, or can we?

As a side remark on colors: the colors used for the traditional barcode display are not well chosen. It is better to use perceptually friendly colors and to avoid the "RGB corners", that is, the fully saturated colors at the corners of the RGB color cube. Look at [Color Brewer](http://www.colorbrewer2.org) for guidance!
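One possible way to pick such colors in R is sketched below. It assumes R 4.0.0 or later, where `palette.colors()` in the grDevices package offers qualitative palettes derived from ColorBrewer; the choice of the "Dark 2" palette and the assignment of colors to symbols are arbitrary assumptions made for illustration.

```r
# Requires R >= 4.0.0. "Dark 2" is one of the ColorBrewer-derived
# qualitative palettes available through palette.colors().
friendly_colors <- unname(palette.colors(n = 4, palette = "Dark 2"))

# Which color goes with which symbol is an arbitrary choice here.
names(friendly_colors) <- c("A", "C", "T", "G")
friendly_colors
```

A vector like this could replace the red, blue, green and yellow of the traditional display.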


Non-position-3 data

Displayed as separate color-coded row-wrapped multi-line paragraphs













Still nearly impossible to see whether there are any differences, right?




















Now we can see where the differences are!


Position-3 data: mute similarities or differences?















However, as this example shows, muting can be a little confusing when the foreground and the background are more or less equally represented. In that case muting works less well, and color pairs can be harder to distinguish.


Numerical similarity measures

The Levenshtein distance between the two sequences is 99. This "edit distance" is the smallest number of single-symbol edit operations (insertions, deletions and substitutions) needed to turn one string into the other. For the symbols in the 3rd position only, the distance is 89. For the remaining symbols (i.e. those in non-3rd positions), it is 7.
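In R, this kind of distance can be computed with `adist()` from the utils package that ships with base R. The wrapper name `levenshtein` and the short example strings below are made up for illustration; whether the figures above were produced with `adist()` or with another package is not stated here.

```r
# adist() computes the generalized Levenshtein distance and returns a matrix;
# for two single strings the distance is in cell [1, 1].
levenshtein <- function(a, b) {
  as.integer(utils::adist(a, b)[1, 1])
}

levenshtein("kitten", "sitting")      # 3
levenshtein("ACGTACGT", "ACGAACTT")   # 2 (two substitutions)
levenshtein("ACGT", "AGT")            # 1 (one deletion)
```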

Similarity measure

The Levenshtein similarity measure is defined on the interval [0,1], where 0 indicates the highest level of dissimilarity and 1 denotes the highest possible similarity between two strings of symbols.

For symbols in the 3rd position, we get the measure 0.588. For symbols in other positions, we get a much higher similarity measure: 0.9838.
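A common way to normalize the distance into such a similarity is 1 - d / (length of the longer string). That the measures above use exactly this formula is an assumption, but the reported numbers are consistent with it if both sequences have 648 symbols like the European Roller sequence shown earlier, which gives 216 symbols in 3rd positions and 432 in other positions. The function name below is hypothetical.

```r
# Similarity in [0, 1]: 1 minus the edit distance divided by the length
# of the longer string. Two empty strings are treated as identical.
levenshtein_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - d / longest
}

levenshtein_similarity("ACGTACGT", "ACGAACTT")  # 0.75

# Check against the reported distances, assuming 216 and 432 symbols:
round(1 - 89 / 216, 3)  # 0.588
round(1 - 7 / 432, 4)   # 0.9838
```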