-
Notifications
You must be signed in to change notification settings - Fork 2
/
GGplotCase.Rmd
535 lines (379 loc) · 16.6 KB
/
GGplotCase.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
---
title: "Why GGplot2"
author: "Jakob"
date: "16/6/2017"
output: pdf_document
latex_engine: xelatex
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
##Why ggplot2?
Advantages of ggplot2
* consistent underlying grammar of graphics (Wilkinson, 2005) * plot
specification at a high level of abstraction * very flexible * theme system for
polishing plot appearance * mature and complete graphics system * many users,
active mailing list * That said, there are some things you cannot (or should
not) do With ggplot2:
3-dimensional graphics (see the rgl package) Graph-theory type graphs
(nodes/edges layout; see the igraph package) Interactive graphics (see the ggvis
package)
\newpage ##What Is The Grammar Of Graphics?
The basic idea: independently specify plot building blocks and combine them to
create just about any kind of graphical display you want. Building blocks of a
graph include:
* data * aesthetic mapping * geometric object * statistical transformations *
scales * coordinate system * position adjustments * faceting
Lets load data:
```{r D1}
setwd("~/Dropbox/ubehjælpelige forsøg på programmering/Assignment1")
housing <- read.csv("Rgraphics/dataSets/landdata-states.csv")
head(housing[1:5])
```
\newpage
##ggplot2 VS Base Graphics
Compared to base graphics, ggplot2
* is more verbose for simple / canned graphics * is less verbose for complex /
custom graphics * does not have methods (data should always be in a data.frame)
* uses a different system for adding plot elements * ggplot2 VS Base for simple
graphs
Base graphics histogram example:
```{r D2}
hist(housing$Home.Value)
```
\newpage
```{r D3}
library(ggplot2)
ggplot(housing, aes(x = Home.Value)) +
geom_histogram()
```
\newpage
Base colored scatter plot example:
```{r D4}
plot(Home.Value ~ Date, data=subset(housing, State == "MA"))
points(Home.Value ~ Date, col="red", data=subset(housing, State == "TX"))
legend(1975, 400000, c("MA", "TX"), title="State", col=c("black", "red"), pch=c(1, 1))
```
\newpage ggplot2 colored scatter plot example:
```{r }
ggplot(subset(housing,
State %in% c("MA", "TX")), aes(x=Date, y=Home.Value, color=State))+ geom_point()
```
ggplot2 wins!
\newpage
##Geometric Objects And Aesthetics
###Aesthetic Mapping
In ggplot land aesthetic means "something you can see". Examples include:
* position (i.e., on the x and y axes) * color ("outside" color) * fill
("inside" color) * shape (of points) * linetype * size
Each type of geom accepts only a subset of all aesthetics–refer to the geom help
pages to see what mappings each geom accepts. Aesthetic mappings are set with
the aes() function.
##Geometic Objects (geom)
###Geometric objects are the actual marks we put on a plot. Examples include:
* points (geom_point, for scatter plots, dot plots, etc) * lines (geom_line, for
time series, trend lines, etc) boxplot (geom_boxplot, for, well, boxplots!)
A plot must have at least one geom; there is no upper limit. You can add a geom
to a plot using the + operator
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
or simply type geom_<tab> in any good R IDE (such as Rstudio or ESS) to see a
list of functions starting with geom_.
\newpage ##Points (Scatterplot)
Now that we know about geometric objects and aesthetic mapping, we can make a
ggplot. geom_point requires mappings for x and y, all others are optional.
```{r}
hp2001Q1 <- subset(housing, Date == 2001.25)
ggplot(hp2001Q1, aes(y = Structure.Cost, x = Land.Value)) + geom_point()
ggplot(hp2001Q1, aes(y = Structure.Cost, x = log(Land.Value))) + geom_point()
```
\newpage ##Lines (Prediction Line)
A plot constructed with ggplot can have more than one geom. In that case the
mappings established in the ggplot() call are plot defaults that can be added to
or overridden. Our plot could use a regression line:
```{r}
hp2001Q1$pred.SC <-
predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))
p1 + geom_point(aes(color = Home.Value)) + geom_line(aes(y = pred.SC))
```
\newpage ##Smoothers
Not all geometric objects are simple shapes–the smooth geom includes a line and
a ribbon.
```{r}
p1 + geom_point(aes(color = Home.Value)) + geom_smooth()
```
\newpage
##Text (Label Points)
Each geom accepts a particualar set of mappings–for example geom_text() accepts
a labels mapping.
```{r}
p1 + geom_text(aes(label=State), size = 3)
```
## install.packages("ggrepel")
```{r}
library("ggrepel")
p1 + geom_point() + geom_text_repel(aes(label=State), size = 3)
```
\newpage ##Aesthetic Mapping VS Assignment
Note that variables are mapped to aesthetics with the aes() function, while
fixed aesthetics are set outside the aes() call. This sometimes leads to
confusion, as in this example:
```{r}
p1 + geom_point(aes(size = 2),# incorrect! 2 is not a variable
color="red") # this is fine -- all points red
```
\newpage ##Mapping Variables To Other Aesthetics
Other aesthetics are mapped in the same way as x and y in the previous example.
```{r}
p1 + geom_point(aes(color=Home.Value, shape = region))
```
\newpage ##Excercise 1
The data for the exercises is available in the dataSets/EconomistData.csv file.
Read it in with
dat <- read.csv("dataSets/EconomistData.csv") head(dat)
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) + geom_point()
dat <- read.csv("dataSets/EconomistData.csv")
These data consist of Human Development Index and Corruption Perception Index
scores for several countries.
Create a scatter plot with CPI on the x axis and HDI on the y axis. Color the
points blue. Map the color of the the points to Region. Make the points bigger
by setting size to 2 Map the size of the points to HDI.Rank
\newpage
##Statistical Transformations
Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations:
* for a boxplot the y values must be transformed to the median and 1.5(IQR)
* for a smoother smother the y values must be transformed into predicted values
Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_bar is stat_bin:
```{r }
args(geom_histogram)
args(stat_bin)
```
\newpage
##etting Statistical Transformation Arguments
Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change it you have to first determine which stat the geom uses, then determine the arguments to that stat.
For example, here is the default histogram of Home.Value:
```{r}
p2 <- ggplot(housing, aes(x = Home.Value))
p2 + geom_histogram()
```
reasonable by default, but we can change it by passing the binwidth argument to the stat_bin function:
```{r}
p2 + geom_histogram(stat = "bin", binwidth=4000)
```
\newpage
##Changing The Statistical Transformation
Sometimes the default statistical transformation is not what you need. This is often the case with pre-summarized data:
```{r}
housing.sum <- aggregate(housing["Home.Value"], housing["State"], FUN=mean)
rbind(head(housing.sum), tail(housing.sum))
```
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar()
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar()
Error: stat_count() must not be used with a y aesthetic.
What is the problem with the previous plot? Basically we take binned and summarized data and ask ggplot to bin and summarize it again (remember, geom_bar defaults to stat = stat_count); obviously this will not work. We can fix it by telling geom_bar to use a different statistical transformation function:
```{r}
ggplot(housing.sum, aes(x=State, y=Home.Value)) +
geom_bar(stat="identity")
```
\newpage
##Exercise II
Re-create a scatter plot with CPI on the x axis and HDI on the y axis (as you did in the previous exercise).
Overlay a smoothing line on top of the scatter plot using geom_smooth.
Overlay a smoothing line on top of the scatter plot using geom_smooth, but use a linear model for the predictions. Hint: see ?stat_smooth.
Overlay a smoothing line on top of the scatter plot using geom_line. Hint: change the statistical transformation.
BONUS: Overlay a smoothing line on top of the scatter plot using the default loess method, but make it less smooth. Hint: see ?loess.
\newpage
##Scales: Controlling Aesthetic Mapping
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn't say how that should happen. For example, when mapping a variable to shape with aes(shape = x) you don't say what shapes should be used. Similarly, aes(color = z) doesn't say what colors should be used. Describing what colors/shapes/sizes etc. to use is done by modifying the corresponding scale. In ggplot2 scales include
* position
* color and fill
* size
* shape
* line type
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming scheme. Try typing scale_<tab> to see a list of scale modification functions.
Common Scale Arguments
The following arguments are common to most scales in ggplot2:
* name
the first argument gives the axis or legend title
* limits
the minimum and maximum of the scale
* breaks
the points along the scale where labels should appear
* labels
the labels that appear at each break
Specific scale functions may have additional arguments; for example, the scale_color_continuous function has arguments low and high for setting the colors at the low and high end of the scale.
\newpage
##Scale Modification Examples
Start by constructing a dotplot showing the distribution of home values by Date and State.
```{r}
p3 <- ggplot(housing,
aes(x = State,
y = Home.Price.Index)) +
theme(legend.position="top",
axis.text=element_text(size = 6))
(p4 <- p3 + geom_point(aes(color = Date),
alpha = 0.5,
size = 1.5,
position = position_jitter(width = 0.25, height = 0)))
```
Now modify the breaks for the x axis and color scales
```{r}
p4 + scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"))
```
Next change the low and high values to blue and red:
```{r}
p4 +
scale_x_discrete(name="State Abbreviation") +
scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = "blue", high = "red")
```
```{r}
library(scales)
p4 + scale_color_continuous(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = muted("blue"), high = muted("red"))
```
\newpage
##Using different color scales
ggplot2 has a wide variety of color scales; here is an example using scale_color_gradient2 to interpolate between three different colors.
```{r}
p4 +
scale_color_gradient2(name="",
breaks = c(1976, 1994, 2013),
labels = c("'76", "'94", "'13"),
low = muted("blue"),
high = muted("red"),
mid = "gray60",
midpoint = 1994)
```
Exercise III
Create a scatter plot with CPI on the x axis and HDI on the y axis. Color the points to indicate region.
Modify the x, y, and color scales so that they have more easily-understood names (e.g., spell out "Human development Index" instead of "HDI").
Modify the color scale to use specific values of your choosing. Hint: see ?scale_color_manual.
##Faceting
Faceting
Faceting is ggplot2 parlance for small multiples
The idea is to create separate graphs for subsets of data
ggplot2 offers two functions for creating small multiples:
facet_wrap(): define subsets as the levels of a single grouping variable
facet_grid(): define subsets as the crossing of two grouping variables
Facilitates comparison among plots, not just of geoms within a plot
What is the trend in housing prices in each state?
Start by using a technique we already know–map State to color:
```{r}
p5 <- ggplot(housing, aes(x = Date, y = Home.Value))
p5 + geom_line(aes(color = State))
```
There are two problems here–there are too many states to distinguish each one by color, and the lines obscure one another.
We can remedy the deficiencies of the previous plot by faceting by state rather than mapping state to color.
```{r}
(p5 <- p5 + geom_line() +
facet_wrap(~State, ncol = 10))
```
There is also a facet_grid() function for faceting in two dimensions.
\newpage
##Themes
Themes
The ggplot2 theme system handles non-data plot elements such as
Axis labels
Plot background
Facet label backround
Legend appearance
Built-in themes include:
theme_gray() (default)
theme_bw()
theme_classc()
```{r}
p5 + theme_linedraw()
```
```{r}
p5 + theme_light()
```
\newpage
##Overriding theme defaults
Specific theme elements can be overridden using theme(). For example:
```{r}
p5 + theme_minimal() +
theme(text = element_text(color = "turquoise"))
```
All theme options are documented in ?theme.
\newpage
##Creating and saving new themes
You can create new themes, as in the following example:
### this code proved difficult and it appears to be because of problems with the built in letter types```{r}
theme_new <- theme_bw() + theme(plot.background = element_rect(size = 1, color = "blue", fill = "black"),
text=element_text(size = 12, family = "Serif", color = "ivory"),
axis.text.y = element_text(colour = "purple"),
axis.text.x = element_text(colour = "red"),
panel.background = element_rect(fill = "pink"),
strip.background = element_rect(fill = muted("orange")))
p5 + theme_new
``
##The #1 FAQ
Map Aesthetic To Different Columns
The most frequently asked question goes something like this: I have two variables in my data.frame, and I'd like to plot them as separate points, with different color depending on which variable it is. How do I do that?
##Wrong
```{r}
housing.byyear <- aggregate(cbind(Home.Value, Land.Value) ~ Date, data = housing, mean)
ggplot(housing.byyear,
aes(x=Date)) +
geom_line(aes(y=Home.Value), color="red") +
geom_line(aes(y=Land.Value), color="blue")
```
##right
```{r}
library(tidyr)
home.land.byyear <- gather(housing.byyear,
value = "value",
key = "type",
Home.Value, Land.Value)
ggplot(home.land.byyear,
aes(x=Date,
y=value,
color=type)) +
geom_line()
```
\newpage
##Excercise 1 SOLVED
The data for the exercises is available in the dataSets/EconomistData.csv file.
Read it in with
```{r }
dat <- read.csv("Rgraphics/dataSets/EconomistData.csv")
head(dat)
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) + geom_point()
dat <- read.csv("Rgraphics/dataSets/EconomistData.csv")
```
These data consist of Human Development Index and Corruption Perception Index
scores for several countries.
Create a scatter plot with CPI on the x axis and HDI on the y axis. Color the
points blue. Map the color of the the points to Region. Make the points bigger
by setting size to 2 Map the size of the points to HDI.Rank
```{r}
ggplot(dat, aes(x = CPI, y = HDI, size = HDI.Rank)) + geom_point(aes(color = Region))
```
##Excercise 2 solved
Re-create a scatter plot with CPI on the x axis and HDI on the y axis (as you did in the previous exercise).
Overlay a smoothing line on top of the scatter plot using geom_smooth.
Overlay a smoothing line on top of the scatter plot using geom_smooth, but use a linear model for the predictions. Hint: see ?stat_smooth.
Overlay a smoothing line on top of the scatter plot using geom_line. Hint: change the statistical transformation.
* I don´t understand the purpose of the question above
BONUS: Overlay a smoothing line on top of the scatter plot using the default loess method, but make it less smooth. Hint: see ?loess.
\newpage
Putting It All Together
Challenge: Recreate This Economist Graph
images/Economist1.pdf
Graph source: http://www.economist.com/node/21541178
Building off of the graphics you created in the previous exercises, put the finishing touches to make it as close as possible to the original economist graph.
http://tutorials.iq.harvard.edu/R/Rgraphics/images/Economist1.png
```{r}
ggplot(dat, aes(x = CPI, y = HDI, size = 3)) + scale_shape_identity()+ geom_point(aes(color = Region), shape=1)+geom_smooth()+geom_smooth(method = "loess", size = 1.5)+theme_bw()+ geom_text_repel(aes(label=Country), size = 3)
```