---
output:
  github_document:
    toc: TRUE
---
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.594018.svg)](https://doi.org/10.5281/zenodo.594018)
[![CRAN version](http://www.r-pkg.org/badges/version/gapminder)](http://cran.r-project.org/package=gapminder) ![](http://cranlogs.r-pkg.org/badges/grand-total/gapminder)
gapminder
=========
```{r setup, include = FALSE, cache = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  out.width = "100%",
  comment = "#>",
  fig.path = "man/figures/README-"
)
## so jittered figs don't always appear to be changed
set.seed(1)
```
Excerpt from the [Gapminder](http://www.gapminder.org/data/) data. The main object in this package is the `gapminder` data frame or "tibble". There are other goodies, such as the data in tab-delimited form, a larger unfiltered dataset, premade color schemes for the countries and continents, and ISO 3166-1 country codes.
The `gapminder` data frames include six variables (see the [Gapminder.org documentation page](http://www.gapminder.org/data/documentation/)):
| variable | meaning |
|:------------|:-------------------------|
| country | |
| continent | |
| year | |
| lifeExp | life expectancy at birth |
| pop | total population |
| gdpPercap | per-capita GDP |
Per-capita GDP (Gross domestic product) is given in units of [international dollars](http://en.wikipedia.org/wiki/Geary%E2%80%93Khamis_dollar), "a hypothetical unit of currency that has the same purchasing power parity that the U.S. dollar had in the United States at a given point in time" -- 2005, in this case.
The package contains two main data frames or tibbles:
* `gapminder`: 12 rows for each country (1952, 1957, ..., 2007). It's a subset of ...
* `gapminder_unfiltered`: more lightly filtered and therefore has about twice as many rows.
**Note: this package exists for the purpose of teaching and making code examples. It is an excerpt of data found in specific spreadsheets on Gapminder.org circa 2010. It is not a definitive source of socioeconomic data and I don't update it. Use other data sources if it's important to have the current best estimate of these statistics.**
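The row structure described above can be checked directly once the package is installed. The following is a minimal sketch using only base R; the helper name `rows_per_country` is introduced here for illustration and is not part of the package.

```r
library(gapminder)

# Count how many rows each country contributes to the filtered data
rows_per_country <- table(gapminder$country)

# TRUE if every country has exactly 12 rows
all(rows_per_country == 12)

# The distinct years covered by the filtered data
sort(unique(gapminder$year))

# Compare the sizes of the filtered and unfiltered data frames
nrow(gapminder)
nrow(gapminder_unfiltered)
```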
### Install and test drive
Install `gapminder` from CRAN:
```{r eval = FALSE}
install.packages("gapminder")
```
Or you can install `gapminder` from GitHub:
```{r eval = FALSE}
devtools::install_github("jennybc/gapminder")
```
Load it and test drive with some data aggregation and plotting:
```{r test-drive, message = FALSE, warning = FALSE}
library("gapminder")
aggregate(lifeExp ~ continent, gapminder, median)
library("dplyr")
gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(lifeExp = median(lifeExp))
library("ggplot2")
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
```
### Color schemes for countries and continents
`country_colors` and `continent_colors` are provided as character vectors where elements are hex colors and the names are countries or continents.
```{r}
head(country_colors, 4)
head(continent_colors)
```
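Because both objects are named character vectors, individual colors can be looked up by name. A small sketch, assuming the country and continent names below are spelled exactly as they appear in the data:

```r
library(gapminder)

# Look up the hex colors for specific countries by name
country_colors[c("Kenya", "Peru")]

# Same idea for a continent
continent_colors["Africa"]

# Check whether every country in `gapminder` has an entry in the scheme
all(levels(gapminder$country) %in% names(country_colors))
```

Indexing by name rather than by position keeps the lookup correct even if the vectors are reordered.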
```{r echo = FALSE}
knitr::include_graphics("man/figures/gapminder-color-scheme-ggplot2.png")
```
The country scheme is available in this repo as
* [PNG](data-raw/gapminder-color-scheme-ggplot2.png) or [PDF](data-raw/gapminder-color-scheme-base.pdf)
* [`continent-colors.tsv`](inst/extdata/continent-colors.tsv) and [`country-colors.tsv`](inst/extdata/country-colors.tsv)
### How to use color scheme in `ggplot2`
Provide `country_colors` to `scale_color_manual()` like so:
```{r scale-color-manual, eval = FALSE}
... + scale_color_manual(values = country_colors) + ...
```
```{r demo-country-colors-ggplot2}
library("ggplot2")
ggplot(subset(gapminder, continent != "Oceania"),
       aes(x = year, y = lifeExp, group = country, color = country)) +
  geom_line(lwd = 1, show.legend = FALSE) + facet_wrap(~ continent) +
  scale_color_manual(values = country_colors) +
  theme_bw() + theme(strip.text = element_text(size = rel(1.1)))
```
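The same pattern should work for the continent scheme when `color` is mapped to `continent` instead of `country`. The plot below is only an illustrative sketch; the choice of year, axes, and the object name `p` are assumptions made for this example.

```r
library(gapminder)
library(ggplot2)

# Scatterplot for a single year, colored by continent
p <- ggplot(subset(gapminder, year == 2007),
            aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10() +
  # the names of continent_colors are matched against the continent levels
  scale_color_manual(values = continent_colors)
p
```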
### How to use color scheme in base graphics
```{r demo-country-colors-base}
# for convenience, integrate the country colors into the data.frame
gap_with_colors <-
  data.frame(gapminder,
             cc = I(country_colors[match(gapminder$country,
                                         names(country_colors))]))
# bubble plot, focus just on Africa and Europe in 2007
keepers <- with(gap_with_colors,
                continent %in% c("Africa", "Europe") & year == 2007)
plot(lifeExp ~ gdpPercap, gap_with_colors,
     subset = keepers, log = "x", pch = 21,
     cex = sqrt(gap_with_colors$pop[keepers]/pi)/1500,
     bg = gap_with_colors$cc[keepers])
```
### ISO 3166-1 country codes
The `country_codes` data frame provides ISO 3166-1 country codes for all the countries in the `gapminder` and `gapminder_unfiltered` data frames. This can be used to practice joining or merging.
```{r message = FALSE}
library("dplyr")
gapminder %>%
  filter(year == 2007, country %in% c("Kenya", "Peru", "Syria")) %>%
  select(country, continent) %>%
  # name the join key explicitly rather than relying on the default
  left_join(country_codes, by = "country")
```
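For practice with merging in base R, roughly the same join can be written with `merge()`. This is a sketch rather than the package's recommended approach; the intermediate names `keep`, `gap_2007`, and `joined` are made up for this example.

```r
library(gapminder)

# Pick the same three countries in 2007, keeping only two columns
keep <- gapminder$year == 2007 &
  gapminder$country %in% c("Kenya", "Peru", "Syria")
gap_2007 <- as.data.frame(gapminder[keep, c("country", "continent")])

# Left join on `country`: all.x = TRUE keeps every row of gap_2007
joined <- merge(gap_2007, as.data.frame(country_codes),
                by = "country", all.x = TRUE)
joined
```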
### What is `gapminder` good for?
I have used this excerpt in [STAT 545](http://stat545-ubc.github.io) since 2008 and, more recently, in [R-flavored Software Carpentry Workshops](http://jennybc.github.io/2014-05-12-ubc/) and a [`ggplot2` tutorial](https://github.com/jennybc/ggplot2-tutorial). `gapminder` is very useful for teaching novices data wrangling and visualization in R.
Description:
* `r nrow(gapminder)` observations; fills a size niche between `iris` (150 rows) and the likes of `diamonds` (54K rows)
* `r ncol(gapminder)` variables
    - `country`: a factor with `r nlevels(gapminder$country)` levels
    - `continent`: a factor with `r nlevels(gapminder$continent)` levels
    - `year`: going from 1952 to 2007 in increments of 5 years
    - `pop`: population
    - `gdpPercap`: GDP per capita
    - `lifeExp`: life expectancy
There are 12 rows for each country in `gapminder`, i.e. complete data for 1952, 1957, ..., 2007.
The two factors provide opportunities to demonstrate factor handling, in aggregation and visualization, for factors with very few and very many levels.
The four quantitative variables are generally quite correlated with each other and these trends have interesting relationships to `country` and `continent`, so you will find that simple plots and aggregations tell a reasonable story and are not completely boring.
Visualization of the temporal trends in life expectancy, by country, is particularly rewarding, since there are several countries with sharp drops due to political upheaval. This then motivates more systematic investigations via data aggregation to proactively identify all countries whose data exhibits certain properties.
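One way to start such a systematic investigation is to compute, for each country, the largest drop in life expectancy between consecutive observations. The sketch below uses base R only; `gap` and `worst_drop` are names introduced here for illustration, and "largest drop" is just one possible definition of a sharp decline.

```r
library(gapminder)

# Make sure rows are ordered by country, then year
gap <- as.data.frame(gapminder)
gap <- gap[order(gap$country, gap$year), ]

# For each country, the most negative change in lifeExp
# between one observation and the next
worst_drop <- tapply(gap$lifeExp, gap$country,
                     function(x) min(diff(x)))

# The five countries with the steepest single-interval declines
head(sort(worst_drop), 5)
```

The countries at the top of this list are natural candidates for a closer look in a line plot.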
### How this sausage was made
<blockquote class="twitter-tweet" lang="en"><p>Data cleaning code cannot be clean. It's a sort of sin eater.</p>— Stat Fact (@StatFact) <a href="https://twitter.com/StatFact/status/492753200190341120">July 25, 2014</a></blockquote>
The [`data-raw`](data-raw/) directory contains the Excel spreadsheets downloaded from [Gapminder](http://www.gapminder.org) in 2008 and 2009 and all the scripts necessary to create everything in this package, in raw and "compiled notebook" form.
### Plain text delimited files
If you want to practice importing from file, various tab-delimited files are included:
* [`gapminder.tsv`](inst/extdata/gapminder.tsv): the same dataset available via `library("gapminder"); gapminder`
* [`gapminder-unfiltered.tsv`](inst/extdata/gapminder-unfiltered.tsv): the larger dataset available via `library("gapminder"); gapminder_unfiltered`.
* [`continent-colors.tsv`](inst/extdata/continent-colors.tsv) and [`country-colors.tsv`](inst/extdata/country-colors.tsv): color schemes
Here in the source, these delimited files can be found in the [`inst/extdata/`](inst/extdata/) sub-directory.
Once you've installed the `gapminder` package they can be found locally and used like so:
```{r}
gap_tsv <- system.file("extdata", "gapminder.tsv", package = "gapminder")
gap_tsv <- read.delim(gap_tsv)
str(gap_tsv)
gap_tsv %>% # Bhutan did not make the cut because data for only 8 years :(
  filter(country == "Bhutan")
gap_bigger_tsv <-
  system.file("extdata", "gapminder-unfiltered.tsv", package = "gapminder")
gap_bigger_tsv <- read.delim(gap_bigger_tsv)
str(gap_bigger_tsv)
gap_bigger_tsv %>% # Bhutan IS here though! :)
  filter(country == "Bhutan")
```
## License
Gapminder's data is released under the Creative Commons Attribution 3.0 Unported license. See their [terms of use](https://docs.google.com/document/pub?id=1POd-pBMc5vDXAmxrpGjPLaCSDSWuxX6FLQgq5DhlUhM).
## Citation
Run this command to get info on how to cite this package. If you've installed gapminder from CRAN, the year will be filled in, and filled in correctly (unlike below).
```{r warning = FALSE}
citation("gapminder")
```