-
Notifications
You must be signed in to change notification settings - Fork 176
/
Copy pathch01.Rmd
373 lines (232 loc) · 13 KB
/
ch01.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
output:
bookdown::html_document2:
fig_caption: yes
editor_options:
chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
R Basics {#CHAPTER-R-BASICS}
========
This chapter covers the basics: installing and using packages and loading data.
Most of the recipes in this book require the ggplot2, dplyr, and gcookbook packages to be installed on your computer. (The gcookbook package contains data sets used in some of the examples, but is not necessary for doing your real work.) If you want to get started quickly, run:
```{r, eval=FALSE}
install.packages("tidyverse")
install.packages("gcookbook")
```
Then, in each R session, before running the examples in this book, you can load them with:
```{r, eval=FALSE}
library(tidyverse)
library(gcookbook)
```
Running `library(tidyverse)` will load ggplot2, dplyr, and a number of other packages. If you want to keep your R session more streamlined and load only the packages that are strictly needed, you can load ggplot2 and dplyr packages individually:
```{r eval=FALSE}
library(ggplot2)
library(dplyr)
library(gcookbook)
```
> **Note**
>
> If you want a deeper understanding of how ggplot2 works, see Appendix \@ref(CHAPTER-GGPLOT2), which explains the concepts behind ggplot2.
Packages in R are collections of functions and/or data that are bundled up for easy distribution, and installing a package will extend the functionality of R on your computer. If an R user creates a package and thinks that it might be useful for others, that user can distribute it through a package repository. The primary repository for distributing R packages is called CRAN (the Comprehensive R Archive Network), but there are others, such as Bioconductor, which specializes in packages related to genomic data.
If you have spent much time learning R, you may have heard of the *tidyverse*, which is a collection of R packages that share common ideas of how data should be structured and manipulated. This is in contrast to *base R*, which is the set of packages that are included when you just download and install R. The tidyverse is a set of add-ons for R, which make it easier to do many operations related to data manipulation and visualization. This book mostly uses the tidyverse, as I believe that it provides a quicker and simpler (but not less powerful!) way to work with data.
If you haven't used the tidyverse before, there is one recipe in particular that you should read that will help you understand a foreign-looking bit of syntax: `%>%`, also known as the pipe operator. This is Recipe \@ref(RECIPE-R-BASICS-PIPE) in this chapter.
Installing a Package
--------------------
### Problem
You want to install a package from CRAN.
### Solution
Use `install.packages()` and give it the name of the package you want to install. To install ggplot2, run:
```{r, eval=FALSE}
install.packages("ggplot2")
```
At this point you may be prompted to select a download mirror. It's usually best to use the first choice, https://cloud.r-project.org/, as it is a cloud-based mirror with endpoints all over the world.
### Discussion
If you want to install multiple packages at once, you can pass it a vector of package names. For example, this will install most of the packages used in this book:
```{r, eval=FALSE}
install.packages(c("ggplot2", "gcookbook", "MASS", "dplyr"))
```
When you tell R to install a package, it will automatically install any other packages that the first package depends on.
CRAN (the Comprehensive R Archive Network) is a repository of packages for R, and it is mirrored on many servers around the world. It is the default repository system used by R. There are other package repositories; Bioconductor, for example, is a repository of packages related to analyzing genomic data.
Loading a Package
-----------------
### Problem
You want to load an installed package.
### Solution
Use `library()` and give it the name of the package you want to install. To load ggplot2, run:
```{r, eval=FALSE}
library(ggplot2)
```
The package must already be installed on the computer.
### Discussion
Most of the recipes in this book require loading a package before running the code, either for the graphing capabilities (as in the ggplot2 package) or for example data sets (as in the MASS and gcookbook packages).
One of R's quirks is the package/library terminology. Although you use the `library()` function to load a package, a package is not a library, and some longtime R users will get irate if you call it that.
A *library* is a directory that contains a set of packages. You might, for example, have a system-wide library as well as a library for each user.
Upgrading Packages
------------------
### Problem
You want to upgrade a package that is already installed.
### Solution
Run `update.packages()`:
```{r, eval=FALSE}
update.packages()
```
It will prompt you for each package that can be upgraded. If you want it to upgrade all packages without asking, use `ask = FALSE`:
```{r, eval=FALSE}
update.packages(ask = FALSE)
```
### Discussion
Over time, package authors will release new versions of packages with bug fixes and new features, and it's usually a good idea to keep up-to-date. However, keep in mind that occasionally new versions of packages will introduce bugs or have slightly changed behavior.
Loading a Delimited Text Data File
----------------------------------
### Problem
You want to load data from a delimited text file.
### Solution
The most common way to read in a file is to use comma-separated values (CSV) data:
```{r, eval=FALSE}
data <- read.csv("datafile.csv")
```
Alternatively, you can use the `read_csv()` function (note the underscore instead of period) from the readr package. This function is significantly faster than `read.csv()`, and
### Discussion
Since data files have many different formats, there are many options for loading them. For example, if the data file does *not* have headers in the first row:
```{r, eval=FALSE}
data <- read.csv("datafile.csv", header = FALSE)
```
The resulting data frame will have columns named `V1`, `V2`, and so on, and you will probably want to rename them manually:
```{r, eval=FALSE}
# Manually assign the header names
names(data) <- c("Column1", "Column2", "Column3")
```
You can set the delimiter with sep. If it is space-delimited, use `sep = " "`. If it is tab-delimited, use `\t`, as in:
```{r, eval=FALSE}
data <- read.csv("datafile.csv", sep = "\t")
```
By default, strings in the data are treated as factors. Suppose this is your data file, and you read it in using `read.csv()`:
```
"First","Last","Sex","Number"
"Currer","Bell","F",2
"Dr.","Seuss","M",49
"","Student",NA,21
```
The resulting data frame will store `First` and `Last` as *factors*, though it makes more sense in this case to treat them as strings (or *character vectors* in R terminology). To differentiate this, use `stringsAsFactors = FALSE`. If there are any columns that should be treated as factors, you can then convert them individually:
```{r, eval=FALSE}
data <- read.csv("datafile.csv", stringsAsFactors = FALSE)
# Convert to factor
data$Sex <- factor(data$Sex)
str(data)
#> 'data.frame': 3 obs. of 4 variables:
#> $ First : chr "Currer" "Dr." ""
#> $ Last : chr "Bell" "Seuss" "Student"
#> $ Sex : Factor w/ 2 levels "F","M": 1 2 NA
#> $ Number: int 2 49 21
```
Alternatively, you could load the file with strings as factors, and then convert individual columns from factors to characters.
### See Also
`read.csv()` is a convenience wrapper function around `read.table()`. If you need more control over the input, see `?read.table`.
Loading Data from an Excel File
-------------------------------
### Problem
You want to load data from an Excel file.
### Solution
The readxl package has the function `read_excel()` for reading .xls and .xlsx files from Excel. This will read the first sheet of an Excel spreadsheet:
```{r, eval=FALSE}
# Only need to install once
install.packages("readxl")
library(readxl)
data <- read_excel("datafile.xlsx", 1)
```
### Discussion
With `read_excel()`, you can load from other sheets by specifying a number for sheetIndex or a name for sheetName:
```{r, eval=FALSE}
data <- read_excel("datafile.xls", sheet = 2)
data <- read_excel("datafile.xls", sheet = "Revenues")
```
`read_excel()` uses the first row of the spreadsheet for column names. If you don't want to use that row for column names, use `col_names = FALSE`. The columns will instead be named `X1`, `X2`, and so on.
By default, `read_excel()` will infer the type of each column, but if you want to specify the type of each column, you can use the `col_types` argument. You can also drop columns if you specify the type as `"blank"`.
```{r, eval=FALSE}
# Drop the first column, and specify the types of the next three columns
data <- read_excel("datafile.xls", col_types = c("blank", "text", "date", "numeric"))
```
### See Also
See `?read_excel` for more options controlling the reading of these files.
There are other packages for reading Excel files. The gdata package has a function `read.xls()` for reading in .xls files, and the xlsx package has a function `read.xlsx()` for reading in .xlsx files. They require external software to be installed on your computer: `read.xls()` requires Java, and `read.xlsx()` requires Perl.
Loading Data from SPSS/SAS/Stata Files
--------------------------------------
### Problem
You want to load data from a SPSS file, or from other programs like SAS or Stata.
### Solution
The haven package has the function `read_sav()` for reading SPSS files. To load data from an SPSS file:
```{r, eval=FALSE}
# Only need to install the first time
install.packages("haven")
library(haven)
data <- read_sav("datafile.sav")
```
### Discussion
The haven package also includes functions to read from other formats:
* `read_sas()`: SAS
* `read_dta()`: Stata
An alternative to haven is the foreign package. It also supports SPSS and Stata files, but it is not as up-to-date as the functions from haven. For example, it only supports Stata files up to version 12, while haven supports up to version 14 (the current version as of this writing).
The foreign package does support some other formats, including:
* `read.octave()`: Octave and MATLAB
* `read.systat()`:SYSTAT
* `read.xport()`: SAS XPORT
* `read.dta()`: Stata
* `read.spss()`: SPSS
### See Also
Run `ls("package:foreign")` for a full list of functions in the foreign package.
Chaining Functions Together With `%>%`, the Pipe Operator {#RECIPE-R-BASICS-PIPE}
---------------------------------------------------------
### Problem
You want to call one function, then pass the result to another function, and another, in a way that is easily readable.
### Solution
Use `%>%`, the pipe operator. For example:
```{r}
library(dplyr) # The pipe is provided by dplyr
morley # Look at the morley data set
morley %>%
filter(Expt == 1) %>%
summary()
```
This takes the `morley` data set, passes it to the `filter()` function from dplyr, keeping only the rows of the data where `Expt` is equal to 1. Then that result is passed to the `summary()` function, which calculates some summary statistics on the data.
Without the pipe operator, the code above would be written like this:
```{r results='hide'}
summary(filter(morley, Expt == 1))
```
In this code, function calls are processed from the inside outward. From a mathematical viewpoint, this makes perfect sense, but from a readability viewpoint, this can be confusing and hard to read, especially when there are many nested function calls.
### Discussion
This pattern, with the `%>%` operator, is widely used with tidyverse packages, because they contain many functions that do relatively small things. The idea is that these functions are building blocks that allow user to compose the function calls together to produce the desired result.
To illustrate what's going on, here's a simpler example of two equivalent pieces of code:
```{r eval=FALSE}
f(x)
# Equivalent to:
x %>% f()
```
The pipe operator in essence takes the thing that's on the left, and places it as the first argument of the function call that's on the right.
It can be used for multiple function calls, in a *chain*:
```{r eval=FALSE}
h(g(f(x)))
# Equivalent to:
x %>%
f() %>%
g() %>%
h()
```
In a function chain, the lexical ordering of the function calls is the same as the order in which they're computed.
If you want to store the final result, you can use the `<-` operator at the beginning. For example, this will replace the original `x` with the result of the function chain:
```{r eval=FALSE}
x <- x %>%
f() %>%
g() %>%
h()
```
If there are additional arguments for the function calls, they will be shifted to the right when the pipe operator is used. Going back to code from the first example, these two are equivalent:
```{r eval=FALSE}
filter(morley, Expt == 1)
morley %>% filter(Expt == 1)
````
The pipe operator is actually from the magrittr package, but dplyr imports it and makes it available when you call `library(dplyr)`
### See Also
For many more examples of how to use `%>%` in data manipulation, see Chapter \@ref(CHAPTER-DATAPREP).