---
output:
  bookdown::html_document2:
    fig_caption: yes
editor_options:
  chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
# This block needs cache=FALSE to set fig.width and fig.height, and have those
# persist across cached builds.
source("utils.R", local = TRUE)
knitr::opts_chunk$set(
  fig.width = 3.5,
  fig.height = 3.5,
  # Print less for the examples in this chapter
  print_df_rows = c(2, 2)
)
```
Getting Your Data into Shape {#CHAPTER-DATAPREP}
============================
When it comes to making data graphics, half the battle occurs before you call any plotting commands. Before you pass your data to the plotting functions, it must first be read in and given the correct structure. The data sets provided with R are ready to use, but when dealing with real-world data, this usually isn't the case: you'll have to clean up and restructure the data before you can visualize it.
The recipes in this chapter will often use packages from the *tidyverse*. For a little background about the tidyverse, see the introduction section of Chapter \@ref(CHAPTER-R-BASICS). I will also show how to do many of the same tasks using base R, because in some situations it is important to minimize the number of packages you use, and because it is useful to be able to understand code written for base R.
> **Note**
>
> The `%>%` symbol, also known as the pipe operator, is used extensively in this chapter. If you are not familiar with it, see Recipe \@ref(RECIPE-R-BASICS-PIPE).
Most of the tidyverse functions used in this chapter are from the dplyr package, and I'll assume that dplyr is already loaded. You can load it along with the rest of the tidyverse using `library(tidyverse)`, or, if you want to keep things more streamlined, you can load dplyr directly:
```{r eval=FALSE}
library(dplyr)
```
Data sets in R are most often stored in data frames. They're typically used as two-dimensional data structures, with each row representing one case and each column representing one variable. Data frames are essentially lists of vectors and factors, all of the same length, where each vector or factor represents one column.
Here's the `heightweight` data set:
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
```
It consists of five columns, with each row representing one case: a set of information about a single person. We can get a clearer idea of how it's structured by using the `str()` function:
```{r}
str(heightweight)
```
The first column, `sex`, is a factor with two levels, `"f"` and `"m"`, and the other four columns are vectors of numbers (one of them, `ageMonth`, is specifically a vector of integers, but for the purposes here, it behaves the same as any other numeric vector).
Factors and character vectors behave similarly in ggplot -- the main difference is that with character vectors, items will be displayed in lexicographical order, but with factors, items will be displayed in the same order as the factor levels, which you can control.
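To make the ordering difference concrete, here is a small sketch using made-up values (`size_chr` and `size_fct` are names chosen for this example, not part of any data set). Sorting each vector gives roughly the order in which ggplot would lay the items out:

```{r}
# Illustrative values; not from any data set used in this chapter
size_chr <- c("small", "large", "medium")

# Character vector: items sort lexicographically
sort(size_chr)

# Factor: items sort in the order of the levels, which we control
size_fct <- factor(size_chr, levels = c("small", "medium", "large"))
sort(size_fct)
levels(size_fct)
```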
Creating a Data Frame {#RECIPE-DATAPREP-CREATE-DATAFRAME}
---------------------
### Problem
You want to create a data frame from vectors.
### Solution
You can put vectors together in a data frame with `data.frame()`:
```{r}
# Two starting vectors
g <- c("A", "B", "C")
x <- 1:3
dat <- data.frame(g, x)
dat
```
### Discussion
A data frame is essentially a list of vectors and factors. Each vector or factor can be thought of as a column in the data frame.
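Because a data frame is a list under the hood, list functions work on it directly. A quick check, using a small throwaway data frame (`df_demo` is a name invented for this sketch):

```{r}
df_demo <- data.frame(g = c("A", "B", "C"), x = 1:3)

is.list(df_demo)         # A data frame is a list...
length(df_demo)          # ...whose length is the number of columns
sapply(df_demo, length)  # Every column has the same number of elements
df_demo[["x"]]           # List-style extraction returns the column vector
```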
If your vectors are in a list, you can convert the list to a data frame with the `as.data.frame()` function:
```{r}
lst <- list(group = g, value = x) # A list of vectors
dat <- as.data.frame(lst)
```
The tidyverse way of creating a data frame is to use `tibble()` or `as_tibble()`. (Older code often uses `data_frame()` and `as_data_frame()`, with underscores instead of periods, but these are deprecated in current versions of the tibble package.) These functions return a special kind of data frame -- a *tibble* -- which behaves like a regular data frame in most contexts, but prints out more nicely and is specifically designed to play well with the tidyverse functions.
```{r}
tibble(g, x)
```
```{r eval=FALSE}
# Convert the list of vectors to a tibble
as_tibble(lst)
```
A regular data frame can be converted to a tibble using `as_tibble()`:
```{r}
as_tibble(dat)
```
Getting Information About a Data Structure {#RECIPE-DATAPREP-INFO-DATA}
------------------------------------------
### Problem
You want to find out information about an object or data structure.
### Solution
Use the `str()` function:
```{r}
str(ToothGrowth)
```
This tells us that `ToothGrowth` is a data frame with three columns, `len`, `supp`, and `dose`. `len` and `dose` contain numeric values, while `supp` is a factor with two levels.
Another useful function is the `summary()` function:
```{r}
summary(ToothGrowth)
```
Instead of showing you the first few values of each column as `str()` does, `summary()` provides basic descriptive statistics (the minimum, maximum, median, mean, and first & third quartile values) for numeric variables, and tells you the number of values corresponding to each character value or factor level if it is a character or factor variable.
### Discussion
The `str()` function is very useful for finding out more about data structures. One common source of problems is a data frame where one of the columns is a character vector instead of a factor, or vice versa. This can cause puzzling issues with analyses or graphs.
When you print out a data frame the normal way, by just typing the name at the prompt and pressing Enter, factor and character columns appear exactly the same. The difference will be revealed only when you run `str()` on the data frame, or print out the column by itself:
```{r}
tg <- ToothGrowth
tg$supp <- as.character(tg$supp)
str(tg)
```
```{r}
# Print out the columns by themselves
# From old data frame (factor)
ToothGrowth$supp
# From new data frame (character)
tg$supp
```
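If you only need the type of each column, a compact alternative is to apply `class()` to every column with `sapply()`. This sketch works on a separate copy (`tg_demo`, a name made up here) so that it doesn't disturb the objects above:

```{r}
tg_demo <- ToothGrowth
tg_demo$supp <- as.character(tg_demo$supp)

sapply(ToothGrowth, class)  # supp is a factor
sapply(tg_demo, class)      # supp is now a character vector
```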
Adding a Column to a Data Frame {#RECIPE-DATAPREP-ADD-COL}
-------------------------------
### Problem
You want to add a column to a data frame.
### Solution
Use `mutate()` from dplyr to add a new column and assign values to it. This returns a new data frame, which you'll typically want to save over the original.
If you assign a single value to the new column, the entire column will be filled with that value. This adds a column named `newcol`, filled with `NA`:
```{r}
library(dplyr)
ToothGrowth %>%
  mutate(newcol = NA)
```
You can also assign a vector to the new column:
```{r}
# Since ToothGrowth has 60 rows, we must create a new vector that has 60 elements
vec <- rep(c(1, 2), 30)
ToothGrowth %>%
  mutate(newcol = vec)
```
Note that the vector being added to the data frame must either have one element, or the same number of elements as the data frame has rows. In the example above we created a new vector with 60 elements by repeating the values `c(1, 2)` thirty times.
### Discussion
Each column of a data frame is a vector. R handles columns in data frames slightly differently from standalone vectors because all the columns in a data frame must have the same length.
To add a column using base R, you can simply assign values into the new column like so:
```{r}
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth
# Assign NA's for the whole column
ToothGrowth2$newcol <- NA
# Assign 1 and 2, automatically repeating to fill
ToothGrowth2$newcol <- c(1, 2)
```
With base R, the vector being assigned into the data frame will automatically be repeated to fill the number of rows in the data frame, provided that the number of rows is an exact multiple of the vector's length.
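Base R repeats the vector only when the number of rows is an exact multiple of the vector's length. A minimal sketch with a made-up six-row data frame (`rec_demo`) shows both cases. The `tryCatch()` is there only so that the error message is printed rather than halting the document:

```{r}
rec_demo <- data.frame(id = 1:6)

# Length 2 divides 6, so the values are repeated three times
rec_demo$ab <- c(1, 2)
rec_demo$ab

# Length 4 does not divide 6, so base R refuses the assignment
tryCatch(
  rec_demo$bad <- 1:4,
  error = function(e) conditionMessage(e)
)
```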
Deleting a Column from a Data Frame {#RECIPE-DATAPREP-DELETE-COL}
-----------------------------------
### Problem
You want to delete a column from a data frame.
### Solution
Use `select()` from dplyr and specify the columns you want to drop by using `-` (a minus sign). This returns a new data frame, which you'll typically want to save over the original.
```{r eval=FALSE}
# Remove the len column
ToothGrowth %>%
  select(-len)
```
### Discussion
You can list multiple columns that you want to drop at the same time, or conversely specify only the columns that you want to keep. The following two pieces of code are thus equivalent:
```{r eval=FALSE}
# Remove both len and supp from ToothGrowth
ToothGrowth %>%
  select(-len, -supp)
# This keeps just dose, which has the same effect for this data set
ToothGrowth %>%
  select(dose)
```
To remove a column using base R, you can simply assign `NULL` to that column.
```{r eval=FALSE}
ToothGrowth$len <- NULL
```
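Assigning `NULL` modifies the data frame in place. If you would rather get a new data frame with several columns removed, one base R approach (a sketch, not the only way) is to index with a logical vector built from the column names. The names `drop_cols` and `tg_dropped` are invented for this example:

```{r}
drop_cols <- c("len", "supp")  # Columns to remove

# Keep every column whose name is not in drop_cols;
# drop = FALSE keeps the result a data frame even if only one column remains
tg_dropped <- ToothGrowth[, !(names(ToothGrowth) %in% drop_cols), drop = FALSE]
names(tg_dropped)
```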
### See Also
Recipe \@ref(RECIPE-DATAPREP-SUBSET) for more on getting a subset of a data frame.
See `?select` for more ways to drop and keep columns.
Renaming Columns in a Data Frame {#RECIPE-DATAPREP-RENAME-COL}
--------------------------------
### Problem
You want to rename the columns in a data frame.
### Solution
Use `rename()` from dplyr. This returns a new data frame:
```{r eval=FALSE}
ToothGrowth %>%
  rename(length = len)
```
### Discussion
You can rename multiple columns within the same call to `rename()`:
```{r}
ToothGrowth %>%
  rename(
    length = len,
    supplement_type = supp
  )
```
Renaming a column using base R is a bit more verbose. It uses the `names()` function on the left side of the `<-` operator.
```{r}
# Make a copy of ToothGrowth for this example
ToothGrowth2 <- ToothGrowth
names(ToothGrowth2) # Print the names of the columns
# Rename "len" to "length"
names(ToothGrowth2)[names(ToothGrowth2) == "len"] <- "length"
names(ToothGrowth2) # Print the names again to confirm the change
```
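To rename several columns at once in base R, one option is a named lookup vector combined with `match()`. This is a sketch that assumes every old name in the lookup is present in the data frame; `tg_renamed`, `new_names`, and `idx` are names invented for the example:

```{r}
tg_renamed <- ToothGrowth

# Names are the old column names, values are the new ones
new_names <- c(len = "length", supp = "supplement_type")

# Find the position of each old name, then overwrite those positions
idx <- match(names(new_names), names(tg_renamed))
names(tg_renamed)[idx] <- new_names
names(tg_renamed)
```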
### See Also
See `?select` for more ways to rename columns within a data frame.
Reordering Columns in a Data Frame {#RECIPE-DATAPREP-REORDER-COL}
----------------------------------
### Problem
You want to change the order of columns in a data frame.
### Solution
Use `select()` from dplyr.
```{r}
ToothGrowth %>%
  select(dose, len, supp)
```
The new data frame will contain the columns you specified in `select()`, in the order you specified. Note that `select()` returns a new data frame, so if you want to change the original variable, you'll need to save the new result over it.
### Discussion
If you are only reordering a few variables and want to keep the rest of the variables in order, you can use `everything()` as a placeholder:
```{r}
ToothGrowth %>%
  select(dose, everything())
```
See `?select_helpers` for other ways to select columns. You can, for example, select columns by matching parts of the name.
Using base R, you can also reorder columns by their name or numeric position. This returns a new data frame, which can be saved over the original.
```{r eval=FALSE}
ToothGrowth[c("dose", "len", "supp")]
ToothGrowth[c(3, 1, 2)]
```
In these examples, I used list-style indexing. A data frame is essentially a list of vectors, and indexing into it as a list will return another data frame. You can get the same effect with matrix-style indexing:
```{r eval=FALSE}
ToothGrowth[c("dose", "len", "supp")] # List-style indexing
ToothGrowth[, c("dose", "len", "supp")] # Matrix-style indexing
```
In this case, both methods return the same result, a data frame. However, when retrieving a single column, list-style indexing will return a data frame, while matrix-style indexing will return a vector:
```{r}
ToothGrowth["dose"]
ToothGrowth[, "dose"]
```
You can use `drop=FALSE` to ensure that it returns a data frame:
```{r}
ToothGrowth[, "dose", drop=FALSE]
```
Getting a Subset of a Data Frame {#RECIPE-DATAPREP-SUBSET}
--------------------------------
### Problem
You want to get a subset of a data frame.
### Solution
Use `filter()` to get the rows, and `select()` to get the columns you want. These operations can be chained together using the `%>%` operator. These functions return a new data frame, so if you want to change the original variable, you'll need to save the new result over it.
We'll use the `climate` data set for the examples here:
```{r}
library(gcookbook) # Load gcookbook for the climate data set
climate
```
Let's say that you only want to keep rows where `Source` is `"Berkeley"` and where the year is between 1900 and 2000, inclusive. You can do so with the `filter()` function:
```{r eval=FALSE}
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000)
```
If you want only the `Year` and `Anomaly10y` columns, use `select()`, as we did in Recipe \@ref(RECIPE-DATAPREP-DELETE-COL):
```{r}
climate %>%
  select(Year, Anomaly10y)
```
These operations can be chained together using the `%>%` operator:
```{r}
climate %>%
  filter(Source == "Berkeley" & Year >= 1900 & Year <= 2000) %>%
  select(Year, Anomaly10y)
```
### Discussion
The `filter()` function picks out rows based on a condition. If you want to pick out rows based on their numeric position, use the `slice()` function:
```{r eval=FALSE}
slice(climate, 1:100)
```
I generally recommend indexing using names rather than numbers when possible. It makes the code easier to understand when you're collaborating with others or when you come back to it months or years after writing it, and it makes the code less likely to break when there are changes to the data, such as when columns are added or removed.
With base R, you can get a subset of rows like this:
```{r}
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000, ]
```
Notice that we needed to prefix each column name with `climate$`, and that there's a comma after the selection criteria. This indicates that we're getting rows, not columns.
This row filtering can also be combined with the column selection from Recipe \@ref(RECIPE-DATAPREP-DELETE-COL), by giving the column names after the comma:
```{r}
climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]
```
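Base R also provides `subset()`, which lets you refer to columns without the data frame prefix. The sketch below uses the built-in `airquality` data set rather than `climate`, purely so that it runs without gcookbook, and the filtering values are arbitrary. Keep in mind that `?subset` describes it as a convenience function intended for interactive use:

```{r}
# Rows from May with Temp above 70; keep only Day and Temp
aq_sub <- subset(airquality, Month == 5 & Temp > 70, select = c(Day, Temp))
aq_sub

# The equivalent with square-bracket indexing
airquality[airquality$Month == 5 & airquality$Temp > 70, c("Day", "Temp")]
```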
Changing the Order of Factor Levels {#RECIPE-DATAPREP-FACTOR-REORDER}
-----------------------------------
### Problem
You want to change the order of levels in a factor.
### Solution
Pass the factor to `factor()`, and give it the levels in the order you want. This returns a new factor, so if you want to change the original variable, you'll need to save the new result over it.
```{r}
# By default, levels are ordered alphabetically
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes
factor(sizes, levels = c("small", "medium", "large"))
```
The order can also be specified with `levels` when the factor is first created:
```{r eval=FALSE}
factor(c("small", "large", "large", "small", "medium"),
       levels = c("small", "medium", "large"))
```
### Discussion
There are two kinds of factors in R: ordered factors and regular factors. (In practice, ordered factors are not commonly used.) In both types, the levels are arranged in *some* order; the difference is that the order is meaningful for an ordered factor, but it is arbitrary for a regular factor -- it simply reflects how the data is stored. For plotting data, the distinction between ordered and regular factors is generally unimportant, and they can be treated the same.
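Here is a short sketch of the difference, using made-up values (`sz_vals`, `sz_lvls`, `sz_regular`, and `sz_ordered` are names chosen for this example). Comparison operators are meaningful only for the ordered factor:

```{r}
sz_vals <- c("small", "large", "medium")
sz_lvls <- c("small", "medium", "large")

sz_regular <- factor(sz_vals, levels = sz_lvls)
sz_ordered <- factor(sz_vals, levels = sz_lvls, ordered = TRUE)

is.ordered(sz_regular)
is.ordered(sz_ordered)

# With an ordered factor, comparisons follow the level order
sz_ordered < "large"
```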
The order of factor levels affects graphical output. When a factor variable is mapped to an aesthetic property in ggplot, the aesthetic adopts the ordering of the factor levels. If a factor is mapped to the x-axis, the ticks on the axis will be in the order of the factor levels, and if a factor is mapped to color, the items in the legend will be in the order of the factor levels.
To reverse the level order, you can use `rev(levels())`:
```{r eval=FALSE}
factor(sizes, levels = rev(levels(sizes)))
```
The tidyverse function for reordering factors is `fct_relevel()` from the forcats package. It has a syntax similar to the `factor()` function from base R.
```{r}
# Change the order of levels
library(forcats)
fct_relevel(sizes, "small", "medium", "large")
```
### See Also
To reorder a factor based on the value of another variable, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE).
Reordering factor levels is useful for controlling the order of axes and legends. See Recipes \@ref(RECIPE-AXIS-ORDER) and \@ref(RECIPE-LEGEND-ORDER) for more information.
Changing the Order of Factor Levels Based on Data Values {#RECIPE-DATAPREP-FACTOR-REORDER-VALUE}
--------------------------------------------------------
### Problem
You want to change the order of levels in a factor based on values in the data.
### Solution
Use `reorder()` with the factor that has levels to reorder, the values to base the reordering on, and a function that aggregates the values:
```{r}
# Make a copy of the InsectSprays data set since we're modifying it
iss <- InsectSprays
iss$spray
iss$spray <- reorder(iss$spray, iss$count, FUN = mean)
iss$spray
```
Notice that the original levels were `ABCDEF`, while the reordered levels are `CEDABF`. What we've done is reorder the levels of `spray` based on the mean value of `count` for each level of `spray`.
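You can confirm where that order comes from by computing the group means yourself. In this sketch, `tapply()` calculates the mean `count` for each `spray`, and sorting those means gives the same sequence as the reordered levels (`spray_means` is a name made up for the example):

```{r}
spray_means <- tapply(InsectSprays$count, InsectSprays$spray, mean)
sort(spray_means)

# The sorted names match the levels produced by reorder()
names(sort(spray_means))
levels(reorder(InsectSprays$spray, InsectSprays$count, FUN = mean))
```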
### Discussion
The usefulness of `reorder()` might not be obvious from just looking at the raw output. Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE) shows three plots made with `reorder()`. In these plots, the order in which the items appear is determined by their values.
```{r FIG-DATAPREP-FACTOR-REORDER-VALUE, echo=FALSE, fig.show="hold", fig.cap="Original data (left); Reordered by the mean of each group (middle); Reordered by the median of each group (right)", fig.height=2.5, fig.width=3}
ggplot(InsectSprays, aes(spray, count)) +
  geom_boxplot()
ggplot(InsectSprays, aes(reorder(spray, count, FUN = mean), count)) +
  geom_boxplot()
ggplot(InsectSprays, aes(reorder(spray, count, FUN = median), count)) +
  geom_boxplot()
```
In the middle plot in Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE), the boxes are sorted by the mean. The horizontal line that runs across each box represents the *median* of the data. Notice that these values do not increase strictly from left to right. That's because with this particular data set, sorting by the mean gives a different order than sorting by the median. To make the median lines increase from left to right, as in the plot on the right in Figure \@ref(fig:FIG-DATAPREP-FACTOR-REORDER-VALUE), we used the `median()` function in `reorder()`.
The tidyverse function for reordering factors is `fct_reorder()`, and it is used the same way as `reorder()`. These do the same thing:
```{r eval=FALSE}
reorder(iss$spray, iss$count, FUN = mean)
fct_reorder(iss$spray, iss$count, .fun = mean)
```
### See Also
Reordering factor levels is also useful for controlling the order of axes and legends. See Recipes \@ref(RECIPE-AXIS-ORDER) and \@ref(RECIPE-LEGEND-ORDER) for more information.
Changing the Names of Factor Levels {#RECIPE-DATAPREP-FACTOR-RENAME}
-----------------------------------
### Problem
You want to change the names of levels in a factor.
### Solution
Use `fct_recode()` from the forcats package:
```{r}
sizes <- factor(c( "small", "large", "large", "small", "medium"))
sizes
# Pass it a named vector with the mappings
fct_recode(sizes, S = "small", M = "medium", L = "large")
```
### Discussion
If you want to use two vectors, one with the original levels and one with the new ones, use `do.call()` with `fct_recode()`.
```{r}
old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(old, new)
mappings
# Create a list of the arguments to pass to fct_recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
# Use do.call to call fct_recode with the arguments
do.call(fct_recode, args)
```
Or, more concisely, we can do all of that in one go:
```{r}
do.call(
  fct_recode,
  c(list(sizes), setNames(c("small", "medium", "large"), c("S", "M", "L")))
)
```
For a more traditional (and clunky) base R method for renaming factor levels, use the `levels()<-` function:
```{r}
sizes <- factor(c( "small", "large", "large", "small", "medium"))
# Index into the levels and rename each one
levels(sizes)[levels(sizes) == "large"] <- "L"
levels(sizes)[levels(sizes) == "medium"] <- "M"
levels(sizes)[levels(sizes) == "small"] <- "S"
sizes
```
If you are renaming *all* your factor levels, there is a simpler method. You can pass a list to `levels()<-`:
```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes) <- list(S = "small", M = "medium", L = "large")
sizes
```
With this method, all factor levels must be specified in the list; if any are missing, they will be replaced with `NA`.
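The following sketch shows what happens when a level is left out of the list. It works on a separate factor (`sizes_demo`, a name made up here) so that the `sizes` object above is untouched:

```{r}
sizes_demo <- factor(c("small", "large", "large", "small", "medium"))

# "medium" is omitted from the list, so those items become NA
levels(sizes_demo) <- list(S = "small", L = "large")
sizes_demo
```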
It's also possible to rename factor levels by position, but this is somewhat inelegant:
```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)[1] <- "L"
sizes
# Rename all levels at once
levels(sizes) <- c("L", "M", "S")
sizes
```
It's safer to rename factor levels by name rather than by position, since you will be less likely to make a mistake (and mistakes here may be hard to detect). Also, if your input data set changes to have more or fewer levels, the numeric positions of the existing levels could change, which could cause serious but nonobvious problems for your analysis.
### See Also
If, instead of a factor, you have a character vector with items to rename, see Recipe \@ref(RECIPE-DATAPREP-CHARACTER-RENAME).
Removing Unused Levels from a Factor {#RECIPE-DATAPREP-FACTOR-DROPLEVELS}
------------------------------------
### Problem
You want to remove unused levels from a factor.
### Solution
Sometimes, after processing your data you will have a factor that contains levels that are no longer used. Here's an example:
```{r}
sizes <- factor(c("small", "large", "large", "small", "medium"))
sizes <- sizes[1:3]
sizes
```
To remove them, use `droplevels()`:
```{r}
droplevels(sizes)
```
### Discussion
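The `droplevels()` function preserves the order of factor levels. It can also be applied to a whole data frame, in which case it drops unused levels from every factor column; use the `except` parameter to specify columns whose levels should be left as they are.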
The tidyverse way: Use `fct_drop()` from the forcats package:
```{r}
fct_drop(sizes)
```
Changing the Names of Items in a Character Vector {#RECIPE-DATAPREP-CHARACTER-RENAME}
-------------------------------------------------
### Problem
You want to change the names of items in a character vector.
### Solution
Use `recode()` from the dplyr package:
```{r}
library(dplyr)
sizes <- c("small", "large", "large", "small", "medium")
sizes
# With recode(), pass it a named vector with the mappings
recode(sizes, small = "S", medium = "M", large = "L")
# Can also use quotes -- useful if there are spaces or other strange characters
recode(sizes, "small" = "S", "medium" = "M", "large" = "L")
```
### Discussion
If you want to use two vectors, one with the original values and one with the new ones, use `do.call()` with `recode()`.
```{r}
old <- c("small", "medium", "large")
new <- c("S", "M", "L")
# Create a named vector that has the mappings between old and new
mappings <- setNames(new, old)
mappings
# Create a list of the arguments to pass to recode
args <- c(list(sizes), mappings)
# Look at the structure of the list
str(args)
# Use do.call to call recode with the arguments
do.call(recode, args)
```
Or, more concisely, we can do all of that in one go:
```{r}
do.call(
  recode,
  c(list(sizes), setNames(c("S", "M", "L"), c("small", "medium", "large")))
)
```
Note that for `recode()`, the names and values of the arguments are reversed, compared to the `fct_recode()` function from the forcats package. With `recode()`, you would use `small="S"`, whereas for `fct_recode()`, you would use `S="small"`.
A more traditional R method is to use square-bracket indexing to select the items and rename them:
```{r}
sizes <- c("small", "large", "large", "small", "medium")
sizes[sizes == "small"] <- "S"
sizes[sizes == "medium"] <- "M"
sizes[sizes == "large"] <- "L"
sizes
```
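When there are many values to rename, writing one assignment per value gets tedious. A common base R idiom, sketched here with names invented for the example (`sizes_chr` and `lookup`), is to index a named vector by the original values. Any value missing from the lookup would come back as `NA`:

```{r}
sizes_chr <- c("small", "large", "large", "small", "medium")

# Names are the old values, elements are the new values
lookup <- c(small = "S", medium = "M", large = "L")

# Indexing by name returns the new value for each item;
# unname() drops the leftover old names from the result
unname(lookup[sizes_chr])
```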
### See Also
If, instead of a character vector, you have a factor with levels to rename, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).
Recoding a Categorical Variable to Another Categorical Variable {#RECIPE-DATAPREP-RECODE-CATEGORICAL}
---------------------------------------------------------------
### Problem
You want to recode a categorical variable to another categorical variable.
### Solution
For the examples here, we'll use a subset of the `PlantGrowth` data set:
```{r}
# Work on a subset of the PlantGrowth data set
pg <- PlantGrowth[c(1,2,11,21,22), ]
pg
```
In this example, we'll recode the categorical variable `group` into another categorical variable, `treatment`. If the old value was `"ctrl"`, the new value will be `"No"`, and if the old value was `"trt1"` or `"trt2"`, the new value will be `"Yes"`.
This can be done with the `recode()` function from the dplyr package:
```{r}
library(dplyr)
recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```
You can assign it as a new column in the data frame:
```{r eval=FALSE}
pg$treatment <- recode(pg$group, ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```
Note that since the input was a factor, it returns a factor. If you want to get a character vector instead, use `as.character()`:
```{r}
recode(as.character(pg$group), ctrl = "No", trt1 = "Yes", trt2 = "Yes")
```
### Discussion
You can also use the `fct_recode()` function from the forcats package. It works the same, except the names and values are swapped, which may be a little more intuitive:
```{r}
library(forcats)
fct_recode(pg$group, No = "ctrl", Yes = "trt1", Yes = "trt2")
```
Another difference is that `fct_recode()` will always return a factor, whereas `recode()` will return a character vector if it is given a character vector, and will return a factor if it is given a factor. (Although dplyr does have a `recode_factor()` function which also always returns a factor.)
Using base R, recoding can be done with the `match()` function:
```{r}
oldvals <- c("ctrl", "trt1", "trt2")
newvals <- factor(c("No", "Yes", "Yes"))
newvals[ match(pg$group, oldvals) ]
```
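The key step is that `match()` returns, for each item in its first argument, the position of that item in the second. Those positions are then used to index the vector of new values. A small sketch with stand-alone values (`grp_demo`, `old_demo`, `new_demo`, and `pos` are made up for illustration):

```{r}
grp_demo <- c("ctrl", "trt2", "trt1", "ctrl")
old_demo <- c("ctrl", "trt1", "trt2")
new_demo <- c("No", "Yes", "Yes")

# Position of each item of grp_demo within old_demo
pos <- match(grp_demo, old_demo)
pos

# Use the positions to look up the new values
new_demo[pos]
```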
It can also be done by indexing in the vectors:
```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```
```{r}
pg$treatment[pg$group == "ctrl"] <- "No"
pg$treatment[pg$group == "trt1"] <- "Yes"
pg$treatment[pg$group == "trt2"] <- "Yes"
# Convert to a factor
pg$treatment <- factor(pg$treatment)
pg
```
Here, we combined two of the factor levels and put the result into a new column. If you simply want to rename the levels of a factor, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).
The coding criteria can also be based on values in multiple columns, by using the `&` and `|` operators:
```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```
```{r}
pg$newcol[pg$group == "ctrl" & pg$weight < 5] <- "no_small"
pg$newcol[pg$group == "ctrl" & pg$weight >= 5] <- "no_large"
pg$newcol[pg$group == "trt1"] <- "yes"
pg$newcol[pg$group == "trt2"] <- "yes"
pg$newcol <- factor(pg$newcol)
pg
```
It's also possible to combine two columns into one using the `interaction()` function, which joins the values with a `.` in between. This combines the `weight` and `group` columns into a new column, `weightgroup`:
```{r echo=FALSE}
# Reset the data
pg <- PlantGrowth[c(1,2,11,21,22), ]
```
```{r}
pg$weightgroup <- interaction(pg$weight, pg$group)
pg
```
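`interaction()` also takes a `sep` argument if you want a different separator. A minimal sketch with made-up values (`grp_a` and `grp_b` are names chosen for this example):

```{r}
grp_a <- c("a", "b", "a")
grp_b <- c("x", "y", "y")

# Default separator is "."
interaction(grp_a, grp_b)

# Use an underscore instead
interaction(grp_a, grp_b, sep = "_")
```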
### See Also
For more on renaming factor levels, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-RENAME).
See Recipe \@ref(RECIPE-DATAPREP-RECODE-CONTINUOUS) for recoding continuous values to categorical values.
Recoding a Continuous Variable to a Categorical Variable {#RECIPE-DATAPREP-RECODE-CONTINUOUS}
--------------------------------------------------------
### Problem
You want to recode a continuous variable to a categorical variable.
### Solution
Use the `cut()` function. In this example, we'll use the `PlantGrowth` data set and recode the continuous variable `weight` into a categorical variable, `wtclass`, using the `cut()` function:
```{r}
pg <- PlantGrowth
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf))
pg
```
### Discussion
For three categories we specify four bounds, which can include `Inf` and `-Inf`. If a data value falls outside of the specified bounds, it's categorized as `NA`. The result of `cut()` is a factor, and you can see from the example that the factor levels are named after the bounds.
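To see how values outside the bounds are handled, here is a sketch with a few hand-picked numbers (`w_demo` is made up and not taken from `PlantGrowth`):

```{r}
w_demo <- c(-1, 0, 3, 5, 5.5, 7)

# -1 is below the lowest bound, and 0 sits exactly on the lowest bound,
# which is excluded by default, so both become NA
cut(w_demo, breaks = c(0, 5, 6, Inf))
```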
To change the names of the levels, set the labels:
```{r}
pg$wtclass <- cut(pg$weight, breaks = c(0, 5, 6, Inf),
labels = c("small", "medium", "large"))
pg
```
As indicated by the factor levels, the bounds are by default *open* on the left and *closed* on the right. In other words, each category excludes its lower bound but includes its upper bound. For the smallest category, you can have it include both the lower and upper bounds by setting `include.lowest=TRUE`. In this example, this would result in values of exactly 0 going into the small category; otherwise, 0 would be coded as `NA`.
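The effect of `include.lowest` is easiest to see with values that sit exactly on the bounds. This sketch uses a made-up vector `x` for illustration:

```r
# Hypothetical values that fall exactly on the break points
x <- c(0, 5, 6)

# Default: 0 is not included in the lowest category, so it becomes NA
default_cut <- cut(x, breaks = c(0, 5, 6, Inf))
default_cut

# With include.lowest = TRUE, the lowest category becomes [0,5]
lowest_cut <- cut(x, breaks = c(0, 5, 6, Inf), include.lowest = TRUE)
lowest_cut
```

Only the first category changes; the remaining categories are still open on the left and closed on the right.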
If you want the categories to be closed on the left and open on the right, set `right = FALSE`:
```{r}
cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
```
### See Also
To recode a categorical variable to another categorical variable, see Recipe \@ref(RECIPE-DATAPREP-RECODE-CATEGORICAL).
Calculating New Columns From Existing Columns {#RECIPE-DATAPREP-CALCULATE}
-----------------------
### Problem
You want to calculate a new column of values in a data frame.
### Solution
Use `mutate()` from the dplyr package.
```{r}
library(gcookbook) # Load gcookbook for the heightweight data set
heightweight
```
This will convert `heightIn` to centimeters and store it in a new column, `heightCm`:
```{r}
library(dplyr)
heightweight %>%
mutate(heightCm = heightIn * 2.54)
```
This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.
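As a brief sketch of what saving the result looks like, the following assigns the output of `mutate()` to a new variable. The name `hw` is an arbitrary choice for illustration; assigning back to `heightweight` would overwrite your working copy of the data instead:

```r
library(gcookbook)  # For the heightweight data set
library(dplyr)

# Save the result in a new variable; heightweight itself is unchanged
hw <- heightweight %>%
  mutate(heightCm = heightIn * 2.54)

names(hw)
names(heightweight)
```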
### Discussion
You can use `mutate()` to transform multiple columns at once:
```{r}
heightweight %>%
mutate(
heightCm = heightIn * 2.54,
weightKg = weightLb / 2.204
)
```
It is also possible to calculate a new column based on multiple columns:
```{r eval=FALSE}
heightweight %>%
mutate(bmi = weightKg / (heightCm / 100)^2)
```
With `mutate()`, the columns are added sequentially. That means that we can reference a newly created column when calculating another new column:
```{r}
heightweight %>%
mutate(
heightCm = heightIn * 2.54,
weightKg = weightLb / 2.204,
bmi = weightKg / (heightCm / 100)^2
)
```
With base R, calculating a new column can be done by referencing the new column with the `$` operator and assigning some values to it:
```{r, eval=FALSE}
heightweight$heightCm <- heightweight$heightIn * 2.54
```
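The same approach extends to columns that depend on other new columns, as long as the assignments are made in order. This is a rough base R equivalent of the `bmi` calculation above, working on a copy named `hw` (an illustrative name) so that the original data is left alone:

```r
library(gcookbook)  # For the heightweight data set

# Work on a copy so heightweight is not modified
hw <- heightweight

# Each assignment can use columns created by the earlier ones
hw$heightCm <- hw$heightIn * 2.54
hw$weightKg <- hw$weightLb / 2.204
hw$bmi      <- hw$weightKg / (hw$heightCm / 100)^2

head(hw)
```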
### See Also
See Recipe \@ref(RECIPE-DATAPREP-CALCULATE-GROUP) for how to perform group-wise transformations on data.
Calculating New Columns by Groups {#RECIPE-DATAPREP-CALCULATE-GROUP}
-------------------------------
### Problem
You want to create new columns that are the result of calculations performed on groups of data, as specified by a grouping column.
### Solution
Use `group_by()` from the dplyr package to specify the grouping variable, and then specify the operations in `mutate()`:
```{r}
library(MASS) # Load MASS for the cabbages data set
library(dplyr)
cabbages %>%
group_by(Cult) %>%
mutate(DevWt = HeadWt - mean(HeadWt))
```
This returns a new data frame, so if you want to replace the original variable, you will need to save the result over it.
### Discussion
Let's take a closer look at the `cabbages` data set. It has two grouping variables (factors): `Cult`, which has levels `c39` and `c52`, and `Date`, which has levels `d16`, `d20`, and `d21`. It also has two measured numeric variables, `HeadWt` and `VitC`:
```{r}
cabbages
```
Suppose we want to find, for each case, the deviation of `HeadWt` from the overall mean. All we have to do is take the overall mean and subtract it from the observed value for each case:
```{r}
mutate(cabbages, DevWt = HeadWt - mean(HeadWt))
```
You'll often want to do separate operations like this for each group, where the groups are specified by one or more grouping variables. Suppose, for example, we want to normalize the data within each group by finding the deviation of each case from the mean *within the group*, where the groups are specified by `Cult`. In these cases, we can use `group_by()` and `mutate()` together:
```{r}
cb <- cabbages %>%
group_by(Cult) %>%
mutate(DevWt = HeadWt - mean(HeadWt))
```
First, `group_by()` groups `cabbages` based on the value of `Cult`, which has two levels, `c39` and `c52`. The `mutate()` call is then applied to each group separately, so `mean(HeadWt)` is the mean within that group rather than the overall mean.
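One way to check that the normalization worked as intended is to summarize the result by group: the mean of `DevWt` within each level of `Cult` should be zero, up to floating-point error. This is a small sketch, with `cb_check` as an illustrative name:

```r
library(MASS)   # For the cabbages data set
library(dplyr)

cb <- cabbages %>%
  group_by(Cult) %>%
  mutate(DevWt = HeadWt - mean(HeadWt))

# cb is still grouped by Cult, so summarise() returns one row per group
cb_check <- cb %>%
  summarise(
    mean_HeadWt = mean(HeadWt),
    mean_DevWt  = mean(DevWt)
  )
cb_check
```

Because the result of `group_by()` followed by `mutate()` remains grouped, later operations on `cb` will also be performed by group. If that is not what you want, you can remove the grouping with `ungroup()`.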
The before and after results are shown in Figure \@ref(fig:FIG-DATAPREP-CALCULATE-GROUP):
```{r FIG-DATAPREP-CALCULATE-GROUP, fig.show="hold", fig.cap="Before normalizing (left); After normalizing (right)"}