Data frames, which are analogous to Excel spreadsheets for which the entries in each column are of the same "type" and each column has the same number of rows, are a fundamental way of handling data in R.
I'll first review various ways of extracting subsets of a data frame, and then review several ways to construct a data frame from multiple vectors or from smaller data frames. This is a fairly long article - it is not intended to be read all at one time.
R often has several equivalent ways of doing something. Perhaps?? this is from R having been developed in a collaborative fashion by several people with their own favorite ways of doing various programming constructs, so they all got included. One can choose one's favorite ways of doing things, but one needs to be familiar with all the commonly used constructs in order to be able to understand code written by others (and, importantly, to understand code in instructions and examples for R packages one wants to use).
This material is taken/modified from a pinned post of mine "Examples of extracting as a vector a single column from a data frame" in the Week 2 Discussion Forum for the R Programming Course in the Johns Hopkins Data Science Specialization on Coursera.
If you are taking this course, then note in the Week 1 pinned posts in the Discussion Forum, that Leonard Greski has written a very good more general article on getting row and column subsets from a data frame "Forms of the Extract Operator in R" (this article also contains some more advanced material covered later in the R course, so read what is relevant to where you are in the R Programming course and refer back later as you learn more R); and one is also well advised to read Al Warren's pinned post in the Week 2 Discussion Forum "Subsetting with bracket notation".
Getting (often referred to as extracting) a single column from a data frame is a common step in an R function, and one usually will want to get the column in the form of a vector, not as a data frame with that one column.
Let's see how to get, as a vector, for example the sulfate column of a simple example data frame;
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
To get the sulfate column as vector you can do either of the following 5 equivalent statements:
df[["sulfate"]] # double brackets
[1] 4.79 1.46 4.28 NA
# or
df$sulfate # note sulfate does not need to be in quotes for the $ form of extraction
# (but a text string with blanks in it would need to be)
[1] 4.79 1.46 4.28 NA
# or
df[, "sulfate"] # single brackets but note the comma so we get all the rows in the sulfate column
[1] 4.79 1.46 4.28 NA
# or one could use the column number
df[[1]]
[1] 4.79 1.46 4.28 NA
df[, 1]
[1] 4.79 1.46 4.28 NA
# Note that df["sulfate"] # single brackets, no comma,
# is a 1 column data frame containing the sulfate column; if you are getting
# 1 column from a data frame you will usually want it as a vector
df["sulfate"] # single brackets gives a data frame
sulfate
1 4.79
2 1.46
3 4.28
4 NA
class(df["sulfate"])
[1] "data.frame"
# Note for example the mean function "expects" a vector and
# will return NA and give a not very informative message if you
# "feed it" a data frame
mean(df["sulfate"])
[1] NA
Warning message:
In mean.default(df["sulfate"]) :
argument is not numeric or logical: returning NA
# If pollutant is an R variable containing the text string "sulfate"
# then these will work to extract the column as a vector
pollutant <- "sulfate"
df[[pollutant]]
[1] 4.79 1.46 4.28 NA
# or
df[, pollutant]
[1] 4.79 1.46 4.28 NA
# BUT NOT
df$pollutant
NULL
df$pollutant
does NOT work since pollutant is NOT an actual column name; it is a variable containing
the text string sulfate which is not acceptable for the $ form of getting/extracting a column from a data frame as a vector (those are the R "rules" and we have to live with them). And note R does NOT even warn you about this type of mistake - it just cheerfully gives back NULL which can lead to v e r y mysterious bugs. Similarly, mistyping the name of a column in the following example commands results in NULL (with NO warning): df$sulfffate
and also df[["sulffffate"]]
. Programming requires very careful attention to details - one might be tempted to think R should be able to "figure out" what you meant, but recall what type of mischief an "auto correct" in a word processor or message app can create - and in a programming language you wouldn't even get to view in real time what the "compiler" had done to your code. Better to know that if you program correctly exactly what you want, some "gremlin" won't be changing it!
If v is a vector of row indices (that are in the range of the number of rows of the data frame df) one can get the rows of df corresponding to v
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v <- c(1, 3, 2, 2, 2)
df[v, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
2 1.46 NA
2.1 1.46 NA
2.2 1.46 NA
# note reordering and repeats are allowed
# note the indication of repeats in the row numbers R generates (R "does not like"
# duplicate row names and so does modifications to make them unique)
Note R does not issue warnings or errors for indices that are "out of range", it just fills in NA's
v <- c(1, 3, 2, 6, 2) # 6 is out of the range of the number of rows of df
df[v, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
2 1.46 NA
NA NA NA
2.1 1.46 NA
One can use negative row indices to exclude those rows:
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v <- c(-1, -3) # exclude rows 1, 3 and keep the rest
df[v, ]
sulfate nitrate
2 1.46 NA
4 NA 3.56
One can also specify desired columns (and repeats of columns)
v <- c(1, 3, 2, 2, 2)
# if we just want the first column (sulfate) with these rows
# we can do
df[v, 1]
[1] 4.79 4.28 1.46 1.46 1.46
# or
df[v, "sulfate"]
[1] 4.79 4.28 1.46 1.46 1.46
# or
df[v, ]$sulfate
[1] 4.79 4.28 1.46 1.46 1.46
# we can also specify several columns
w <- c(1, 2, 1, 2)
df[v, w]
sulfate nitrate sulfate.1 nitrate.1
1 4.79 0.299 4.79 0.299
3 4.28 4.280 4.28 4.280
2 1.46 NA 1.46 NA
2.1 1.46 NA 1.46 NA
2.2 1.46 NA 1.46 NA
Note R "does not like" duplicate column names and so does modifications to make them unique.
Also one can use a logical vector V having the same number of rows as df; rows where the corresponding entry of V is TRUE are kept, rows where the corresponding entry of V is FALSE are not kept.
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
logicalVector <- c(T, F, F, T)
df[logicalVector, ]
sulfate nitrate
1 4.79 0.299
4 NA 3.560
If one has a vector V, one can ask for which rows of V is some logical condition TRUE; R's which function does this: the conceptual description is
which(some logical condition on each entry of V)
returns the vector of the indices of V for which the the condition is TRUE (Any entries of V that are NA will be considered to yield FALSE, so those indices of V will not be included in the result.) If there are no indices for which the condition is TRUE, the which function returns an empty integer vector (integer(0))
For example
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
result <- which(df[["sulfate"]] > 2)
result
[1] 1 3
# any entries of V that are NA are considered to yield FALSE
# one can then use result to get the rows of df for which the condition was TRUE
df[result, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
# Note the result of having an NA involved in the following:
df # repeating what df is
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
V <- df[["sulfate"]] > 2 # a logical vector with a value for each row of df
V
[1] TRUE FALSE TRUE NA
df[V, ] # keeps rows of the data frame where V is TRUE, but note the effect of the NA
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
NA NA NA
I like the conceptual viewpoint of the which function. Apparently it is rather universal in that, for example, IDL has the corresponding function called where and MATLAB has a corresponding function called find
The %in% function addresses the question of whether or not each entry of some vector v occurs in another vector w. It returns a logical vector z with z[k] being TRUE if v[k] is equal to some entry in w, and z[k] FALSE if v[k] is not equal some entry in w. This can be used to obtain a logical vector for use in selecting rows of a data frame
An example of how %in% behaves:
?"%in%" # look at the help on %in% Note because of the special
# character % one needs to "protect" %in% by enclosing it in
# either quotes or apostrophes
v <- c(1, 2, 3, 4, 5, NA)
w <- c(12, 3, 8, 22, 4)
v %in% w
[1] FALSE FALSE TRUE TRUE FALSE FALSE
v <- c(1, 2, 3, 4, 5, NA)
w <- c(12, 3, 8, 22, 4, NA)
v %in% w
[1] FALSE FALSE TRUE TRUE FALSE TRUE
# note %in% will declare a match for an NA in v if there is an NA in w
As noted above, data frames are a fundamental way that R handles data. Many data files that are text files (as opposed to binary files) are naturally suitable for reading in as a data frame using for example read.csv (where the column separator (delimiter) is a comma), or more generally read.table The options for read.table also apply for read.csv but note some of the important default choices are different, in particular the default for "telling R" whether there is a header line containing column names is header = TRUE for read.csv and header = FALSE for read.table, and the default column separator for read.csv is sep = ","
while for read.table one should usually specify it since the default is "white space"; a common column separator other than comma is a tab, which is specified by sep = "\t"
(the backslash "tells" R to interpret the t in a special way).
To start with, when learning R, there are 2 other options one should be aware of: stringsAsFactors = FALSE "tells" R to read in character columns as character data, not as factors which is the default (unless a column is to be used as a factor in a statistical analysis it is likely better to have it read in as character data).
The second option one should be aware of when starting to learn read.csv and read.table is na.strings This option lets one specify the character string (or several character strings) that should be interpreted as NA (the default is "NA"
). For example na.strings = c("NA", "data is missing", "not available", "the experimenter dropped the sample", "the experimenter was texting when the data should have been measured")
If there is character data other than NA signifying missing data in a column that should be read in as numeric data, and R is not "informed" about this, then R will read in that column as factor or character data, which can lead to "issues" that are best avoided by properly reading in the data.
Another option to be aware of if one is dealing with a file that has "non-standard for R" column names is check.names which if set equal TRUE (the default), then R will modify read in column names to conform with what R considers standard. That means blank spaces and many characters that are not letters will get replaced by a period. I find this rather annoying since I like to use long descriptive column headers in files I create, and as long as I am not having R use the column names other than to write them back out after I have done some analysis on the data, it is OK to instruct R to leave the column names alone (by setting check.names = FALSE).
Creating a data frame from smaller data frames and/or vectors and matrices: A. the data.frame function
Looking over the R help on data.frame, one sees that it can combine objects that are or can be converted to be data frames into one combined data frame. As with read.csv and read.table, one may well want to use the option stringsAsFactors = FALSE (unless one needs to have one or more factor columns). (The data frame function will do recycling on rows but I would recommend having the number of rows in objects being combined into a data frame all be the same.) Note the R rep (replicate) function can be used to replicate "patterns", for example
rep(c(1,2), times = 4) # repeat the pattern 4 times
[1] 1 2 1 2 1 2 1 2
# rep can also be used this way:
rep(c(1,2), each = 4)
[1] 1 1 1 1 2 2 2 2
df1 <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df1 # our continuing example data frame
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v1 <- c(1,2,3,4)
v2 <- c(TRUE, FALSE, TRUE, TRUE)
v3 <- c("a", "b", "c", "d")
m1 <- matrix(1:12, nrow = 4, ncol = 3)
m1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
df <- data.frame(df1, v1, v2, m1, v3, stringsAsFactors = FALSE)
df
sulfate nitrate v1 v2 X1 X2 X3 v3
1 4.79 0.299 1 TRUE 1 5 9 a
2 1.46 NA 2 FALSE 2 6 10 b
3 4.28 4.280 3 TRUE 3 7 11 c
4 NA 3.560 4 TRUE 4 8 12 d
# The row names of df are the row names of the first argument of data.frame, i.e.,
# the first of the objects being combined into one data frame
# If one wants to change some column names one could do, for example,
colnames(df)[c(5, 6, 7)] <- c("m1", "m2", "m3")
df
sulfate nitrate v1 v2 m1 m2 m3 v3
1 4.79 0.299 1 TRUE 1 5 9 a
2 1.46 NA 2 FALSE 2 6 10 b
3 4.28 4.280 3 TRUE 3 7 11 c
4 NA 3.560 4 TRUE 4 8 12 d
# and similarly with row names
rownames(df) <- c("r1", "r2", "r3", "r4")
df
sulfate nitrate v1 v2 m1 m2 m3 m4
r1 4.79 0.299 1 TRUE 1 5 9 a
r2 1.46 NA 2 FALSE 2 6 10 b
r3 4.28 4.280 3 TRUE 3 7 11 c
r4 NA 3.560 4 TRUE 4 8 12 d
If one has several data frames that have the same number of columns AND the same column names, then one can "stack them vertically" using the rbind (row bind) function. For example with 2 data frames df1 and df2
df1 <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df1
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
df2 <- data.frame(sulfate = c(24.79, 21.46, 24.28, NA), nitrate = c(2.299, NA, 2.280, 2.560))
df2
sulfate nitrate
1 24.79 2.299
2 21.46 NA
3 24.28 2.280
4 NA 2.560
# then one can do
df <- rbind(df1, df2)
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
5 24.79 2.299
6 21.46 NA
7 24.28 2.280
8 NA 2.560
# The column names for all the items being "rbinded" must be the same
# (except for an important exception described below).
# will get an error message if the column names don't match: for example below I have
# the column names in df2 not matching those in df1
colnames(df2)[2] <- "another.name"
df2
sulfate another.name
1 24.79 2.299
2 21.46 NA
3 24.28 2.280
4 NA 2.560
df <- rbind(df1, df2) # gets error message
Error in match.names(clabs, names(xi)) :
names do not match previous names
Creating a data frame from smaller data frames and/or vectors and matrices: C. Using rbind in a loop
In some circumstances one might be reading in or constructing a succession of data frames, each with the same number of columns and the same column names, and want to combine them vertically. One can do this in a loop if one initializes an empty data frame via df <- data.frame()
One can rbind any data.frame to this empty data frame; this is the exception to the rule on same number of columns and column names so then this "conceptual" for loop will work:
df <- data.frame() # initialize an empty data frame
for (i in some.set) {
read in or derive a data frame dfi corresponding to i (each dfi must have the same
number of columns and the same column names)
df <- rbind(df, dfi)
}
# after this loop the data frame df will consist of all the data frames dfi stacked vertically
The cbind (column bind) function can combine data frames or combine objects that are or can be converted to be data frames. cbind is the same as data.frame except that the default for cbind is check.names = FALSE
Hope this review was informative. The next set of exercises will get into practicing using and creating data frames.
= = = = = = = = = = = = = = = = = = = = = = = =
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. There is a full version of this license at this web site: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Some of the material above (Review of how to get specified subsets of a data frame: getting a single column) was taken/modified from a post of mine in the Discussion Forum for the R Programming Course in the Johns Hopkins Data Science Specialization on Coursera, as noted above. As such Coursera and Coursera authorized Partners retain additional rights to that material as described in their "Terms of Use" https://www.coursera.org/about/terms
Note the reader should not infer any endorsement or recommendation or approval for the material in this article from any of the sources or persons cited above or any other entities mentioned in this article.