This tutorial introduces basic techniques in data wrangling and visualization in R. Specifically, we will cover some basic tools using out-of-the-box R commands, then introduce the powerful framework of the "tidyverse" (both in wrangling and visualizing data), and finally gain some understanding of the philosophy of this framework to set up deeper exploration of our data. Throughout, we will be using a publicly available dataset of Airbnb listings.
This was originally presented as part of a month-long course in Software Tools for Optimization and Analytics given by the Operations Research Center at MIT. Here is a link to some of the other sessions and notes, if you're interested.
Before beginning, ensure you have RStudio installed. RStudio provides a graphical integrated development environment (IDE) for programming in R, and is free at the RStudio site. RStudio does not bundle R itself, so install R (which includes the base libraries) first if you do not already have it.
Next, make sure you have the course materials. The easiest way to get all the materials for this class is to clone this repository. On a Mac, you can open a Terminal, navigate to a directory of your choice, and run
$ git clone https://github.com/stmorse/intro-tidyverse.git
Another way is to simply download all the course material as a .zip file (in the "Clone or download" dropdown menu).
These materials are summarized in an easy-to-digest form in the online session notes.
The materials consist of a script (script.R) and corresponding exercises for each section (exercises.R). Perhaps the best way to self-teach this material is to open the session notes above in a browser window and the two R scripts in RStudio, then work your way through them, running the code yourself and flipping back and forth as necessary.
The R scripts with all code filled in are also provided, in script_full.R and exercises_solved.R.
(The master.Rmd and master.html files used for the online session notes can be ignored.)
The data is publicly available at Kaggle as the Boston Airbnb dataset, but we also provide it in this repository for convenience.
We will use three libraries for this session: tidyr, dplyr, and ggplot2. Before beginning, ensure that you have installed them and can load them into an R session in RStudio. You can install them by executing the following commands in the RStudio console:
install.packages('dplyr')
install.packages('tidyr')
install.packages('ggplot2')
Next, check that the libraries load by running
library(dplyr)
library(tidyr)
library(ggplot2)
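If you would rather confirm all three packages in one step, the following is a minimal sketch using base R's requireNamespace(), which reports whether a package can be loaded without attaching it. The variable names pkgs and available are illustrative and do not appear in the course scripts.

```r
# Packages this session relies on
pkgs <- c("dplyr", "tidyr", "ggplot2")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Report anything that still needs install.packages()
if (!all(available)) {
  message("Still missing: ", paste(pkgs[!available], collapse = ", "))
}
```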
Then test that dplyr/tidyr work by executing the command:
data.frame(name=c('Ann', 'Bob'), number=c(3.141, 2.718)) %>% gather(type, favorite, -name)
which should output something like this
  name   type favorite
1  Ann number    3.141
2  Bob number    2.718
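As a side note, in tidyr 1.0.0 and later gather() still works but is superseded by pivot_longer(). The sketch below shows what should be an equivalent reshaping of the same toy data frame; the variable names people and long are illustrative, and the result is printed as a tibble rather than a plain data frame.

```r
library(dplyr)
library(tidyr)

# Same toy data frame as in the gather() check above
people <- data.frame(name = c("Ann", "Bob"), number = c(3.141, 2.718))

# Stack every column except `name` into a key column ("type")
# and a value column ("favorite")
long <- people %>%
  pivot_longer(cols = -name, names_to = "type", values_to = "favorite")

print(long)
```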
Finally, test that ggplot works by executing the command
data.frame(x=rnorm(1000), y=rnorm(1000)) %>% ggplot(aes(x,y)) + geom_point()
which should produce a cloud of points centered around the origin.
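If the plot pane is unavailable, or you want a file to inspect, one option is to store the plot in a variable and write it out with ggsave(). This is only a sketch: the seed, the file name test_plot.png, and the dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the random cloud is reproducible

# Build the same kind of scatter plot, but keep it in a variable
p <- ggplot(data.frame(x = rnorm(1000), y = rnorm(1000)), aes(x, y)) +
  geom_point()

# Write the plot to a temporary PNG file (hypothetical file name)
out_file <- file.path(tempdir(), "test_plot.png")
ggsave(out_file, plot = p, width = 5, height = 4)

# TRUE if the file was written
file.exists(out_file)
```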
Now you're ready to begin!
dplyr and tidyr are well-established packages within the R community, and there are many resources to use for reference and further learning. Some of our favorites are below.
- Tutorials by Hadley Wickham for dplyr basics, advanced grouped operations, and database interface.
- Third-party tutorial (including docs and a video) for using dplyr
- Principles and practice of tidy data using tidyr
- (Detailed) cheatsheet for dplyr and tidyr
- A useful cheatsheet for dplyr joins
- Comparative discussion of dplyr and data.table, an alternative package with higher performance but more challenging syntax.
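For a quick reminder of the kind of grouped operation these references cover, here is a minimal dplyr sketch. The data frame listings_toy and its columns are made up for illustration and are not the Airbnb data used in the session.

```r
library(dplyr)

# Hypothetical toy listings, invented for this example
listings_toy <- data.frame(
  neighbourhood = c("A", "A", "B", "B", "B"),
  price = c(100, 150, 80, 120, 100)
)

# Grouped summary: number of listings and mean price per neighbourhood,
# sorted from most to least expensive
summary_toy <- listings_toy %>%
  group_by(neighbourhood) %>%
  summarise(n_listings = n(), mean_price = mean(price)) %>%
  arrange(desc(mean_price))

print(summary_toy)
```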
Some of the infinitude of visualization subjects we did not cover are heatmaps and 2D histograms, statistical functions, plot insets, and more. And even within the Tidyverse, don't feel you need to limit yourself to ggplot.
Here's a good overview of some 2D histogram techniques, a discussion on overlaying a normal curve over a histogram, and a workaround to fit multiple plots in one giant chart.
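As a starting point for two of these topics, the sketch below draws a 2D histogram with geom_bin2d() and overlays a normal curve on a density-scaled histogram with stat_function(). It is one possible approach rather than the method from the linked discussions; the simulated data, seed, and bin counts are arbitrary, and after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed for reproducibility
df <- data.frame(x = rnorm(1000), y = rnorm(1000))

# 2D histogram: number of points per rectangular bin, shown as fill colour
p_2d <- ggplot(df, aes(x, y)) +
  geom_bin2d(bins = 30)

# Histogram on the density scale with a normal curve overlaid,
# using the sample mean and standard deviation as the curve's parameters
p_norm <- ggplot(df, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))

print(p_2d)
print(p_norm)
```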
For other datasets and applications, one place to start is data hosting and competition websites like Kaggle, and there are many areas like sports analytics, political forecasting, historical analysis, and countless others that have clean, open, and interesting data just waiting for you to read.csv.
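A first look at a new CSV might go something like the sketch below. To keep it self-contained, it writes a tiny made-up file (toy_listings.csv, a hypothetical name) to a temporary directory and reads it back; with a real dataset you would point read.csv at the downloaded file instead.

```r
# Write a tiny hypothetical CSV to a temporary file
csv_path <- file.path(tempdir(), "toy_listings.csv")
write.csv(
  data.frame(id = 1:3, price = c(100, 85, 240)),
  csv_path,
  row.names = FALSE
)

# Read it back into a data frame and take a first look
toy <- read.csv(csv_path, stringsAsFactors = FALSE)
str(toy)            # column names and types
summary(toy$price)  # quick numeric summary of one column
```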