forked from hadley/tidy-data
-
Notifications
You must be signed in to change notification settings - Fork 0
/
tidy-data.tex
571 lines (394 loc) · 51.7 KB
/
tidy-data.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
\documentclass[article]{jss}
\usepackage[utf8]{inputenc}
% jss.cls ----------------------------------------------------------------------
%% almost as usual
\author{Hadley Wickham\\RStudio}
\title{Tidy Data}
\Plainauthor{Hadley Wickham}
\Plaintitle{Tidy Data}
%% an abstract and keywords
\Abstract{
A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
}
\Keywords{data cleaning, data tidying, relational databases, \proglang{R}}
\Plainkeywords{data cleaning, data tidying, relational databases, R}
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}
\Address{
Hadley Wickham\\
Chief Scientist, RStudio\\
Adjunct Professor, Rice University\\
E-mail: \email{[email protected]}\\
URL: \url{http://had.co.nz}
}
% jss.cls ----------------------------------------------------------------------
\DeclareGraphicsExtensions{.png,.pdf,.jpg}
\usepackage[format=plain,font=small]{caption}
\usepackage{booktabs}
\usepackage{fancyvrb}
\renewcommand{\FancyVerbFormatLine}[1]{R> #1}
\usepackage{epigraph}
\usepackage[subrefformat=parens]{subfig}
\begin{document}
\maketitle
\section{Introduction}
It is often said that 80\% of data analysis is spent on the process of cleaning and preparing the data \citep{dasu:2003}. Data preparation is not just a first step, but must be repeated many over the course of analysis as new problems come to light or new data is collected. Despite the amount of time it takes, there has been surprisingly little research on how to clean data well. Part of the challenge is the breadth of activities it encompasses: from outlier checking, to date parsing, to missing value imputation. To get a handle on the problem, this paper focusses on a small, but important, aspect of data cleaning that I call data \textbf{tidying}: structuring datasets to facilitate analysis.
The principles of tidy data provide a standard way to organise data values within a dataset. A standard makes initial data cleaning easier because you don't need to start from scratch and reinvent the wheel every time. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together. Current tools often require translation. You have to spend time munging the output from one tool so you can input it into another. Tidy datasets and tidy tools work hand in hand to make data analysis easier, allowing you to focus on the interesting domain problem, not on the uninteresting logistics of data.
The principles of tidy data are closely tied to those of relational databases and Codd's relational algebra \citep{codd:1990}, but are framed in a language familiar to statisticians. Computer scientists have also contributed much to the study of data cleaning. For example, \citet{lakshmanan:1996} define an extension to SQL to allow it to operate on messy datasets, \citet{raman:2001} provide a framework for cleaning datasets, and \citet{kandel:2011} develop an interactive tool with a friendly user interface which automatically creates code to clean data. These tools are useful but they are presented in a language foreign to most statisticians, they fail to give much advice on how datasets should be structured, and they lack connections to the tools of data analysis.
The development of tidy data has been driven by my experience working with real-world datasets. With few, if any, constraints on their organisation, such datasets are often constructed in bizarre ways. I have spent countless hours struggling to get such datasets organised in a way that makes data analysis possible, let alone easy. I have also struggled to impart these skills to my students so they could tackle real-world datasets on their own. In the course of these struggles I developed the \pkg{reshape} and \pkg{reshape2} \citep{wickham:2007b} packages. While I could intuitively use the tools and teach them through examples, I lacked the framework to make my intuition explicit. This paper provides that framework. It provides a comprehensive ``philosophy of data'': one that underlies my work in the \pkg{plyr} \citep{me:plyr} and \pkg{ggplot2} \citep{me:ggplot2} packages.
The paper proceeds as follows. Section~\ref{sec:defining} begins by defining the three characteristics that make a dataset tidy. Since most real world datasets are not tidy, Section~\ref{sec:tidying} describes the operations needed to make messy datasets tidy, and illustrates the techniques with a range of real examples. Section~\ref{sec:tidy-tools} defines tidy tools, tools that input and output tidy datasets, and discusses how tidy data and tidy tools together can make data analysis easier. These principles are illustrated with a small case study in Section~\ref{sec:case-study}. Section~\ref{sec:discussion} concludes with a discussion of what this framework misses and what other approaches might be fruitful to pursue.
\section{Defining tidy data}
\label{sec:defining}
\epigraph{Happy families are all alike; every unhappy family is unhappy in its own way}{Leo Tolstoy}
Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). In this section, I'll provide some standard vocabulary for describing the structure and semantics of a dataset, and then use those definitions to define tidy data.
\subsection{Data structure}
Most statistical datasets are rectangular tables made up of \textbf{rows} and \textbf{columns}. The columns are almost always labelled and the rows are sometimes labelled. Table~\ref{tbl:preg-raw-1} provides some data about an imaginary experiment in a format commonly seen in the wild. The table has two columns and three rows, and both rows and columns are labelled.
\begin{table}[htbp]
\centering
\input{data/preg-raw-1.tex}
\caption{Typical presentation dataset.}
\label{tbl:preg-raw-1}
\end{table}
There are many ways to structure the same underlying data. Table~\ref{tbl:preg-raw-2} shows the same data as Table~\ref{tbl:preg-raw-1}, but the rows and columns have been transposed. The data is the same, but the layout is different. Our vocabulary of rows and columns is simply not rich enough to describe why the two tables represent the same data. In addition to appearance, we need a way to describe the underlying semantics, or meaning, of the values displayed in table.
\begin{table}[htbp]
\centering
\input{data/preg-raw-2.tex}
\caption{The same data as in Table~\ref{tbl:preg-raw-1} but structured differently.}
\label{tbl:preg-raw-2}
\end{table}
\subsection{Data semantics}
A dataset is a collection of \textbf{values}, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a \textbf{variable} and an \textbf{observation}. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Table~\ref{tbl:preg-tidy} reorganises Table~\ref{tbl:preg-raw-1} to make the values, variables and obserations more clear. The dataset contains 18 values representing three variables and six observations. The variables are:
\begin{enumerate}
\item \code{person}, with three possible values (John, Mary, and Jane).
\item \code{treatment}, with two possible values (a and b).
\item \code{result}, with five or six values depending on how you think of the missing value (-, 16, 3, 2, 11, 1).
\end{enumerate}
The experimental design tells us more about the structure of the observations. In this experiment, every combination of of \code{person} and \code{treatment} was measured, a completely crossed design. The experimental design also determines whether or not missing values can be safely dropped. In this experiment, the missing value represents an observation that should have been made, but wasn't, so it's important to keep it. Structural missing values, which represent measurements that can't be made (e.g., the count of pregnant males) can be safely removed.
\begin{table}[htbp]
\centering
\input{data/preg-tidy.tex}
\caption{The same data as in Table~\ref{tbl:preg-raw-1} but with variables in columns and observations in rows.}
\label{tbl:preg-tidy}
\end{table}
For a given dataset, it's usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general. For example, if the columns in the Table~\ref{tbl:preg-raw-1} were \code{height} and \code{weight} we would have been happy to call them variables. If the columns were \code{height} and \code{width}, it would be less clear cut, as we might think of height and width as values of a \code{dimension} variable. If the columns were \code{home phone} and \code{work phone}, we could treat these as two variables, but in a fraud detection environment we might want variables \code{phone number} and \code{number type} because the use of one phone number for multiple people might suggest fraud. A general rule of thumb is that it is easier to describe functional relationships between variables (e.g., \code{z} is a linear combination of \code{x} and \code{y}, \code{density} is the ratio of \code{weight} to \code{volume}) than between rows, and it is easier to make comparisons between groups of observations (e.g., average of group a vs.\ average of group b) than between groups of columns.
In a given analysis, there may be multiple levels of observation. For example, in a trial of new allergy medication we might have three observational types: demographic data collected from each person (\code{age}, \code{sex}, \code{race}), medical data collected from each person on each day (\code{number of sneezes}, \code{redness of eyes}), and meterological data collected on each day (\code{temperature}, \code{pollen count}).
\subsection{Tidy data}
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In \textbf{tidy data}:
\begin{enumerate}
\item Each variable forms a column.
\item Each observation forms a row.
\item Each type of observational unit forms a table.
\end{enumerate}
\noindent This is Codd's 3rd normal form \citep{codd:1990}, but with the constraints framed in statistical language, and the focus put on a single dataset rather than the many connected datasets common in relational databases. \textbf{Messy data} is any other other arrangement of the data.
Table~\ref{tbl:preg-tidy} is the tidy version of Table~\ref{tbl:preg-raw-1}. Each row represents an observation, the \code{result} of one \code{treatment} on one \code{person}, and each column is a variable.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. Compare Table~\ref{tbl:preg-tidy} to Table~\ref{tbl:preg-raw-1}: in Table~\ref{tbl:preg-raw-1} you need to use different strategies to extract different variables. This slows analysis and invites errors. If you consider how many data analysis operations involve all of the values in a variable (every aggregation function), you can see how important it is to extract these values in a simple, standard way. Tidy data is particularly well suited for vectorised programming languages like \proglang{R}, because the layout ensures that values of different variables from the same observation are always paired.
While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values. One way of organising variables is by their role in the analysis: are values fixed by the design of the data collection, or are they measured during the course of the experiment? Fixed variables describe the experimental design and are known in advance. Computer scientists often call fixed variables dimensions, and statisticians usually denote them with subscripts on random variables. Measured variables are what we actually measure in the study. Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables. This is the convention adopted by all tabular displays in this paper.
\section{Tidying messy datasets}
\label{sec:tidying}
Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analysing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:
\begin{itemize}
\item Column headers are values, not variable names.
\item Multiple variables are stored in one column.
\item Variables are stored in both rows and columns.
\item Multiple types of observational units are stored in the same table.
\item A single observational unit is stored in multiple tables.
\end{itemize}
Surprisingly, most messy datasets, including types of messiness not explicitly described above, can be tidied with a small set of tools: melting, string splitting, and casting. The following sections illustrate each problem with a real dataset that I have encountered, and show how to tidy them. The complete datasets and the \proglang{R} code used to tidy them are available online at \url{https://github.com/hadley/tidy-data}, and in the online supplementary materials for this paper.
\subsection{Column headers are values, not variable names}
A common type of messy dataset is tabular data designed for presentation, where variables form both the rows and columns, and column headers are values, not variable names. While I would call this arrangement messy, in some cases it can be extremely useful. It provides efficient storage for completely crossed designs, and it can lead to extremely efficient computation if desired operations can be expressed as matrix operations. This issue is discussed in depth in Section~\ref{sec:discussion}.
Table~\ref{tbl:pew-raw} shows a subset of a typical dataset of this form. This dataset explores the relationship between income and religion in the US. It comes from a report\footnote{\url{http://religions.pewforum.org/pdf/comparison-Income\%20Distribution\%20of\%20Religious\%20Traditions.pdf}} produced by the Pew Research Center, an American think-tank that collects data on attitudes to topics ranging from religion to the internet, and produces many reports that contain datasets in this format.
\begin{table}[htbp]
\centering
\input{data/pew-raw.tex}
\caption{The first ten rows of data on income and religion from the Pew Forum. Three columns, \code{\$75-100k}, \code{\$100-150k} and \code{>150k}, have been omitted}
\label{tbl:pew-raw}
\end{table}
This dataset has three variables, \code{religion}, \code{income} and \code{frequency}. To tidy it, we need to \textbf{melt}, or stack it. In other words, we need to turn columns into rows. While this is often described as making wide datasets long or tall, I will avoid those terms because they are imprecise. Melting is parameterised by a list of columns that are already variables, or \textbf{colvar}s for short. The other columns are converted into two variables: a new variable called \code{column} that contains repeated column headings and a new variable called \code{value} that contains the concatenated data values from the previously separate columns. This is illustrated in Table~\ref{tbl:melt} with a toy dataset. The result of melting is a \textbf{molten} dataset.
\begin{table}
\centering
\subfloat[Raw data]{\label{tbl:melt-raw} \input{data/melt-raw.tex}}%
\hspace{2em}%
\subfloat[Molten data]{\label{tbl:melt-molten}\input{data/melt-output.tex}}
\caption{A simple example of melting. (a) is melted with one colvar, row, yielding the molten dataset (b). The information in each table is exactly the same, just stored in a different way.}
\label{tbl:melt}
\end{table}
The Pew dataset has one colvar, \code{religion}, and melting yields Table~\ref{tbl:pew-clean}. To better reflect their roles in this dataset, the \code{variable} column has been renamed to \code{income}, and the \code{value} column to \code{freq}. This form is tidy because each column represents a variable and each row represents an observation, in this case a demographic unit corresponding to a combination of \code{religion} and \code{income}.
\begin{table}[htbp]
\centering
\input{data/pew-clean.tex}
\caption{The first ten rows of the tidied Pew survey dataset on income and religion. The \code{column} has been renamed to \code{income}, and \code{value} to \code{freq}.}
\label{tbl:pew-clean}
\end{table}
Another common use of this data format is to record regularly spaced observations over time. For example, the Billboard dataset shown in Table~\ref{tbl:billboard-raw} records the date a song first entered the Billboard Top 100. It has variables for \code{artist}, \code{track}, \code{date.entered}, \code{rank} and \code{week}. The rank in each week after it enters the top 100 is recorded in 75 columns, \code{wk1} to \code{wk75}. If a song is in the Top 100 for less than 75 weeks the remaining columns are filled with missing values. This form of storage is not tidy, but it is useful for data entry. It reduces duplication since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated. This issue will be discussed in more depth in Section~\ref{sub:multiple-types}.
\begin{table}[htbp]
\centering
\input{data/billboard-raw.tex}
\caption{The first eight Billboard top hits for 2000. Other columns not shown are \code{wk4}, \code{wk5}, ..., \code{wk75}.}
\label{tbl:billboard-raw}
\end{table}
This dataset has colvars \code{year}, \code{artist}, \code{track}, \code{time}, and \code{date.entered}. Melting yields Table~\ref{tbl:billboard-clean}. I have also done a little cleaning as well as tidying: \code{column} has been converted to \code{week} by extracting the number, and \code{date} has been computed from \code{date.entered} and \code{week}.
\begin{table}[htbp]
\centering
\input{data/billboard-clean.tex}
\caption{First fifteen rows of the tidied billboard dataset. The \code{date} column does not appear in the original table, but can be computed from \code{date.entered} and \code{week}.}
\label{tbl:billboard-clean}
\end{table}
\subsection{Multiple variables stored in one column}
After melting, the \code{column} variable names often becomes a combination of multiple underlying variable names. This is illustrated by the tuberculosis (TB) dataset, a sample of which is shown in Table~\ref{tbl:tb-raw}. This dataset comes from the World Health Organisation, and records the counts of confirmed tuberculosis cases by \code{country}, \code{year}, and demographic group. The demographic groups are broken down by \code{sex} (m, f) and \code{age} (0--14, 15--25, 25--34, 35--44, 45--54, 55--64, unknown).
\begin{table}[htbp]
\centering
\input{data/tb-raw.tex}
\caption{Original TB dataset. Corresponding to each `m' column for males, there is also an `f' column for females, \code{f1524}, \code{f2534} and so on. These are not shown to conserve space. Note the mixture of 0s and missing values (---). This is due to the data collection process and the distinction is important for this dataset.}
\label{tbl:tb-raw}
\end{table}
Column headers in this format are often separated by some character (\code{.}, \code{-}, \code{\_}, \code{:}). While the string can be broken into pieces using that character as a divider, in other cases, such as for this dataset, more careful string processing is required. For example, the variable names can be matched to a lookup table that converts single compound value into multiple component values.
Table~\subref*{tbl:tb-molten} shows the results of melting the TB dataset, and Table~\subref*{tbl:tb-tidy} shows the results of splitting the single column \code{column} into two real variables: \code{age} and \code{sex}.
\begin{table}[htbp]
\centering
\subfloat[Molten data]{\label{tbl:tb-molten}\input{data/tb-clean-1.tex}}%
\hspace{2em}%
\subfloat[Tidy data]{\label{tbl:tb-tidy}\input{data/tb-clean-2.tex}}
\caption{Tidying the TB dataset requires first melting, and then splitting the \code{column} column into two variables: \code{sex} and \code{age}.}
\label{tbl:tb-clean}
\end{table}
Storing the values in this form resolves another problem in the original data. We want to compare rates, not counts. But to compute rates, we need to know the population. In the original format, there is no easy way to add a population variable. It has to be stored in a separate table, which makes it hard to correctly match populations to counts. In tidy form, adding variables for population and rate is easy. They are just additional columns.
\subsection{Variables are stored in both rows and columns}
The most complicated form of messy data occurs when variables are stored in both rows and columns. Table~\ref{tbl:weather-raw} shows daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for five months in 2010. It has variables in individual columns (\code{id}, \code{year}, \code{month}), spread across columns (\code{day}, d1--d31) and across rows (\code{tmin}, \code{tmax}) (minimum and maximum temperature). Months with less than 31 days have structural missing values for the last day(s) of the month. The \code{element} column is not a variable; it stores the names of variables.
To tidy this dataset we first melt it with colvars \code{id}, \code{year}, \code{month} and the column that contains variable names, \code{element}. This yields Table~\subref*{tbl:weather-molten}. For presentation, we have dropped the missing values, making them implicit rather than explicit. This is permissible because we know how many days are in each month and can easily reconstruct the explicit missing values.
This dataset is mostly tidy, but we have two variables stored in rows: \code{tmin} and \code{tmax}, the type of observation. Not shown in this example are the other meteorological variables \code{prcp} (precipitation) and \code{snow} (snowfall). Fixing this requires the cast, or unstack, operation. This performs the inverse of melting by rotating the \code{element} variable back out into the columns (Table~\subref*{tbl:weather-tidy}). This form is tidy. There is one variable in each column, and each row represents a day's observations. The cast operation is described in depth in \citet{wickham:2007b}.
\begin{table}[htbp]
\centering
\input{data/weather-raw.tex}
\caption{Original weather dataset. There is a column for each possible day in the month. Columns \code{d9} to \code{d31} have been omitted to conserve space.}
\label{tbl:weather-raw}
\end{table}
\begin{table}[htbp]
\centering
\subfloat[Molten data]%
{\label{tbl:weather-molten}\input{data/weather-clean-1.tex}}%
\hspace{2em}%
\subfloat[Tidy data]%
{\label{tbl:weather-tidy}\input{data/weather-clean-2.tex}}%
\caption{(a) Molten weather dataset. This is almost tidy, but instead of values, the \code{element} column contains names of variables. Missing values are dropped to conserve space. (b) Tidy weather dataset. Each row represents the meteorological measurements for a single day. There are two measured variables, minimum (\code{tmin}) and maximum (\code{tmax}) temperature; all other variables are fixed.}
\label{tbl:weather-clean}
\end{table}
\subsection{Multiple types in one table}
\label{sub:multiple-types}
Datasets often involve values collected at multiple levels, on different types of observational units. During tidying, each type of observational unit should be stored in its own table. This is closely related to the idea of database normalisation, where each fact is expressed in only one place. If this is not done, it's possible for inconsistencies to occur.
%If you're not familiar with normalisation, it can be worthwhile to learn a little about it. There are many good tutorials available online - I found \url{http://phlonx.com/resources/nf3/} after a few minutes of searching. You certainly don't need to become an expert, as most statistical databases only need a small amount of normalisation, but it is extremely helpful for identifying inconsistencies in your data.
The Billboard dataset described in Table~\ref{tbl:billboard-clean} actually contains observations on two types of observational units: the song and its rank in each week. This manifests itself through the duplication of facts about the song: \code{artist} and \code{time} are repeated for every song in each week. The billboard dataset needs to be broken down into two datasets: a song dataset which stores \code{artist}, \code{song name} and \code{time}, and a ranking dataset which gives the \code{rank} of the \code{song} in each \code{week}. Table~\ref{tbl:billboard-normal} shows these two datasets. You could also imagine a week dataset which would record background information about the week, maybe the total number of songs sold or similar demographic information.
\begin{table}
\centering
\input{data/billboard-song.tex}\hspace{1em}%
\input{data/billboard-rank.tex}
\caption{Normalised billboard dataset split up into song dataset (left) and rank dataset (right). First 15 rows of each dataset shown; \code{genre} omitted from song dataset, \code{week} omitted from rank dataset.}
\label{tbl:billboard-normal}
\end{table}
Normalisation is useful for tidying and eliminating inconsistencies. However, there are few data analysis tools that work directly with relational data, so analysis usually also requires denormalisation or the merging the datasets back into one table.
% Multiple-choice/check all that apply data
\subsection{One type in multiple tables}
It's also common to find data values about a single type of observational unit spread out over multiple tables or files. These tables and files are often split up by another variable, so that each represents a single year, person, or location. As long as the format for individual records is consistent, this is an easy problem to fix:
\begin{enumerate}
\item Read the files into a list of tables.
\item For each table, add a new column that records the original file name (because the file name is often the value of an important variable).
\item Combine all tables into a single table.
\end{enumerate}
The \pkg{plyr} package makes this a straightforward task in \proglang{R}. The following code generates a vector of file names in a directory (\code{data/}) which match a regular expression (ends in \code{.csv}). Next we name each element of the vector with the name of the file. We do this because \pkg{plyr} will preserve the names in the following step, ensuring that each row in the final data frame is labelled with its source. Finally, \code{ldply()} loops over each path, reading in the csv file and combining the results into a single data frame.
\begin{Verbatim}
paths <- dir("data", pattern = "\\.csv$", full.names = TRUE)
names(paths) <- basename(paths)
ldply(paths, read.csv, stringsAsFactors = FALSE)
\end{Verbatim}
Once you have a single table, you can perform additional tidying as needed. An example of this type of cleaning can be found at \url{https://github.com/hadley/data-baby-names} which takes 129 yearly baby name tables provided by the US Social Security Administration and combines them into a single file.
A more complicated situation occurs when the dataset structure changes over time. For example, the datasets may contain different variables, the same variables with different names, different file formats, or different conventions for missing values. This may require you to tidy each file to individually (or, if you're lucky, in small groups) and then combine them once tidied. An example of this type of tidying is illustrated in \url{https://github.com/hadley/data-fuel-economy}, which shows the tidying of {\sc epa} fuel economy data for over 50,000 cars from 1978 to 2008. The raw data is available online, but each year is stored in a separate file and there are four major formats with many minor variations, making tidying this dataset a considerable challenge.
\section{Tidy tools}
\label{sec:tidy-tools}
Once you have a tidy dataset, what can you do with it? Tidy data is only worthwhile if it makes analysis easier. This section discusses tidy tools, tools that take tidy datasets as input and return tidy datasets as output. Tidy tools are useful because the output of one tool can be used as the input to another. This allows you to simply and easily compose multiple tools to solve a problem. Tidy data also ensures that variables are stored in a consistent, explicit manner. This makes each tool simpler, because it doesn't need a Swiss Army knife of parameters for dealing with different dataset structures.
Tools can be messy for two reasons: either they take messy datasets as input (messy-input tools) or they produce messy datasets as output (messy-output tools). Messy-input tools are typically more complicated than tidy-input tools because they need to include some parts of the tidying process. This can be useful for common types of messy datasets, but it typically makes the function more complex, harder to use and harder to maintain. Messy-output tools are frustrating and slow down analysis because they can not be easily composed and you must constantly think about how to convert from one format to another. We'll see examples of both in the following sections.
Next, I give examples of tidy and messy tools for three important components of analysis: data manipulation, visualisation and modelling. I will focus particularly on tools provided by \proglang{R} \citep{R}, because it has many existing tidy tools, but I will also touch on other statistical programming environments.
\subsection{Manipulation}
Data manipulation includes variable-by-variable transformation (e.g., \code{log} or \code{sqrt}), as well as aggregation, filtering and reordering. In my experience, these are the four fundamental verbs of data manipulation:
\begin{itemize}
\item Filter: subsetting or removing observations based on some
condition.
\item Transform: adding or modifying variables. These modifications can
involve either a single variable (e.g., log-transformation), or multiple
variables (e.g., computing density from weight and volume).
\item Aggregate: collapsing multiple values into a single value (e.g., by
summing or taking means).
\item Sort: changing the order of observations.
\end{itemize}
All these operations are made easier when there is a consistent way to refer to variables. Tidy data provides this because each variable resides in its own column.
In \proglang{R}, filtering and transforming are performed by the base \proglang{R} functions \code{subset()} and \code{transform()}. These are input and output-tidy. The \code{aggregate()} function performs group-wise aggregation. It is input-tidy. Provided that a single aggregation method is used, it is also output-tidy . The \pkg{plyr} package provides tidy \code{summarise()} and \code{arrange()} functions for aggregation and sorting.
The four verbs can be, and often are, modified by the ``by'' preposition. We often need group-wise aggregates, transformations and subsets, to pick the biggest in each group, to average over replicates and so on. Combining each of the four verbs with a by operator allows them to operate on subsets of a data frame at a time. Most \proglang{SAS} {\sc proc}s possess a {\sc by} statement which allows the operation to be performed by group, and are generally input-tidy. Base \proglang{R} possesses a \code{by()} function, which is input-tidy, but not output-tidy, because it produces a list. The \code{ddply()} function from the \pkg{plyr} package is a tidy alternative.
% Some aggregations occur so frequently they deserve their own optimised implementations. One such operation is (weighted) counting. Base R provides the {\tt table} function for this, but it is not output-tidy: it returns a multidimensional array. An tidy alternative is the {\tt count} function from {\tt plyr}, which returns a tidy dataset with a column for each of the input variables plus a new variable {\tt freq}, which records the number of records in each category.
Other tools are needed when we have multiple datasets. An advantage of tidy data is the ease with which it can be combined with other tidy datasets. All that is needed is a join operator that works by matching common variables and adding new columns. This is implemented in the \code{merge()} function in base \proglang{R}, or the \code{join()} function in \pkg{plyr}. Compare these operators with the difficulty of combining datasets stored in arrays. This task typically requires painstaking alignment before matrix operations can be used, which can can make errors very hard to detect.
\subsection{Visualisation}
Tidy visualisation tools only need to be input-tidy as their output is visual. Domain specific languages work particularly well for the visualisation of tidy datasets because they can describe a visualisation as a mapping between variables and aesthetic properties of the graph (e.g., position, size, shape and colour). This is the idea behind the grammar of graphics \citep{wilkinson:2006}, and the layered grammar of graphics \citep{wickham:2007d}, an extension tailored specifically for \proglang{R}.
Most graphical tools in \proglang{R} are input-tidy, including the \pkg{base} \code{plot()} function, the \pkg{lattice} family of plots \citep{sarkar:2008} and \pkg{ggplot2} \citep{me:ggplot2}. Some specialised tools exist for visualising messy datasets. Some base \proglang{R} functions like \code{barplot()}, \code{matplot()}, \code{dotchart()}, and \code{mosaicplot()}, work with messy datasets where variables are spread out over multiple columns. Similarly, the parallel coordinates plot \citep{wegman:1990,inselberg:1985} can be used to create time series plots for messy datasets where each time point is a column.
\subsection{Modelling}
\label{sub:modelling}
Modelling is the driving inspiration of this work because most modelling tools work best with tidy datasets. Every statistical language has a way of describing a model as a connection among different variables, a domain specific language that connects responses to predictors:
\begin{itemize}
\item \proglang{R} (\code{lm()}): \code{y ~ a + b + c * d}.
\item \proglang{SAS} (\code{PROC GLM}): \code{y = a + b + c + d + c * d}.
\item \proglang{SPSS} (\code{glm}): \code{y BY a b c d / DESIGN a b c d c * d}.
\item \proglang{Stata} (\code{regress}): \code{y a b c\#d}.
\end{itemize}
This is not to say that tidy data is the format used internally to compute the regression. Significant transformations take place to produce a numeric matrix that can easily be fed to standard linear algebra routines. Common transformations include adding an intercept column (a column of ones), turning categorical variables into multiple binary dummy variables, and projecting data onto the appropriate basis of a spline function.
% http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_glm_sect022.htm
There have been some attempts to adapt modelling functions for specific types of messy datasets. For example, in \proglang{SAS}'s \code{proc glm}, multiple variables on the response side of the equation will be interpreted as repeated measures if the {\sc repeated} keyword is present. For the raw Billboard data, we could construct a model of the form \code{wk1-wk76 = track} instead of \code{rank = week * track} on the tidy data.
Another interesting example is the classic paired t-test, which can be computed in two ways depending on how the data is stored. If the data is stored as in Table~\subref*{tbl:paired}, then a paired t-test is just a t-test applied to the mean difference between x and y. If it is stored in the form of Table~\subref*{tbl:mixed}, then an equivalent result can be produced by fitting a mixed effects model, with a fixed dummy variable representing the \code{variable}, and a random intercept for each id. (In \proglang{R}'s lmer4 notation, this is expressed as \code{value ~ variable + (1 | id)}). Either way of modelling the data yields the same result. Without more information we can't say which form of the data is tidy: if x and y represent length of left and right arms, then Table~\subref*{tbl:paired} is tidy, if x and y represent measurements on day 1 and day 10, then Table~\subref*{tbl:mixed} is tidy.
\begin{table}
\centering
\subfloat[Data for paired t-test]{
\label{tbl:paired}
\input{model-1.tex}
}%
\hspace{2em}%
\subfloat[Data for mixed effects model]{
\label{tbl:mixed}
\input{model-2.tex}
}
\caption{Two data sets for performing the same test.}
\label{label}
\end{table}
While model inputs usually require tidy inputs, such attention to detail doesn't carry over to model outputs. Outputs such as predictions and estimated coefficients aren't always tidy. This makes it more difficult to combine results from multiple models. For example, in \proglang{R}, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate, they are instead recorded as row names. In \proglang{R}, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I'm not currently aware of any packages that resolve this problem.
\section{Case study}
\label{sec:case-study}
The following case study illustrates how tidy data and tidy tools make data analysis easier by easing the transitions between manipulation, visualisation and modelling. You will not see any code that exists solely to get the output of one function into the right format to input to another. I'll show the \proglang{R} code that performs the analysis, but even if you're not familiar with \proglang{R} or the exact idioms I use, I've tried to make it easy to understand by tightly interleaving code, results and explanation.
The case study uses individual-level mortality data from Mexico. The goal is to find causes of death with unusual temporal patterns within a day. Figure~\ref{fig:overall} shows the temporal pattern, the number of deaths per hour, for all causes of death. My goal is to find the diseases that differ most from this pattern.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.65\linewidth]{case-study/overall}
\caption{Temporal pattern of all causes of death.}
\label{fig:overall}
\end{figure}
The full dataset has information on 539,530 deaths in Mexico in 2008 and 55 variables, including the the location and time of death, the cause of death, and demographics of the deceased. Table~\ref{fig:raw} shows a small sample of the dataset, focussing on variables related to time of death (\code{year}, \code{month}, \code{day} and \code{hour}), and cause of death (\code{cod}).
\begin{table}
\centering
\input{case-study/raw.tex}
\caption{A sample of 16 rows and 5 columns from the original dataset of 539,530 rows and 55 columns.}
\label{fig:raw}
\end{table}
To achieve our goal of finding unusual temporal patterns, we do the following. First, we count the number of deaths in each hour (\code{hod}) for each cause (\code{cod}) with the tidy \code{count} function.
\begin{Verbatim}
hod2 <- count(deaths, c("hod", "cod"))
\end{Verbatim}
Then we remove missing (and hence uninformative for our purpose) values with \code{subset}.
\begin{Verbatim}
hod2 <- subset(hod2, !is.na(hod))
\end{Verbatim}
This gives Table~\subref*{tbl:counts:1}. To provide informative labels for disease, we next join the dataset to the \code{codes} dataset, connected by the \code{cod} variable. This adds a new variable, \code{disease}, shown in Table~\subref*{tbl:counts:2}.
\begin{Verbatim}
hod2 <- join(hod2, codes, by = "cod")
\end{Verbatim}
The total deaths for each cause varies over several orders of magnitude: there are 46,794 deaths from heart attack but only 10 from avalanche. This means that rather than the total number, it makes more sense to compare the proportion of deaths in each hour. We compute this by breaking the dataset down by \code{cod}, and then \code{transform()}ing to add a new \code{prop} column, the hourly frequency divided by the total number of deaths from that cause. This new column is Table~\subref*{tbl:counts:3}.
\code{ddply()} breaks down the first argument (\code{hod2}) by its second (the \code{cod} variable), and then applies the third argument (\code{transform}) to each resulting piece. The fourth argument (\code{prop = freq / sum(freq)}) is then passed on to \code{transform()}.
\begin{Verbatim}
hod2 <- ddply(hod2, "cod", transform, prop = freq / sum(freq))
\end{Verbatim}
We then compute the overall average death rate for each hour, and merge that back into the original dataset. This yields Table~\subref*{tbl:counts:4}. By comparing \code{prop} to \code{prop\_all}, we can easily compare each disease with the overall incidence pattern.
First, we work out the number of people dying each hour. We break down \code{hod2} by \code{hod}, and compute the total for each cause of death.
\begin{Verbatim}
overall <- ddply(hod2, "hod", summarise, freq_all = sum(freq))
\end{Verbatim}
Next, we work out the overall proportion of people dying each hour:
\begin{Verbatim}
overall <- transform(overall, prop_all = freq_all / sum(freq_all))
\end{Verbatim}
Finally, we join the overall dataset with the individual dataset to make it easier to compare the two:
\begin{Verbatim}
hod2 <- join(hod2, overall, by = "hod")
\end{Verbatim}
\begin{table}[htbp]
\centering
\subfloat[]{
\label{tbl:counts:1}\input{case-study/counts.tex}
}%
\subfloat[]{
\label{tbl:counts:2}\input{case-study/counts-disease.tex}
}%
\subfloat[]{
\label{tbl:counts:3}\input{case-study/counts-prop.tex}
}%
\subfloat[]{
\label{tbl:counts:4}\input{case-study/counts-all.tex}
}
\caption{A sample of four diseases and four hours from \code{hod2} data frame.}
\label{tbl:counts}
\end{table}
Next we compute a distance between the temporal pattern of each cause of death and the overall temporal pattern. There are many ways to measure this distance, but I found a simple mean squared deviation to be revealing. We also record the sample size, the total number of deaths from that cause. To ensure that the diseases we consider are sufficiently representative we'll only work with diseases with more than 50 total deaths ($\sim$2/hour).
\begin{Verbatim}
devi <- ddply(hod2, "cod", summarise,
n = sum(freq),
dist = mean((prop - prop_all)^2))
devi <- subset(devi, n > 50)
\end{Verbatim}
We don't know the variance characteristics of this estimator, but we can explore it visually by plotting \code{n} vs.\ \code{deviation}, Figure~\subref*{fig:deviation-raw}. On a linear scale, the plot shows little, except that variability decreases with sample size. But on the log-log scale, Figure~\subref*{fig:deviation-log}, there is a clear pattern. This is particularly easy to see the pattern when we add the line of best fit from a robust linear model.
\begin{Verbatim}
ggplot(data = devi, aes(x = n, y = dist) + geom_point()
last_plot() +
scale_x_log10() +
scale_y_log10() +
geom_smooth(method = "rlm", se = F)
\end{Verbatim}
\begin{figure}[htbp]
\centering
\subfloat[Linear scales]{
\label{fig:deviation-raw}
\includegraphics[width=0.5\linewidth]{case-study/n-dist-raw.pdf}
}%
\subfloat[Log scales]{
\label{fig:deviation-log}
\includegraphics[width=0.5\linewidth]{case-study/n-dist-log.pdf}
}
\caption{(a) Plot of n vs deviation. Variability of deviation is dominated by sample size: small samples have large variability. (b) Log-log plot makes it easy to see the pattern of variation as well as unusually high values. The blue line is a robust line of best fit.}
\label{fig:deviation}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\linewidth]{case-study/n-dist-resid}
\caption{Residuals from a robust linear model predicting $\log(dist)$ by $\log(n)$. Horizontal line at 1.5 shows threshold for further exploration.}
\label{fig:devi-resid}
\end{figure}
We are interested in points that have high $y$-values, relative to their $x$-neighbours. Controlling for the number of deaths, these points represent the diseases which depart the most from the overall pattern.
To find these unusual points, we fit a robust linear model and plot the residuals, Figure~\ref{fig:devi-resid}. The plot shows an empty region around a residual of 1.5. So somewhat arbitrarily, we'll select those diseases with a residual greater than 1.5. We do this in two steps: first, we select the appropriate rows from \code{devi} (one row per disease), and then we find the matching temporal course information from the original summary dataset (24 rows per disease).
\begin{Verbatim}
devi$resid <- resid(rlm(log(dist) ~ log(n), data = devi))
unusual <- subset(devi, resid > 1.5)
hod_unusual <- match_df(hod2, unusual)
\end{Verbatim}
Finally, we plot the temporal course for each unusual cause, Figure~\ref{fig:disease}. We split the diseases into two plots because of differences in variability. The top plot shows diseases with over 350 deaths and the bottom with under 350. The causes of death fall into three main groups: murder, drowning, and transportation related. Murder is more common at night, drowning in the afternoon, and transportation related deaths during commute times. The pale gray line in the background shows the temporal course across all diseases.
\begin{Verbatim}
ggplot(data = subset(hod_unusual, n > 350), aes(x = hod, y = prop)) +
geom_line(aes(y = prop_all), data = overall, colour = "grey50") +
geom_line() +
facet_wrap(~ disease, ncol = 3)
\end{Verbatim}
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\textwidth]{case-study/unusual-big}
\includegraphics[width=0.9\textwidth]{case-study/unusual-sml}
\caption{Causes of death with unusual temporal courses. Overall hourly death rate shown in grey. (Top) Causes of death with more than 350 deaths over a year. (Bottom) Causes of death with fewer than 350 deaths. Note that the y-axes are on different scales.}
\label{fig:disease}
\end{figure}
\section{Discussion}
\label{sec:discussion}
Data cleaning is an important problem, but it is an uncommon subject of study in statistics. This paper carves out a small but important subset of data cleaning that I've called data tidying: structuring datasets to facilitate manipulation, visualisation and modelling. There is still much work to be done. Incremental improvements will happen as our understanding of tidy data and tidy tools improves, and as we improve our ability to reduce the friction of getting data into a tidy form.
Bigger improvements may be possible by exploring alternative formulations of tidiness. There is a chicken-and-egg problem with tidy data: if tidy data is only as useful as the tools that work with it, then tidy tools will be inextricably linked to tidy data. This makes it easy to get stuck in a local maxima where independently changing data structures or data tools will not improve workflow. Breaking out of this local maxima is hard. It requires long-term concerted effort with the prospect of many false starts. While I hope that the tidy data framework is not one of those false starts, I also don't see it as the final solution. I hope others will build on this framework to develop even better data storage strategies and better tools.
Surprisingly, I have found few principles to guide the design of tidy data, which acknowledge both statistical and cognitive factors. To date, my work has been driven by my experience doing data analysis, my knowledge of relational database design, and my own rumination on the tools of data analysis. The human factors, user-centered design, and human-computer interaction communities may be able to add to this conversation, but the design of data and tools to work with it has not been an active research topic in those fields. In the future, I hope to use methodologies from these fields (user-testing, ethnography, talk-aloud protocols) to improve our understanding of the cognitive side of data analysis, and to further improve our ability to design appropriate tools.
Other formulations of tidy data are possible. For example, it would be possible to construct a set of tools for dealing with values stored in multidimensional arrays. This is a common storage format for large biomedical datasets generated by microarrays or fMRI's. It's also necessary for many multivariate methods based on matrix manipulation. Fortunately, because there are many efficient tools for working with high-dimensional arrays, even sparse ones, such an array-tidy format is not only likely to be quite compact and efficient, it should also be able to easily connect with the mathematical basis of statistics. This, in fact, is the approach taken by the Pandas python data analysis library \citep{mckinney:2010}. Even more interestingly, we could consider tidy tools that can ignore the underlying data representation and automatically choose between array-tidy and dataframe-tidy formats to optimise memory usage and performance.
Apart from tidying, there are many other tasks involved in cleaning data: parsing dates and numbers, identifying missing values, correcting character encodings (for international data), matching similar but not identical values (created by typos), verifying experimental design, and filling in structural missing values, not to mention model-based data cleaning that identifies suspicious values. Can we develop other frameworks to make these tasks easier?
% While the tools that power this work grew out of my personal struggle to work with data, the framework that hooks them all together did not develop until I had to teach data cleaning. I could look at a dataset and intuit what needed to be done to it, but I couldn't explain what I was doing, and I found it very difficult to teach. This description of tidy data in this paper easier to teach because students are pretty good at identifying variables and values, and then there is a straightforward path to follow to get data in the right format.
\section{Acknowledgements}
\label{sec:acknowledgements}
This work wouldn't be possible without the many conversations I've had about data and how to deal with them statistically. I'd particularly like to thank Phil Dixon, Di Cook, and Heike Hofmann, who have put up with numerous questions over the years. I'd also like to thank the users of the \pkg{reshape} package who have provided many challenging problems, and my students who continue to challenge me to explain what I know in a way that they can understand. I'd also like to thank Bob Muenchen, Burt Gunter, Nick Horton and Garrett Grolemund who gave detailed comments on earlier drafts, and to particularly thank Ross Gayler who provided the nice example of the challenges of defining a variable and Ben Bolker who showed me the natural equivalence between a paired t-test and a mixed effects model.
% bibtool -x tidy-data.aux -c > references.bib
\bibliography{references}
\end{document}