Getting and Cleaning Data Course Project
The script in this repository is intended to satisfy the project requirements for Getting and Cleaning Data, June 2015. In short: we've been given some messy data spread out over several files; our task is to create a new, tidy dataset by merging and summarizing.
The data in question involve measurements of three-dimensional motion for thirty individuals. Multiple pieces of information must be aggregated:
- An X dataset, where the (i, j)th entry records the jth motion measurement for the ith observation of some individual;
- A Y dataset, where the entry in the ith row indicates the type of activity taking place when the motion was recorded in the corresponding row of X;
- A subject dataset, where the entry in the ith row indicates the individual in question from the corresponding row of X;
- A dataset of labels describing the motion recordings;
- A dataset of labels describing the activity taking place;
- ... where the X, Y, and subject datasets come in 'train' and 'test' varieties.
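As a rough sketch of how these pieces fit together, the example below assembles tiny toy stand-ins for the files in base R. All values, the `code` column name, and the object names are made up for illustration; the real files are much larger and are read from disk in run_analysis.r.

```r
# Toy stand-ins for the raw files (hypothetical values, not the real data).
feature_labels  <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
activity_labels <- data.frame(code = 1:2,
                              activity = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)

x_train       <- data.frame(V1 = c(0.28, 0.27), V2 = c(-0.99, -0.98), V3 = c(-0.93, -0.94))
y_train       <- data.frame(code = c(1, 2))
subject_train <- data.frame(subject_id = c(1, 1))

x_test       <- data.frame(V1 = 0.26, V2 = -0.97, V3 = -0.92)
y_test       <- data.frame(code = 2)
subject_test <- data.frame(subject_id = 2)

# Within each set, subject, Y, and X have one row per observation,
# so they can be bound side by side.
train <- cbind(subject_train, y_train, x_train)
test  <- cbind(subject_test, y_test, x_test)

# Train and test share the same columns, so they can be stacked.
merged <- rbind(train, test)

# Attach the descriptive labels: motion labels become column names,
# activity codes are replaced by activity names.
names(merged)[3:5] <- feature_labels
merged$activity <- activity_labels$activity[match(merged$code, activity_labels$code)]
merged$code <- NULL

print(merged)
```

The result is a single 'wide' data frame with one row per observation and one column per motion measurement, plus subject_id and activity.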
My end goal is a dataset that looks something like this (for the first six rows):
subject_id | activity | motion_label | mean |
---|---|---|---|
1 | WALKING | tBodyAcc-mean()-X | 0.27733076 |
1 | WALKING | tBodyAcc-mean()-Y | -0.01738382 |
1 | WALKING | tBodyAcc-mean()-Z | -0.11114810 |
1 | WALKING | tBodyAcc-std()-X | -0.28374026 |
1 | WALKING | tBodyAcc-std()-Y | 0.11446134 |
1 | WALKING | tBodyAcc-std()-Z | -0.26002790 |
which fits my interpretation of 'tidy': the mean variable gives the mean value of a motion measurement x (motion_label) for individual i (subject_id) undertaking activity j (activity).
All of this is much less complicated than it sounds. Within each of the 'train' and 'test' sets, the X, Y, and subject files have the same number of rows, and the two sets share the same columns, so the files can simply be bound together. Then it's a matter of using a few dplyr and tidyr functions to construct the tidy dataset:
- Select only the mean and standard deviation values;
- Note that the 'wide' dataset represents four variables (subject_id, activity, motion_label, and value), but the motion values are spread out across many columns. So we melt them into a longer / thinner dataset with tidyr's gather function;
- Take the mean of each motion type for each subject / activity combination by grouping and then summarizing.
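The three steps above can be sketched on a toy 'wide' data frame as follows. This is a minimal illustration, not the code from run_analysis.r: the values are invented, and the pattern used to pick out the mean and standard deviation columns (names containing `mean()` or `std()`) is an assumption about how that selection might be done.

```r
library(dplyr)
library(tidyr)

# Toy 'wide' data (hypothetical values): one row per observation,
# one column per motion measurement.
wide <- data.frame(
  subject_id = c(1, 1, 2),
  activity = c("WALKING", "WALKING", "SITTING"),
  `tBodyAcc-mean()-X` = c(0.2, 0.4, 0.3),
  `tBodyAcc-std()-X` = c(-0.9, -0.7, -0.8),
  `tBodyAcc-max()-X` = c(-0.5, -0.6, -0.4),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

tidy <- wide %>%
  # 1. Keep the identifiers plus the mean() and std() measurements
  #    (assumed selection rule; the max column is dropped).
  select(subject_id, activity, matches("mean\\(\\)|std\\(\\)")) %>%
  # 2. Melt the measurement columns into motion_label / value pairs.
  gather(motion_label, value, -subject_id, -activity) %>%
  # 3. Average each measurement per subject / activity combination.
  group_by(subject_id, activity, motion_label) %>%
  summarize(mean = mean(value)) %>%
  ungroup()

print(tidy)
```

Each row of the result holds one subject, one activity, one motion label, and the mean of that measurement. In newer versions of tidyr, pivot_longer does the same job as gather.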
Check the source in run_analysis.r for more details / comments.