GACDProject

Getting and Cleaning Data Course Project

The script contained in this repository is intended to satisfy the project requirements for Getting and Cleaning Data (June 2015). In short: we've been given some messy data spread out over several files; our task is to create a new, tidy dataset by merging and summarizing them.

The data in question involve measurements of three-dimensional motion for thirty individuals. Multiple pieces of information must be aggregated:

  1. An X dataset, where the (i, j)th entry records motion type j for some individual;
  2. A Y dataset, where the entry in the ith row indicates the type of activity taking place when the motion was recorded in the corresponding row of X;
  3. A subject dataset, where the entry in the ith row indicates the individual in question from the corresponding row of X;
  4. A dataset of labels describing the motion recordings;
  5. A dataset of labels describing the activity taking place;
  ... where the X, Y, and subject datasets each come in 'train' and 'test' varieties. (A reading sketch follows just below this list.)
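For concreteness, here is a minimal sketch of the reading and combining step. The directory and file names assume the standard 'UCI HAR Dataset' layout and are not taken from this README, so the details may not match run_analysis.r exactly.

    # Read one of the 'train' / 'test' portions and bolt the subject and
    # activity columns onto the front of the X measurements.
    read_part <- function(part) {
      base    <- file.path("UCI HAR Dataset", part)
      x       <- read.table(file.path(base, paste0("X_", part, ".txt")))
      y       <- read.table(file.path(base, paste0("y_", part, ".txt")))
      subject <- read.table(file.path(base, paste0("subject_", part, ".txt")))
      cbind(subject_id = subject[[1]], activity_id = y[[1]], x)
    }

    features   <- read.table(file.path("UCI HAR Dataset", "features.txt"),
                             col.names = c("index", "motion_label"))
    activities <- read.table(file.path("UCI HAR Dataset", "activity_labels.txt"),
                             col.names = c("activity_id", "activity"))

    # Train and test share the same columns, so their rows can simply be stacked.
    full <- rbind(read_part("train"), read_part("test"))

    # Name the measurement columns after the feature labels; make.unique() guards
    # against the handful of duplicated names in features.txt.
    names(full)[-(1:2)] <- make.unique(as.character(features$motion_label))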

My end goal is a dataset that looks something like this (for the first six rows):

subject_id  activity  motion_label        mean
         1  WALKING   tBodyAcc-mean()-X    0.27733076
         1  WALKING   tBodyAcc-mean()-Y   -0.01738382
         1  WALKING   tBodyAcc-mean()-Z   -0.11114810
         1  WALKING   tBodyAcc-std()-X    -0.28374026
         1  WALKING   tBodyAcc-std()-Y     0.11446134
         1  WALKING   tBodyAcc-std()-Z    -0.26002790

which fits my interpretation of 'tidy': the mean variable gives the mean value of a motion measurement x (motion_label) for individual i (subject_id) undertaking activity j (activity).

All of this is much less complicated than it sounds. The train and test files share the same columns, so their rows can simply be stacked, and the X, Y, and subject pieces line up row for row, so they can be bound together side by side. Then it's a matter of a few dplyr (and tidyr) functions to construct the tidy dataset, sketched in the code after this list:

  1. Select only the mean and standard deviation values;
  2. Note that the 'wide' dataset really represents just four variables (subject_id, activity, motion_label, and value), but the motion values are spread across one column per measurement. So we melt them into a longer, thinner dataset with tidyr's gather function;
  3. Take the mean over motion types for each subject / activity by grouping and then summarizing.
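A minimal sketch of that pipeline, reusing the full and activities objects from the earlier sketch (the details here are illustrative and may differ from run_analysis.r):

    library(dplyr)
    library(tidyr)

    tidy <- full %>%
      # keep the identifiers plus only the mean() and std() measurements
      select(subject_id, activity_id, matches("mean\\(\\)|std\\(\\)")) %>%
      # melt the wide measurement columns into (motion_label, value) pairs
      gather(motion_label, value, -subject_id, -activity_id) %>%
      # swap the numeric activity code for its descriptive label
      inner_join(activities, by = "activity_id") %>%
      # one mean per subject / activity / motion measurement
      group_by(subject_id, activity, motion_label) %>%
      summarize(mean = mean(value)) %>%
      ungroup()

    head(tidy)   # first rows should resemble the example above (ordering may differ)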

Check the source in run_analysis.r for more details / comments.
