'80% of time in data science and analysis is spent on data cleaning'
This includes:
• Loading multiple sources of data
• Consolidating data for analysis
• Reshaping and joining datasets
• Dealing with missing values, duplicates and outliers
• Cleaning strings
• An estimated $3 trillion of US GDP was lost to poor-quality data in 2016 (IBM)
• 1 in 3 business leaders did not trust the data sources used in decision-making
'Garbage in leads to garbage out'
• Understanding data through initial cleaning and exploration
• Reduces the risk of incorrect assumptions
• Raises relevant questions
• Discovery of issues such as biases in data collection
• Opportunities to problem solve for unique datasets
• Setup to extract additional insight
• Setup to emphasise particular questions
This project centres around cleaning six dirty datasets [Folder]
• Task 1 - Decathlon Events [Analysis]
• Task 2 - Cake Ingredients [Analysis]
• Task 3 - Seabirds Spottings [Analysis]
• Task 4 - Sweeties Survey [Analysis]
• Task 5 - Right Wing Authoritarianism [Analysis]
• Task 6 - Dogs [Analysis]
Each solution includes:
• Cleaning script
• Commentary, assumptions and process
• Answers to questions
• here
• janitor
• readr
• tidyverse
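As a rough illustration of how the packages above might be combined in a cleaning script, here is a minimal sketch. The data frame and its column names are invented for the example and are not taken from any of the six tasks. dplyr and stringr are loaded individually here, although both are part of tidyverse. Treating a missing amount as 0 is purely an assumption for demonstration.

```r
library(dplyr)
library(stringr)
library(janitor)

# Hypothetical messy input; not one of the project datasets
raw <- tibble::tibble(
  `Item Name` = c(" Apple", "apple ", "Pear", "Pear", NA),
  `Amount`    = c(3, 3, NA, 5, 2)
)

clean <- raw %>%
  clean_names() %>%                                          # column names become snake_case
  mutate(item_name = str_to_lower(str_trim(item_name))) %>%  # trim whitespace, unify case
  filter(!is.na(item_name)) %>%                              # drop rows with no item name
  distinct() %>%                                             # remove exact duplicate rows
  mutate(amount = coalesce(amount, 0))                       # assumption: missing amount means 0

print(clean)
```

Each step maps onto one of the cleaning activities listed earlier: tidying names and strings, removing duplicates, and deciding how to treat missing values. The right choice for missing values depends on the dataset and should be recorded in the task's commentary.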
raw_data
data_cleaning_scripts
clean_data
documentation_and_analysis
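
One possible way to move data between these folders with here and readr is sketched below. The file names and the `clean_file` helper are hypothetical. Only the folder names come from the layout above, and `unique` stands in for whatever cleaning steps a given task needs.

```r
library(here)
library(readr)

# Build paths relative to the project root rather than the working directory.
# "example.csv" and "example_clean.csv" are hypothetical file names.
raw_path   <- here("raw_data", "example.csv")
clean_path <- here("clean_data", "example_clean.csv")

# Hypothetical helper: read a raw file, apply a cleaning function, write the result.
clean_file <- function(input, output, cleaner = unique) {
  raw <- read_csv(input, show_col_types = FALSE)
  clean <- cleaner(raw)        # placeholder for task-specific cleaning
  write_csv(clean, output)
  invisible(clean)
}
```

A script in data_cleaning_scripts could then call `clean_file(raw_path, clean_path)`, so that the files in raw_data are never overwritten and the analysis reads only from clean_data.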