Wrangling WeRateDogs

Please check the Jupyter notebook in:

Introduction

Real-world data rarely comes clean. In this project, I use Python and its libraries to gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, and then clean it. This process is called data wrangling. The project was completed for the Udacity Nanodegree Data Wrangling Project.

The dataset that I wrangle (and analyze and visualize) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. The data comes in three parts: a local file provided by Udacity, a file downloaded from a Udacity server, and data retrieved from the Twitter API.

The goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

Gathering The Data

In this part, I gather data from three sources: a local file provided by Udacity, a file hosted on a Udacity server, and data retrieved from the Twitter API.
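As an illustration of the third source, here is a minimal sketch of turning stored API responses into records. It assumes the tweets were saved as one JSON object per line, which the notebook may or may not do. The helper name `parse_tweet_lines` is hypothetical, and the fields `id`, `retweet_count`, and `favorite_count` are standard Twitter API tweet fields.

```python
import json


def parse_tweet_lines(lines):
    """Extract the ID and engagement counts from JSON-encoded tweets, one per line.

    Hypothetical helper: the one-tweet-per-line layout is an assumption.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Made-up sample lines standing in for the stored API output.
sample = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_lines(sample)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame` to build the third table.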

Assessing The Data

In this part, I perform the following checks:

Check the number of rows in each table
Check the data type of each column
Check the values in each column
Check for missing values
Check the descriptive statistics
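The checks above could be sketched in pandas as follows. The small table and its column names are made up for illustration and only stand in for the real archive.

```python
import pandas as pd

# Tiny stand-in for the archive table; values and column names are illustrative.
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "name": ["Phineas", "None", "a"],
    "rating_numerator": [13, 13, 12],
    "in_reply_to_status_id": [None, None, 8.9e17],
})

n_rows = len(archive)                         # length of the data
dtypes = archive.dtypes                       # type of each column
name_counts = archive["name"].value_counts()  # values in a column
missing = archive.isna().sum()                # missing values per column
summary = archive.describe()                  # descriptive statistics
```

Note that `isna()` does not count the string "None" as missing, which is one way the inconsistent missing-value markers show up during assessment.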

The issues I found are listed below.

Quality issues:

Some rows are not original tweets
The tweet_id format in the third dataset differs from the first, which may cause problems when joining the two tables
The tweet_id column is in a different position in the third table than in the other tables, which makes the ID harder to find
The timestamp column in the first table is not in datetime format
Missing values are not represented uniformly: sometimes NaN, sometimes None
Some columns have more than 90% missing values, and some dog names are just a single character ('a')
The retweeted and favorited columns have the same value in every row
The source column contains HTML markup
The expanded_urls and jpg_urls columns contain duplicate values

Tidiness issues:

The dog stage should be one column instead of four
All the data should be joined into one table to make analysis easier
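The two tidiness fixes might look like the sketch below. It assumes the four stage columns are named `doggo`, `floofer`, `pupper`, and `puppo` and hold either the stage name or the string "None"; the helper `combine_stages` and the sample tables are hypothetical.

```python
import pandas as pd

# Assumed names of the four stage columns.
STAGES = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df):
    """Collapse the four stage columns into a single dog_stage column."""
    out = df.copy()

    def join_row(row):
        # A stage is present when the cell repeats the column name.
        found = [stage for stage in STAGES if row[stage] == stage]
        return ",".join(found) if found else None

    out["dog_stage"] = out[STAGES].apply(join_row, axis=1)
    return out.drop(columns=STAGES)


# Made-up tables standing in for the archive and the API counts.
archive = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
counts = pd.DataFrame({"tweet_id": ["1", "2"], "retweet_count": [10, 3]})

combined = combine_stages(archive)
# Join on tweet_id; both sides use the same string type for the key.
tidy = combined.merge(counts, on="tweet_id", how="inner")
```

Rows with more than one stage are kept as a comma-separated value here; whether to keep, split, or drop such rows is a design choice this sketch does not settle.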

Cleaning and Tidying The Data

In this section, I fix the quality and tidiness issues identified during the assessment.
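A minimal sketch of a few of these fixes is shown below. It is not the notebook's exact code: the function `clean_archive` is hypothetical, the column `retweeted_status_id` is assumed to mark retweets, and treating the name 'a' as missing is also an assumption.

```python
import pandas as pd


def clean_archive(df):
    """Apply a few of the quality fixes to a copy of the archive table."""
    # Keep original tweets only (assumes retweets carry a retweeted_status_id).
    out = df[df["retweeted_status_id"].isna()].copy()
    # Store IDs as strings so they match across tables when joining.
    out["tweet_id"] = out["tweet_id"].astype(str)
    # Parse timestamps into a datetime column.
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    # Use NaN uniformly for missing names ("None" and the stray 'a').
    out["name"] = out["name"].mask(out["name"].isin(["None", "a"]))
    # Strip the HTML tags from the source column, keeping only the label.
    out["source"] = out["source"].str.replace(r"<[^>]+>", "", regex=True)
    return out.reset_index(drop=True)


# Made-up rows for illustration.
html = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
raw = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 5.0, None],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "name": ["None", "Rex", "Rex"],
    "source": [html, html, html],
})
clean = clean_archive(raw)
```

Working on a copy keeps the raw table intact, so each cleaning step can be checked against the original.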

Analyzing and Visualizing Data

In this section, I answer the following questions:

Are there any outliers in the data?
How are the variables correlated?
Do the retweet count and favorite count increase over time?
Does the rating increase over time?
Does the rating affect the favorite and retweet counts?
How often does each algorithm predict that the picture is a dog?
What are the most popular dog names?
Which dog is predicted most often?
Which dog is predicted most often when all algorithms predict the same dog?
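Two of these questions, the correlation between variables and the most popular names, could be approached as in the sketch below. The numbers and names are made up for illustration and are not results from the project.

```python
import pandas as pd

# Made-up values; the real analysis runs on the cleaned, joined table.
tweets = pd.DataFrame({
    "retweet_count": [10, 20, 30, 40],
    "favorite_count": [25, 45, 70, 90],
    "rating_numerator": [10, 12, 11, 13],
    "name": ["Charlie", "Lucy", "Charlie", None],
})

# Pairwise Pearson correlation between the numeric variables.
corr = tweets[["retweet_count", "favorite_count", "rating_numerator"]].corr()

# Most frequent dog names; missing names are ignored by default.
top_names = tweets["name"].value_counts().head(3)
```

A correlation matrix like `corr` is also a convenient input for a heatmap when visualizing the relationships.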

Predict dog_stage

In this section, I predict the missing dog_stage values from the third file.

