
Data_Wrangling_Project

Udacity Data Wrangling Project

The data wrangling process involved gathering, assessing, cleaning, and saving the cleaned data.

Gathering Phase: The project drew on three data sources. The first was provided directly and only needed to be uploaded and read into the project workspace. The second was downloaded programmatically from the Udacity server using the requests library. The third was gathered by querying each tweet_id through the Twitter API with the Tweepy library, after setting up a Twitter Developer Account. A sketch of this step appears below.

Assessing Phase: I used both visual and programmatic assessment to detect quality and tidiness issues in the data. Visual assessment was done in Microsoft Excel, and programmatic assessment in the project workspace (a Jupyter notebook). I detected 8 quality issues and 3 tidiness issues.

Cleaning Phase: The cleaning phase proved to be the most tedious part of the project. After making a copy of each table, I cleaned the first table as follows. First, I removed the tweets that were retweets, as the project guidelines required, by filtering them out of the dataset. The dog names in the archive I was given had not been extracted properly, so I used regular expressions (the re library) to extract them into a new column named dog_names. I also used a regex to extract the numerators of decimal ratings, because the rating_numerator values were incorrect wherever the rating was a decimal. My assessment further showed that rows with denominators greater than 10 referred to two or more dogs, so I created a new column, rating, that normalizes all the rating_numerator values. Finally, I melted the doggo, floofer, pupper, and puppo columns into a single column and dropped unnecessary columns from the table. In the second table I removed duplicate rows, and in the third table I dropped columns I felt were not germane.

After addressing all the issues detected in the assessing phase, I merged the datasets on tweet_id and saved the result to a CSV file named twitter_archive_master.csv, as instructed. Code sketches for the main steps follow.
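A minimal sketch of the gathering step, using the Tweepy v3 interface; the download URL, file names, and API keys below are placeholders, not the exact values from the project:

```python
import requests
import tweepy

# Download the second source programmatically from the Udacity server.
# The URL here is a placeholder for the one given in the project instructions.
url = 'https://example.com/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', 'wb') as f:
    f.write(response.content)

# Query each tweet by ID through the Twitter API; all keys are placeholders.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

tweet_ids = []  # filled from the archive's tweet_id column
for tweet_id in tweet_ids:
    try:
        status = api.get_status(tweet_id, tweet_mode='extended')
        # ... write status._json out to a file for later reading
    except tweepy.TweepError:
        pass  # deleted tweets raise an error and are skipped
```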
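Programmatic assessment relied on standard pandas inspection calls along these lines (the archive file name is an assumption based on the project's conventions):

```python
import pandas as pd

archive = pd.read_csv('twitter-archive-enhanced.csv')

# Standard programmatic checks: datatypes, nulls, duplicates, value ranges.
archive.info()
print(archive.duplicated().sum())
print(archive.describe())
```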
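Filtering out retweets reduces to a one-line filter, assuming the archive's retweeted_status_id column is null for original tweets:

```python
import pandas as pd

archive = pd.read_csv('twitter-archive-enhanced.csv')

# Keep only original tweets: retweets carry a non-null retweeted_status_id.
archive = archive[archive.retweeted_status_id.isnull()]
```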
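A sketch of the two regex extractions; the exact patterns used in the project may differ, and the "This is <Name>" convention is an assumption about how names appear in the tweet text:

```python
# Extract names into the new dog_names column; many tweets begin 'This is <Name>'.
archive['dog_names'] = archive.text.str.extract(r'This is ([A-Z][a-z]+)', expand=False)

# Re-extract numerators so decimal ratings such as 13.5/10 survive intact.
ratings = archive.text.str.extract(r'(\d+\.?\d*)/(\d+)')
archive['rating_numerator'] = ratings[0].astype(float)
archive['rating_denominator'] = ratings[1].astype(float)
```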
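Normalizing the ratings and combining the four stage columns might look like this. Note that dog_stage is a hypothetical name for the merged column (the write-up only says "a single column"), and this sketch collapses the columns row-wise with a backfill rather than a literal pd.melt, which would need an extra de-duplication step:

```python
import numpy as np

# Normalize multi-dog ratings onto a common /10 scale.
archive['rating'] = archive.rating_numerator / archive.rating_denominator * 10

# Collapse the four stage columns into one; each holds the stage name
# for matching tweets and the string 'None' otherwise.
stages = ['doggo', 'floofer', 'pupper', 'puppo']
archive[stages] = archive[stages].replace('None', np.nan)
archive['dog_stage'] = archive[stages].bfill(axis=1).iloc[:, 0]
archive = archive.drop(columns=stages)
```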
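Finally, merging the three cleaned tables on tweet_id and saving the master file; the variable and file names for the cleaned tables are placeholders:

```python
import pandas as pd

archive = pd.read_csv('archive_clean.csv')                 # placeholder names
image_predictions = pd.read_csv('predictions_clean.csv')   # for the three
tweet_stats = pd.read_csv('tweet_stats_clean.csv')         # cleaned tables

# Merge on tweet_id and save the master dataset, as instructed.
master = (archive
          .merge(image_predictions, on='tweet_id')
          .merge(tweet_stats, on='tweet_id'))
master.to_csv('twitter_archive_master.csv', index=False)
```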
