Intro to Data Analysis With Python
In the wild, data is dirty and often disorganized. To make a resource useful you may need to filter it, modify it, and combine it with other resources. Arriving at your final, useful dataset may require performing these operations thousands of times in sequence, and this is where Python comes in handy. You may also need to retrace your steps to do it all again, whether to teach somebody else or to recover after a hard drive crash. (Back up your hard drive!)
Data tools in Python allow you to create organized sets of operations called "data pipelines" that assemble and clean raw data until you arrive at a dataset you can actually use.
This tutorial will begin with a brief guide to loading datasets into Python, followed by an introduction to exploratory data analysis (EDA), and conclude with what comes after the EDA process. After this tutorial, you should be prepared to turn raw data into clean data for analysis.
This tutorial assumes that you have knowledge of package managers in Python like conda and pip, a basic understanding of working in Python (how to write if/else/while statements, assign variables, etc.), and at least a passing familiarity with the most commonly used tools in its standard library. See here for a tutorial on Python.
- Jupyter Notebooks (optional integrated development environment (IDE))
One of the easiest ways to import data into Python is through the Pandas library.
Start by importing the Pandas library:
import pandas as pd
Then call this function on your CSV file:
df = pd.read_csv('your_file.csv')  # replace with the path to your CSV file
Other ways to import data to Python
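For instance, Pandas includes reader functions for many other common formats. A minimal sketch (all filenames here are hypothetical placeholders):

```python
import pandas as pd

# All filenames below are placeholders; substitute your own paths.
df = pd.read_excel('data.xlsx')       # Excel workbooks (needs openpyxl installed)
df = pd.read_json('data.json')        # JSON files
df = pd.read_parquet('data.parquet')  # Parquet files (needs pyarrow or fastparquet)
```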
EDA is the process of investigating a new dataset and cataloguing its features. Broadly, it's the process of getting to know your data, getting it into the right format, and identifying any inconsistencies it might have. EDA should always be your first step when you get a new dataset, even if it's brief; otherwise your conclusions may not mean what you think they do.
EDA is very personalized: it's really about learning to think deeply about a new dataset and cover your bases in a methodical way, while keeping an eye out for any interesting trends. The links below are provided as examples, but none is an authoritative workflow.
- A General Intro To EDA A conceptual introduction to the thought process of EDA.
- YouTube EDA Example A quick investigation of a dataset.
- Another YouTube EDA Example
- Kaggle EDA Example One example of an EDA process with executable code. There are MANY notebooks on Kaggle that involve an EDA, it's a good idea to google around and see how other people have approached the thought process.
Here are some options for procedures that may be useful in the EDA process:
- Checking the overall structure of your data (size, shape)
- Joining more data if you find it lacking
- Grouping the data into categories that may be helpful to answer the goal of the project
- Cleaning the data to deal with missing values, outliers and mismatched data
- Attribute: A value associated with an object or class, referenced by name using dot notation.
- Method: A function that belongs to a class and typically performs an action or operation.
- `df.head()` returns the first rows of the dataframe (5 by default).
- `df.info()` summarizes the dataframe (column names, dtypes, non-null counts).
- `df.describe()` returns descriptive statistics for each numeric column (count, mean, standard deviation, min, quartiles, max).
- `df.shape` returns a tuple with the shape of the dataframe (ex: (2, 3) for a dataframe with 2 rows and 3 columns).
- `df.size` returns the number of cells in the dataframe.
- `df.value_counts()` can be called on a series or a dataframe and counts the occurrences of each unique value.
Additional attributes and methods of a Pandas DataFrame
Note: `df` should be replaced with the name of your dataframe. For example, `turtles.shape` if your dataframe is named `turtles`.
Other commonly used methods for combining, filtering and cleaning data:
- `df.merge()` joins two dataframes on shared columns or indexes.
- `pd.concat()` stacks dataframes on top of, or next to, each other.
- `df.join()` joins dataframes on their indexes.
- `df.groupby()` groups rows by one or more columns so you can aggregate within each group.
- `df.sort_values()` sorts rows by the values in one or more columns.
- `df[condition]` filters rows using a boolean condition.
- `df.iloc[]` selects rows and columns by integer position.
- `df.loc[]` selects rows and columns by label.
- `df.duplicated()` flags duplicate rows.
- `df.drop_duplicates()` removes duplicate rows.
- `pd.to_datetime()` parses strings into datetime values.
- `df.isnull()` / `df.isna()` flag missing values.
- `df.notnull()` / `df.notna()` flag non-missing values.
- `df.fillna()` fills missing values.
- `df.replace()` replaces specified values.
- `df.dropna()` drops rows (or columns) with missing values.
- `df.astype()` converts a column to a different data type.
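Here is a minimal sketch of how a few of these might be combined into a small cleaning pass; the dataframes, column names, and values are invented for illustration:

```python
import pandas as pd

# Made-up raw data showing the kinds of problems listed above:
# a duplicate row, a missing value, dates and numbers stored as strings
raw = pd.DataFrame({
    'species': ['snapping', 'painted', 'painted', None],
    'observed': ['2023-01-05', '2023-01-06', '2023-01-06', '2023-01-07'],
    'length_cm': ['35.0', '14.5', '14.5', '12.0'],
})

clean = raw.drop_duplicates()                          # remove exact duplicate rows
clean = clean.dropna(subset=['species'])               # drop rows missing a species
clean['observed'] = pd.to_datetime(clean['observed'])  # parse date strings
clean['length_cm'] = clean['length_cm'].astype(float)  # fix the numeric dtype

# Combine with a second (also invented) table on a shared key
habitats = pd.DataFrame({'species': ['snapping', 'painted'],
                         'habitat': ['pond', 'lake']})
clean = clean.merge(habitats, on='species', how='left')

# Group and summarize the cleaned data
print(clean.groupby('species')['length_cm'].mean())
```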
After your data has been cleaned and you have a basic understanding of what it looks like, the next step is to visualize the data. You may want to return to add new columns, group items, and fix formatting based on what you learn from the visualizations. Data analysis is an iterative process and this part is all about exploring what stories your data might hold. Please see the data visualization tutorial if you want help getting started! After visualization, you might want to continue the data analysis with statistical analyses, regression or machine learning. These can be done using specialized statistics libraries. Look in the tutorials and guides section for more information.
NumPy is the library that underlies most Python data tools. It is more granular and allows many optimized mathematical operations for working with large arrays. It is especially useful for performing linear algebra operations like matrix multiplies, which are ubiquitous in machine learning and deep learning. Pandas is based on NumPy, and many of its data structures and operations act the way they do because they are built on top of NumPy's code and philosophy. For a deeper understanding of how to manipulate data, a working knowledge of NumPy can be very powerful.
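As a small taste of this array-oriented style, the sketch below shows elementwise arithmetic and a matrix multiplication:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

print(a * 2)           # elementwise: every value doubled
print(a + b)           # elementwise addition
print(a @ b)           # matrix multiplication: [[19, 22], [43, 50]]
print(a.mean(axis=0))  # column means: [2., 3.]
```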
- Official NumPy Tutorial How to get up and running with NumPy.
- NumPy Illustrated Graphical guide to NumPy with some good visualized explanations of how things work.
- NumPy For Your Grandma From scratch tutorial covering the fundamental NumPy operations and data structures.
- FreeCodeCamp Tutorial Video covering high-level operations in NumPy and using NumPy data structures.
Pandas is the workhorse of Python data analysis. Its dataframe data structure makes a huge variety of tools available. In addition, Pandas is supported by a great variety of packages in Python for specialized data analysis and machine learning, which makes it a valuable core competency.
- Official Pandas Tutorial Up-to-date and well-maintained tutorial focused on getting you up to speed quickly.
- Daniel Chen Pandas Tutorial Good in-depth video walkthrough showing a full data analysis with explanations.
- Brandon Rhodes Pandas Tutorial Considered by many to be the definitive intro to Pandas. Be aware that Pandas has changed in small ways since this was filmed, so you may need to google if the code examples don't work exactly as shown.
Jupyter Notebooks are a useful interface for doing data analysis. Most Pandas practitioners use Jupyter at least a little, since the two tools are very well integrated; Jupyter also makes results look nice and makes live coding a much cleaner exercise.
- Jupyter Install Tutorial How to install and get started with Jupyter
- Jupyter Notebook Tutorial An overview of how to use the high level interface and keybindings for Jupyter notebooks.
Once you have your data organized, there are a number of options for data processing, drawing statistical conclusions, or building machine learning models. Explaining the inner workings and theory of these packages is beyond the scope of this tutorial, but they are very powerful and useful tools if you want to investigate. In some cases they can help with basic tasks like finding outliers using statistics-guided approaches (a brief sketch follows the list below).
- scikit-learn The standard for performing general machine learning and testing tasks in Python.
- statsmodels Includes various specialized and basic statistical techniques, with more comprehensive human-readable output than scikit-learn. Useful for frequentist statistics tasks.
- SciPy Useful for performing optimized numeric operations.
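As a brief example of one such statistics-guided task, the sketch below flags outliers with scikit-learn's IsolationForest on invented data; simpler approaches like z-scores or IQR fences also work:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up 1-D measurements with one obvious outlier (95.0)
lengths = np.array([14.5, 15.2, 14.9, 15.0, 95.0]).reshape(-1, 1)

# fit_predict returns 1 for inliers and -1 for flagged outliers
model = IsolationForest(contamination=0.2, random_state=0)
print(model.fit_predict(lengths))  # e.g. [ 1  1  1  1 -1]
```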
Ryan Swan
Zakary Lang