Intro to Data Analysis With Python

Overview

In the wild, data is often dirty and disorganized. To make a resource useful, you may need to filter it, modify it, and combine it with other resources. Arriving at your final, usable dataset may require performing these operations thousands of times in sequence, and this is where Python comes in handy. You may also need to retrace your steps and do it all again, whether to teach somebody else or to recover after a hard drive crash. (Back up your hard drive.)

Data tools in Python allow you to create organized sequences of operations called "data pipelines" that assemble and clean raw data until you arrive at a dataset you can actually use.

This tutorial will begin with a brief guide to loading datasets into Python, followed by an introduction to exploratory data analysis, and conclude with what comes after the EDA process. After this tutorial, you should be prepared to turn raw data into clean data for analysis.

Prerequisites

This tutorial assumes that you have knowledge of Python package managers like conda and pip, a basic understanding of working in Python (how to write if/else/while statements, assign variables, etc.), and at least a passing familiarity with the most commonly used tools in the standard library. See here for a tutorial on Python.

1: Loading Data into Python

One of the easiest ways to import data into Python is through the Pandas library.

Start by importing the Pandas library:

import pandas as pd

Then call the read_csv function with the path to your CSV file, assigning the result to a variable:

df = pd.read_csv('path/to/your_file.csv')
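
read_csv accepts many optional parameters for files that don't follow the default comma-separated layout. Here is a minimal sketch; the filename and options are illustrative assumptions, not a recipe for your particular file:

import pandas as pd

# Hypothetical semicolon-separated file with no header row
df = pd.read_csv(
    'example.csv',           # hypothetical filename
    sep=';',                 # column delimiter (default is ',')
    header=None,             # the file has no header row
    names=['id', 'value'],   # so we supply column names ourselves
)

print(df.head())  # quick sanity check that the load worked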

Other ways to import data into Python

2: Exploratory Data Analysis

EDA is the process of investigating a new dataset and cataloguing its features. Broadly, it is the process of getting to know your data, getting it into the right format, and identifying any inconsistencies it might have. EDA should always be your first step with a new dataset, even if it's brief; otherwise your conclusions may not mean what you think they do.

EDA is a personalized process: it is really about learning to think deeply about a new dataset and cover your bases methodically while keeping an eye out for interesting trends. The procedures below are provided as examples, not as an authoritative workflow.

Here are some options for procedures that may be useful in the EDA process:

  • Checking the overall structure of your data (size, shape)
  • Joining in more data if you find the dataset lacking
  • Grouping the data into categories that help answer the project's question
  • Cleaning the data to deal with missing values, outliers, and mismatched data

Checking the overall structure of your data:

Useful EDA methods and attributes of the Pandas DataFrame (df) type:

  • Attribute: A value associated with an object or class which is referenced by name using dot notation.

  • Method: A function that belongs to a class and typically performs an action or operation. 

df.head() returns the first rows of the DataFrame (five by default).

df.info() prints a summary of the DataFrame: column names, data types, non-null counts, and memory usage.

df.describe() returns descriptive statistics for the numeric columns (count, mean, standard deviation, min, max, and quartiles).

df.shape returns a tuple with the shape of the DataFrame (ex: (2, 3) for a DataFrame with 2 rows and 3 columns).

df.size returns the number of cells in the DataFrame.

df.value_counts() can be called on a Series or a DataFrame; it counts the occurrences of each unique value.

Additional attributes and methods of a Pandas DataFrame

Note:
"df" should be replaced with the name of your DataFrame. For example: turtles.shape, if my DataFrame is named turtles.

Joining more data if you find it lacking:

Useful methods for joining data:

df.merge() joins two DataFrames on one or more key columns.

pd.concat() stacks DataFrames along rows or columns.

df.join() joins DataFrames on their index (a convenience wrapper around merge).

More Information
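
As a quick sketch of how these fit together (the tables and column names here are made up for illustration):

import pandas as pd

# Two hypothetical tables sharing a 'turtle_id' key column
sightings = pd.DataFrame({'turtle_id': [1, 2, 3],
                          'location': ['pond', 'river', 'pond']})
details = pd.DataFrame({'turtle_id': [1, 2],
                        'species': ['box', 'snapping']})

# merge() joins on the shared column; how='left' keeps every sighting
# and fills NaN where no matching detail row exists
combined = sightings.merge(details, on='turtle_id', how='left')

# concat() stacks DataFrames with the same columns on top of one another
more = pd.DataFrame({'turtle_id': [4], 'location': ['lake']})
all_sightings = pd.concat([sightings, more], ignore_index=True)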

Grouping and filtering the data into categories:

Useful methods for grouping and filtering:

df.groupby() groups rows by the values in one or more columns so you can aggregate within each group.

df.sort_values() sorts rows by the values in one or more columns.

df[condition] filters rows using a boolean condition.

df.iloc[] selects rows and columns by integer position.

df.loc[] selects rows and columns by label or boolean condition.

Filtering

Sorting

Grouping
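
Continuing with the hypothetical turtles DataFrame from the previous section, a sketch of filtering, sorting, and grouping might look like this:

# Boolean filtering: keep only rows where length_cm exceeds 13
large = turtles[turtles['length_cm'] > 13]

# Sorting: order rows by length, largest first
by_size = turtles.sort_values('length_cm', ascending=False)

# Grouping: mean length per species
mean_length = turtles.groupby('species')['length_cm'].mean()

# loc selects by label/condition, iloc by integer position
first_row = turtles.iloc[0]
box_turtles = turtles.loc[turtles['species'] == 'box']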

Cleaning data to deal with mismatched data, missing values and outliers:

Useful methods for cleaning data:

df.duplicated() flags rows that duplicate earlier rows.

df.drop_duplicates() removes duplicate rows.

pd.to_datetime() parses strings or numbers into datetime values.

df.isnull()/df.isna() flag missing values.

df.notnull()/df.notna() flag non-missing values.

df.fillna() fills missing values with a specified value or method.

df.replace() substitutes specified values with others.

df.dropna() drops rows (or columns) that contain missing values.

df.astype() casts a column to a different data type.

Datetime Information

More Information
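
A minimal cleaning sketch, using a hypothetical messy table (the data and column names are illustrative assumptions):

import numpy as np
import pandas as pd

# A made-up table with a duplicate row, missing values, and string dates
raw = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-01-02', None],
    'count': [5, np.nan, np.nan, 7],
})

clean = raw.drop_duplicates()           # remove exact duplicate rows
print(clean.isna().sum())               # missing values per column
clean = clean.dropna(subset=['date'])   # drop rows with no date
clean['count'] = clean['count'].fillna(0).astype(int)  # fill gaps, fix dtype
clean['date'] = pd.to_datetime(clean['date'])          # parse strings into datetimes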

3: What's Next!

After your data has been cleaned and you have a basic understanding of what it looks like, the next step is to visualize it. You may want to return to add new columns, group items, and fix formatting based on what you learn from the visualizations; data analysis is an iterative process, and this part is all about exploring what stories your data might hold. Please see the data visualization tutorial if you want help getting started! After visualization, you might continue the analysis with statistical tests, regression, or machine learning, all of which can be done with specialized statistics libraries. Look in the tutorials and guides section for more information.

Resources

Numpy

NumPy is the library that underlies most Python data tools. It operates at a lower level than Pandas and provides optimized mathematical operations for working with large arrays. It is especially useful for linear algebra operations like matrix multiplication, which are ubiquitous in machine learning and deep learning. Pandas is built on NumPy, and many of its data structures and operations behave the way they do because of NumPy's code and philosophy. For a deeper understanding of how to manipulate data, a working knowledge of NumPy is very powerful.
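
For instance, a minimal sketch of the kind of array math NumPy enables (the numbers are illustrative):

import numpy as np

# Elementwise math over a whole array at once, no loop required
lengths_cm = np.array([14.2, 13.8, 35.1, 12.5])
lengths_in = lengths_cm / 2.54

# Matrix multiplication via the @ operator
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0],
              [6.0]])
print(A @ B)  # 2x1 matrix product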

Pandas

Pandas is the workhorse of Python data analysis. Its DataFrame data structure makes a huge variety of tools available. In addition, Pandas is supported by a wide range of specialized data analysis and machine learning packages, which makes it a valuable core competency.

  • Official Pandas Tutorial An up-to-date, well-maintained tutorial focused on getting you up to speed quickly
  • Daniel Chen Pandas Tutorial A good in-depth video walkthrough showing a full data analysis with explanations
  • Brandon Rhodes Pandas Tutorial Considered by many the definitive intro to pandas. Be aware that pandas has changed slightly since this was filmed, so you may need to search online if the code examples don't work exactly as shown.

Jupyter Notebooks

Jupyter Notebooks are a useful interface for doing data analysis. Most Pandas practitioners use Jupyter at least a little: the two tools are very well integrated, and notebooks make results presentable and live coding a much cleaner exercise.

Specialized Statistics Libraries

Once you have your data organized there are a number of options for doing data processing, drawing statistical conclusions, or building machine learning models. Explaining the inner workings and theory of these packages is beyond the scope of this tutorial, but if you want to investigate, they are very powerful and useful tools. In some cases they are handy even for basic tasks, like finding outliers with a statistics-guided approach (see the sketch after this list).

  • scikit-learn The standard for general machine learning and model-testing tasks in Python.
  • statsmodels Includes a range of specialized and basic statistical techniques, with more comprehensive, human-readable output than scikit-learn. Useful for frequentist statistics tasks.
  • SciPy Useful for performing optimized numeric operations; its stats module also provides classical statistical tools.
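
As one example of a statistics-guided approach to the outlier task mentioned above, here is a minimal sketch using SciPy's z-scores (the data and the cutoff of 3 are illustrative assumptions, not universal rules):

import numpy as np
from scipy import stats

# Ten ordinary measurements plus one obvious outlier (made-up data)
values = np.array([10.0] * 10 + [45.0])

# z-scores measure each value's distance from the mean in standard deviations
z = stats.zscore(values)

# Flag values more than 3 standard deviations out; here, only 45.0 qualifies
outliers = values[np.abs(z) > 3]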

Issues used in the creation of this page

#143

Contributors

Ryan Swan
Zakary Lang
