Things in Python, Pandas, GeoPandas and Jupyter that I've had to look up or weren't obvious in the documentation.
TODO: Break quick inline examples into individual notebooks.
- Basic loops and related concepts: `for` and `while` loops, looping over a range of numbers, using `enumerate()` to access both the item and the index inside a loop, unpacking tuples and lists, and exiting early from a loop.
- Converting multiple columns to categories
- Creating DataFrames from lists and dictionaries
- Parsing date values
- Querying a SQL database in a Jupyter notebook
- Loading a DataFrame into a temporary table of a SQL database
- Read shapefile from zip
- Create spatial data (a GeoPandas GeoDataFrame) from a CSV with coordinates: From CSV to GeoDataFrame in two lines
- Download a file with a progress bar: This also shows how to only download a file if it hasn't already been downloaded.
- Writing a DataFrame to an Excel workbook: Writing multiple `DataFrame`s to different worksheets, sizing columns to fit data and setting wrapping on cells.
- Loading the results of a Spatialite spatial query
I usually first pass the addresses through the Census Geocoder Batch API and then geocode the addresses that it can't handle using Geocodio.
Helpful packages:
I put a sample script in `scripts/geocode_data.py`. Note that I haven't tried to run this script. It's a quick-and-dirty combination of code from a project, but it should give you an idea of how to do this.
- Count null values in any column
- Calculate percentages of a group
- Length of a DataFrame
- Check if column exists
- Show all columns
- Format values in a DataFrame, including how to specify different formats for different columns and rows.
- List columns
- Find rows with null values in any column
- Find all cells with zero values
- Select all the contents of a MultiIndex level
- Select a single item by index
- Get consecutive pairs from an iterable
- Select a set of items by a list of index values
```python
s = s[s != 1]
```
or, in a more chainable way:
```python
s.where(lambda x: x != 1)
```
You can specify a function inside `[]`:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
df[lambda x: x['b'] > x['c']]
```
Use `pandas.Series.isin`:
```python
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df[df['A'].isin([3, 6])]
```
via Use a list of values to select rows from a pandas dataframe [duplicate].
Use the `~` operator. See Boolean indexing in the docs.
via How can I obtain the element-wise logical NOT of a pandas Series?
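For example, a minimal sketch (the Series here is just illustrative):
```python
import pandas as pd

s = pd.Series([5, 6, 3, 4])
mask = s.isin([3, 6])
# ~ inverts the boolean mask element-wise,
# selecting the values that are NOT 3 or 6
s[~mask]
```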
Use the Vectorized string functions for Series. For example, using `len`:
```python
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': ['1234', '12345', '123', '12']})

In [3]: df
Out[3]:
       a
0   1234
1  12345
2    123
3     12

In [4]: df[df['a'].str.len() > 3]
Out[4]:
       a
0   1234
1  12345
```
- Combine records when information is split across rows
- Split a DataFrame into chunks
- Add row numbers to a DataFrame
- GeoPandas and Shapely binary predicates: Understand the meaning of the different operations used when doing spatial joins in GeoPandas.
- Get the border of a polygon feature as a line feature
- Break a multi-polygon in one row of a GeoDataFrame into multiple rows for each polygon
- Pivot values in rows into separate columns: Get one row per unique ID by putting values from multiple rows into multiple columns like `col`, `col_2`, `col_3`, ...
This StackOverflow answer offers a good rundown of options.
Overview of Pandas Data Types has a rundown of the different data types, and their relationship to Numpy types and Python types.
Use `pandas.DataFrame.replace()`.
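For example, a minimal sketch (the column and mapping are illustrative):
```python
import pandas as pd

df = pd.DataFrame({'status': ['Y', 'N', 'Y']})
# replace() maps old values to new ones without a loop
df['status'] = df['status'].replace({'Y': 'Yes', 'N': 'No'})
```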
Use `Series.rank`.
via Ranking Rows Of Pandas Dataframes
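A quick sketch (the values are made up):
```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0, 2.0])
# rank() returns 1-based ranks; ties get the average rank by default
s.rank()
# method='dense' and ascending=False change the tie and sort behavior
s.rank(ascending=False, method='dense')
```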
Use `pandas.DataFrame.sum`.
It seems easy to just use the `+` operator to add values across columns:
```python
In [1]: import pandas as pd

In [2]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ...:      'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [3]: df = pd.DataFrame(d)

In [4]: df
Out[4]:
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

In [5]: df['sum'] = df['one'] + df['two']

In [6]: df
Out[6]:
   one  two  sum
a  1.0  1.0  2.0
b  2.0  2.0  4.0
c  3.0  3.0  6.0
d  NaN  4.0  NaN
```
But this gives us `NaN` whenever one of the values is `NaN`. In many, but not all, cases we want to treat `NaN` as `0`.
To get around this, we can use `pandas.DataFrame.sum` on a slice of the columns we want to sum:
```python
In [7]: df['sum'] = df[['one', 'two']].sum(axis=1)

In [8]: df
Out[8]:
   one  two  sum
a  1.0  1.0  2.0
b  2.0  2.0  4.0
c  3.0  3.0  6.0
d  NaN  4.0  4.0
```
Note that the `axis=1` argument makes `sum` operate across columns. Also, if you don't want to skip `NaN` values, you can specify `skipna=False`.
h/t Pandas sum two columns, skipping NaN
```python
# Flatten MultiIndex columns into single strings like 'a|b'
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
```
via python - Pandas dataframe with multiindex column - merge levels - Stack Overflow
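A minimal self-contained sketch of where this comes up (the data and column names are illustrative):
```python
import pandas as pd

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})
# Aggregating with multiple functions produces MultiIndex columns
grouped = df.groupby('g').agg({'v': ['sum', 'mean']})
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
# grouped.columns is now ['v|sum', 'v|mean']
```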
```python
# Bin a numeric column into quartiles labeled 1-4
removals_inactive_migration['removed_inactive_rate_q'] = pd.qcut(
    removals_inactive_migration['removed_inactive_rate'],
    4,
    labels=[1, 2, 3, 4]
)
```
Use `DataFrame.assign()`.
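For example, a minimal sketch (the columns are illustrative):
```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]})
# assign() returns a new DataFrame with the added column,
# which makes it easy to use in a method chain
df = df.assign(total=lambda x: x['one'] + x['two'])
```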
Dealing with the "Geometry is in a geographic CRS. Results from 'area' are likely incorrect." warning in GeoPandas
When you're trying to find the area of a geometry column of a GeoPandas `GeoDataFrame`, you may encounter this warning:
```
Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
```
The message tells you what you need to do, but doesn't explain the issue.
Map Projection and Coordinate Reference Systems has an explanation of the difference between geographic coordinate reference systems and projected ones, as well as Python code, using GeoPandas, that shows how to reproject. Moreover, it offers some suggestions for which projected CRS to use. tl;dr: if you're working with spatial data that spans the US, Albers Equal Area Conic (EPSG = 5070 or 102003) is a good choice.
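A minimal sketch of the fix (the file name and column name are illustrative):
```python
import geopandas as gpd

gdf = gpd.read_file('counties.shp')  # hypothetical file in a geographic CRS
# Re-project to an equal-area CRS before computing areas;
# EPSG:5070 (Albers Equal Area Conic) works well for data spanning the US
gdf_projected = gdf.to_crs(epsg=5070)
gdf_projected['area_sq_m'] = gdf_projected.geometry.area  # square meters
```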
Address data can be messy in a way that makes parsing addresses yourself difficult. I recommend using the usaddress package, which uses a statistical model to try to parse components from addresses.
If you don't want to use the Python library, they have a web interface that lets you parse sets of 500 addresses.
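A quick sketch of what parsing with usaddress looks like (the address is made up):
```python
import usaddress

# tag() returns an (OrderedDict, address_type) tuple mapping each
# address component to a label like 'AddressNumber' or 'StreetName'
tagged, address_type = usaddress.tag('123 Main St Suite 100 Chicago IL 60601')
```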
Regular expressions are powerful tools for searching and cleaning text data, but they can have a bit of a learning curve.
The documentation for the standard library `re` module isn't the worst place to start.
I find Pythex to be a really useful sandbox for testing out regular expressions. I find it often offers a faster feedback loop than testing it in the context of my programs. Once I have it working, I just copy the regex back into my code.
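For example, a small sketch of a cleaning task (the pattern and value are illustrative):
```python
import re

# Strip everything that isn't a digit from a messy phone number
re.sub(r'\D', '', '(312) 555-1212')  # returns '3125551212'
```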
- Coloring a choropleth based on an explicit color value in a GeoDataFrame
- Showing a layer of points over a layer of polygons in Altair (Altair Tutorial)
- Quickly plotting multiple GeoPandas geometries
Use the Python Markdown extension, which is part of the Unofficial Jupyter Notebook Extensions.
```python
%config Application.log_level="INFO"
```
TODO: Check if this actually works.
via Get Output From the logging Module in IPython Notebook
Use the autoreload extension.
```python
%load_ext autoreload
%autoreload 2

from foo import some_function

some_function()

# open foo.py in an editor and change some_function to return 43

some_function()
```
The best way of doing this depends on what you're trying to do.
For passing tabular data, I like to write Pandas `DataFrame`s to Feather files because they can be read by R programs/notebooks, are faster to read in than CSV or Pickle files, and preserve data types. I store these in a conventional file structure. That way, one notebook can read in the `DataFrame` written by another notebook. Sometimes I'll reference the file paths with a variable in a project-level settings module, so if the file name/organization changes, I don't have to update my code in a bunch of places.
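For example, a minimal sketch (the path is illustrative; Feather support relies on the `pyarrow` package):
```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'value': [10.5, 20.1]})

# In the producing notebook: write to a conventional path
df.to_feather('data/processed/cleaned_records.feather')

# In the consuming notebook: read it back with dtypes preserved
df = pd.read_feather('data/processed/cleaned_records.feather')
```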
For individual variables, there is the store magic.
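A quick sketch of the store magic:
```python
# In one notebook:
results = {'count': 42}
%store results

# In another notebook (or after restarting the kernel):
%store -r results
```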
For code and (possibly) values, there are (somewhat complicated) methods for importing notebooks as modules.
Joe Germuska suggests using the `%run` magic for executing a Python script (or, apparently, another notebook) from within a notebook. The interactive namespace gets updated with all variables defined in the program. Joe says this is useful for bootstrapping notebooks that share common initialization.
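For example (the script name is hypothetical):
```python
# Everything shared_setup.py defines (imports, helpers, config)
# lands in this notebook's interactive namespace
%run shared_setup.py
```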
TODO: Experiment with store magic and importing notebooks as modules and perhaps make some example notebooks.
Use `IPython.display.display()` and `pandas.option_context()`.
You can also set the Pandas options globally, but I've almost always wanted to do this on a case-by-case basis, which is what `pandas.option_context()` lets you do.
```python
from IPython.display import display
import pandas as pd

with pd.option_context("display.max_rows", None):
    display(long_df)
```
My Python Development Environment, 2020 Edition is a good guide. I prefer to use Pipenv instead of Poetry because `pipenv run` automatically sources `.env` files.
Python Wrangler is a GUI application for installing Python in a way that's similar to my setup. Even if you don't use it to install Python, it's useful for seeing all the different versions of Python that end up on your system.
I've never found much extra utility in using Anaconda, but it does offer one big benefit: it installs useful packages for working with data, like Pandas and Jupyter, all in one installation process, instead of installing them one at a time with pip or Pipenv.
```sh
pipenv install --python=/path/to/your/python
```
```sh
python -m pdb -c continue myscript.py arg1 arg2
```
Source: Starting python debugger automatically on error
This makes it possible to have a single Jupyter Lab installation with separate kernels for the virtualenvs for each project.
```sh
pipenv run python -m ipykernel install --user --name=your-project-slug
```
Source: associatedpress/cookiecutter-python-project
In many projects, I use Black to format my source code and pylint to check my code for both formatting issues and things that could cause errors or make my code more difficult to read and update.
There are ways to integrate these tools into your editor, such as vim or Visual Studio Code. However, I've found a good universal way to approach this is to install them as git hooks that run when trying to commit code. To do this, I use the pre-commit framework.
I've put an example `.pre-commit-config.yml` file that runs Black and pylint in this repo.
To disable a rule for a single line, add a comment like `# pylint: disable=wildcard-import, method-hidden` to the end of the line.
Source: How to disable a particular message?
```sh
pipenv --venv
```
This is important if you're running pylint on a particular project using VS Code integration and want to make sure it detects the packages installed in your project virtual environment.
To select a specific environment, use the Python: Select Interpreter command from the Command Palette (⇧⌘P on Mac). If the environment you want isn't listed, you can enter a full path to the Python executable.
Source: Using Python Environments in Visual Studio Code
Use `nbconvert` to execute the notebook:
```sh
jupyter nbconvert --to notebook --execute mynotebook.ipynb
```
I haven't run that particular version of the command in a while, but I think it either overwrites the input notebook or creates a similarly named copy. Since I usually want an HTML version, similar to how RStudio automatically creates an HTML version of RMarkdown notebooks, I usually use a somewhat different set of parameters:
```sh
jupyter nbconvert \
  --to html \
  --output notebook.nb.html \
  --execute \
  notebook.ipynb
```
Source: Executing notebooks from the command line
These are the data sources used as example data in this project.
This will be for examples of working with GeoPandas and Spatialite.
- Table B02001, American Community Survey, 5-year estimates, 2019 for Chicago Tracts
  - Agency: U.S. Census Bureau
  - URL: https://censusreporter.org/data/table/?table=B02001&geo_ids=16000US1714000,140|16000US1714000&primary_geo_id=16000US1714000
- Chicago Neighborhood Boundaries
  - Agency: City of Chicago
  - URL: https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Neighborhoods/bbvz-uum9
- County boundaries, 2021
  - Agency: U.S. Census Bureau
  - URL: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
- American Indian/Alaska Native/Native Hawaiian Area boundaries, 2021
  - Agency: U.S. Census Bureau
  - URL: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
- Census Places, 2021
  - Agency: U.S. Census Bureau
  - URL: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
I've used the Jupytext Jupyter Lab extension to create "paired" .Rmd versions of the notebooks in this repository.
This was mostly because the git diffs for the Rmd files more clearly show the changes between versions. There may be other, potentially preferable ways to do this in 2021. I also just wanted to experiment with Jupytext.
You'll need to install this extension in the same environment as your Jupyter Lab.
For me, since I installed Jupyter Lab using pipx, this looks like:
```sh
pipx inject jupyterlab jupytext
```