fix: clarify pandas usage with non-numeric columns #674

Merged (1 commit), Mar 5, 2024
fix: clarify pandas usage with non-numeric columns
pandas now raises errors when computing the mean on a dataframe with
non-numeric columns

add a note to the challenge describing the issue and provide a few
additional ways of solving it

Thanks to @davidwilby for finding this bug and outlining possible
solutions to it in #670

Co-authored-by: David Wilby <[email protected]>
Co-authored-by: Olav Vahtras <[email protected]>
3 people committed Mar 5, 2024
commit cff2467f0f82770aa19178bdba0e03403197c621
20 changes: 16 additions & 4 deletions episodes/14-looping-data-sets.md
@@ -180,7 +180,10 @@ What other special strings does the [`float` function][float-function] recognize

Write a program that reads in the regional data sets
and plots the average GDP per capita for each region over time
-in a single chart.
+in a single chart. Pandas will raise an error if it encounters
+non-numeric columns in a dataframe computation so you may need
+to either filter out those columns or tell pandas to ignore them.


::::::::::::::: solution

@@ -200,8 +203,17 @@ for filename in glob.glob('data/gapminder_gdp*.csv'):
     # we will split the string using the split method and `_` as our separator,
     # retrieve the last string in the list that split returns (`<region>.csv`),
     # and then remove the `.csv` extension from that string.
+    # NOTE: the pathlib module covered in the next callout also offers
+    # convenient abstractions for working with filesystem paths and could solve this as well:
+    # from pathlib import Path
+    # region = Path(filename).stem.split('_')[-1]
     region = filename.split('_')[-1][:-4]
-    dataframe.mean().plot(ax=ax, label=region)
+    # pandas raises errors when it encounters non-numeric columns in a dataframe computation
+    # but we can tell pandas to ignore them with the `numeric_only` parameter
+    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
+    # NOTE: another way of doing this selects just the columns with gdp in their name using the filter method
+    # dataframe.filter(like="gdp").mean().plot(ax=ax, label=region)
+
 plt.legend()
 plt.show()
```
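
Reviewer note: the two approaches named in the new comments (`numeric_only=True` and `filter(like="gdp")`) can be compared on a small frame. This is a minimal sketch, not part of the commit; the column names and values below are made up for illustration and only assume that the regional files contain one text column alongside the GDP columns.

```python
import pandas as pd

# Hypothetical stand-in for one regional file; names and values are illustrative only.
dataframe = pd.DataFrame(
    {
        "country": ["A", "B"],              # non-numeric column that breaks a plain mean()
        "gdpPercap_1952": [100.0, 300.0],
        "gdpPercap_1957": [200.0, 400.0],
    }
)

# Option 1: ask pandas to skip non-numeric columns.
by_flag = dataframe.mean(numeric_only=True)

# Option 2: keep only the columns whose label contains "gdp", then average.
by_filter = dataframe.filter(like="gdp").mean()

# Both options produce one mean per GDP column.
print(by_flag)
print(by_filter)
```

The two results agree here because every numeric column happens to contain "gdp" in its name; with additional numeric columns, only the `filter` version would restrict the mean to GDP data.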
@@ -231,8 +243,8 @@ gapminder_gdp_africa
.csv
```

-**Hint:** It is possible to check all available attributes and methods on the `Path` object with the `dir()`
-function!
+**Hint:** Check all available attributes and methods on the `Path` object with the `dir()`
+function.


::::::::::::::::::::::::::::::::::::::::::::::::::
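
Reviewer note: the `pathlib` alternative mentioned in the solution comment and the `dir()` hint can be tried together. This is a minimal sketch, not part of the commit; the example path is assumed to follow the `data/gapminder_gdp_<region>.csv` pattern used in the episode.

```python
from pathlib import Path

# Example path assumed to follow the lesson's naming pattern.
filename = "data/gapminder_gdp_africa.csv"
path = Path(filename)

# pathlib version: take the name without its extension, then the part after the last "_".
region_from_path = path.stem.split("_")[-1]

# String-slicing version from the existing solution, for comparison.
region_from_slice = filename.split("_")[-1][:-4]

# dir() lists every attribute and method; drop the underscore-prefixed ones for readability.
public_names = [name for name in dir(path) if not name.startswith("_")]

print(region_from_path, region_from_slice)
print(public_names)
```

The `stem` version does not depend on the extension being exactly four characters long, which the `[:-4]` slice silently assumes.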