From cff2467f0f82770aa19178bdba0e03403197c621 Mon Sep 17 00:00:00 2001
From: Allen Lee <alee@users.noreply.github.com>
Date: Mon, 4 Mar 2024 17:38:25 -0700
Subject: [PATCH] fix: clarify pandas usage with non-numeric columns

since version 2.0, pandas raises an error when computing the mean of a
dataframe with non-numeric columns unless told to skip them

add a note to the challenge describing the issue and provide a few
additional ways of solving it

Thanks to @davidwilby for finding this bug and outlining possible
solutions to it in #670

Co-authored-by: David Wilby <davidwilby@users.noreply.github.com>
Co-authored-by: Olav Vahtras <vahtras@users.noreply.github.com>
---
 episodes/14-looping-data-sets.md | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
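
The two approaches this patch adds to the solution can be sketched on a
small stand-in table. The country names and GDP values below are made up
for illustration and are not taken from the gapminder files; only
`mean(numeric_only=True)` and `filter(like="gdp")` come from the patch.

```python
import pandas as pd

# Hypothetical miniature of a regional table: one text column, two GDP columns.
dataframe = pd.DataFrame({
    "country": ["A", "B"],
    "gdpPercap_1952": [100.0, 300.0],
    "gdpPercap_1957": [200.0, 400.0],
})

# Option 1: ask pandas to skip the non-numeric `country` column.
means_numeric_only = dataframe.mean(numeric_only=True)

# Option 2: keep only the columns whose name contains "gdp", then average.
means_filtered = dataframe.filter(like="gdp").mean()

# Both options yield the same per-year averages for this table.
print(means_numeric_only.equals(means_filtered))
```

Either result can be passed to `.plot(ax=ax, label=region)` as in the
solution. The `filter` variant relies on the column naming convention, so
it would also drop numeric columns that lack `gdp` in their name.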
diff --git a/episodes/14-looping-data-sets.md b/episodes/14-looping-data-sets.md
index ec19be367..6e95b24c3 100644
--- a/episodes/14-looping-data-sets.md
+++ b/episodes/14-looping-data-sets.md
@@ -180,7 +180,10 @@ What other special strings does the [`float` function][float-function] recognize
 
 Write a program that reads in the regional data sets
 and plots the average GDP per capita for each region over time
-in a single chart.
+in a single chart. Pandas will raise an error if it encounters
+non-numeric columns in a dataframe computation, so you may need
+to either filter out those columns or tell pandas to ignore them.
+
 
 :::::::::::::::  solution
 
@@ -200,8 +203,17 @@ for filename in glob.glob('data/gapminder_gdp*.csv'):
     # we will split the string using the split method and `_` as our separator,
     # retrieve the last string in the list that split returns (`<region>.csv`), 
     # and then remove the `.csv` extension from that string.
+    # NOTE: the pathlib module covered in the next callout also offers
+    # convenient abstractions for working with filesystem paths and could solve this as well:
+    # from pathlib import Path
+    # region = Path(filename).stem.split('_')[-1]
     region = filename.split('_')[-1][:-4] 
-    dataframe.mean().plot(ax=ax, label=region)
+    # pandas raises an error when it encounters non-numeric columns in a dataframe computation,
+    # but we can tell pandas to ignore them with the `numeric_only` parameter
+    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
+    # NOTE: another way is to select just the columns with `gdp` in their name using the `filter` method:
+    # dataframe.filter(like="gdp").mean().plot(ax=ax, label=region)
+
 plt.legend()
 plt.show()
 ```
@@ -231,8 +243,8 @@ gapminder_gdp_africa
 .csv
 ```
 
-**Hint:** It is possible to check all available attributes and methods on the `Path` object with the `dir()`
-function!
+**Hint:** Check all available attributes and methods on the `Path` object with the `dir()`
+function.
 
 
 ::::::::::::::::::::::::::::::::::::::::::::::::::