Fix handling of comma in suffix for player names #254

TheMathNinja · 2024-09-04T04:44:10Z

clean_player_names() was converting "Dante Fowler, Jr." to "Jr Dante Fowler". (This is how FFToday writes his name; apparently they use the more traditional format with a comma before the suffix.)

I believe this change should fix it. I'm sorry I couldn't test it locally. I was having corruption issues.

mrcaseb · 2024-09-04T12:11:22Z

What is your desired output of the following call?

nflreadr::clean_player_names("Dante Fowler, Jr.")

mrcaseb · 2024-09-04T13:24:58Z

broken Mac tests fixed in #255

tanho63 · 2024-09-04T13:27:35Z

R/utils_name_cleaning.R

@@ -95,7 +95,7 @@ clean_player_names <- function(player_name,
  if(isTRUE(convert_lastfirst)) n <- gsub(pattern = "^(.+), (.+)$", replacement = "\\2 \\1", x = n)

  # suffix removal
-  n <- gsub(pattern = " Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,",
+  n <- gsub(pattern = "(,? Jr\\.?$|,? Sr\\.?$|,? III$|,? II$|,? IV$|,? V$|'|\\.|,)",


This will not have the desired effect, because the convert_lastfirst handling happens before this, in L95. That earlier step is what drives the issue you're identifying (automatically converting "Dante Fowler, Jr." to "Jr Dante Fowler").
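To make the ordering concrete, here is a minimal base-R sketch that replays only the two gsub() calls visible in the diff above. The helper names flip_lastfirst and strip_suffix are hypothetical stand-ins for illustration, and the real clean_player_names() may perform other steps that are not reproduced here.

```r
# Stand-ins for the two steps shown in the diff; NOT the full nflreadr function.
# Helper names are hypothetical and exist only for this illustration.
flip_lastfirst <- function(n) {
  # "Last, First" -> "First Last": everything after the last ", " moves to the front
  gsub(pattern = "^(.+), (.+)$", replacement = "\\2 \\1", x = n)
}

strip_suffix <- function(n) {
  # Original (pre-PR) suffix pattern: suffixes are only matched at the end of the string
  gsub(pattern = " Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,", replacement = "", x = n)
}

n <- "Dante Fowler, Jr."

# Step 1: the comma before the suffix is read as a last-first separator
flipped <- flip_lastfirst(n)

# Step 2: "Jr." is no longer at the end, so only the period is stripped
cleaned <- strip_suffix(flipped)

print(c(flipped = flipped, cleaned = cleaned))
```

Because the flip has already moved "Jr." to the front, no change to the suffix pattern alone can catch it at the end of the string.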

TheMathNinja · 2024-09-04

Oh crap. I wrongly assumed convert_lastfirst happened later on.

I want clean_player_names("Dante Fowler, Jr.") to return the same thing as clean_player_names("Dante Fowler Jr.").

The most intuitive solution to me, given this rare but real naming convention, would be to clean suffixes first (via something like the code I proposed) and then run convert_lastfirst after any "suffix commas" have been removed.
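A rough sketch of that suffix-first ordering in base R, under stated assumptions: clean_suffix_first and strip_suffix are hypothetical names, the suffix pattern is a condensed form of the one proposed in the diff, and the suffix step is repeated after the flip so that inputs like "Fowler, Jr., Dante" are still covered. This is not the nflreadr implementation.

```r
# Hypothetical helper: remove a trailing suffix, with or without a preceding comma
strip_suffix <- function(n) {
  gsub(pattern = ",? (Jr\\.?|Sr\\.?|III|II|IV|V)$", replacement = "", x = n)
}

# Hypothetical suffix-first cleaner (a sketch, not the package's function)
clean_suffix_first <- function(n) {
  n <- strip_suffix(n)                          # 1. drop "suffix commas" first
  n <- gsub("^(.+), (.+)$", "\\2 \\1", n)       # 2. then flip "Last, First"
  n <- strip_suffix(n)                          # 3. catch suffixes exposed by the flip
  gsub("'|\\.|,", "", n)                        # 4. remove leftover punctuation
}

examples <- c(
  "Dante Fowler, Jr.",
  "Dante Fowler Jr.",
  "Fowler, Dante",
  "Fowler, Jr., Dante"
)

result <- clean_suffix_first(examples)
print(result)
```

One caveat with this ordering: a "Last, First" name whose first name happens to equal a suffix token (for example a bare "V") would be treated as a suffix and dropped in step 1.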

tanho63 · 2024-09-04T13:36:26Z

This seems already sufficiently handled by the user setting convert_lastfirst to FALSE:

nflreadr::clean_player_names("Dante Fowler, Jr.", convert_lastfirst = FALSE)
#> [1] "Dante Fowler"

I am not particularly convinced that we need to make the regex change in question.

Even in the rare case of some awkward usage like:

"Fowler, Jr., Dante" |> nflreadr::clean_player_names()
#> [1] "Dante Fowler"

it is still handled correctly: the last-first flip runs first, and the suffix cleanup then runs on the result.

TheMathNinja · 2024-09-04T13:46:59Z

Ok so if I understand correctly, you don’t want this function to automatically handle the use case I found myself in (cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t) because it’s too cumbersome on the code, but in such cases the user should run local tests to make sure that this issue isn’t happening in their data set? Just wanting to make sure I’m tracking here.

tanho63 · 2024-09-04T13:49:31Z

> cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t

Is there a data source that provides a mix of names, some needing last-first conversion and some not? Data cleaning is an interactive/iterative process that should be done on the raw sources themselves imo, so the user can decide which specific cleaning to apply for each source.

TheMathNinja · 2024-09-04T13:54:19Z

> > cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t
>
> Is there a data source that provides a mix of names, some needing last-first conversion and some not? Data cleaning is an interactive/iterative process that should be done on the raw sources themselves imo, so the user can decide which specific cleaning to apply for each source.

It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.
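For a combined data frame like this, per-source cleaning could look roughly like the sketch below. The column names data_src and player, the sample rows, and the stand-in clean_name() (which replays only the two gsub() calls from the diff) are all assumptions for illustration; in practice the call would be nflreadr::clean_player_names() with convert_lastfirst set per source.

```r
# Stand-in for the cleaning steps shown in the diff; NOT the full nflreadr function
clean_name <- function(n, convert_lastfirst = TRUE) {
  if (isTRUE(convert_lastfirst)) n <- gsub("^(.+), (.+)$", "\\2 \\1", n)
  gsub(" Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,", "", n)
}

# Made-up rows; column names are assumptions about the scraped data frame
projections <- data.frame(
  data_src = c("FantasySharks", "FFToday", "FFToday"),
  player   = c("Fowler, Dante", "Dante Fowler, Jr.", "Patrick Mahomes II"),
  stringsAsFactors = FALSE
)

# Only the source known to use "Last, First" gets the flip
is_lastfirst <- projections$data_src == "FantasySharks"

projections$player_clean <- projections$player
projections$player_clean[is_lastfirst] <-
  clean_name(projections$player[is_lastfirst], convert_lastfirst = TRUE)
projections$player_clean[!is_lastfirst] <-
  clean_name(projections$player[!is_lastfirst], convert_lastfirst = FALSE)

print(projections)
```

The trade-off is that the caller has to know which sources use which convention, but no single regex has to guess whether a comma separates a last name or a suffix.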

TheMathNinja · 2024-09-04T13:57:11Z

Also, I don't know if this is totally dumb from a code standpoint, but could one solution be that clean_player_names(), when called with convert_lastfirst = TRUE, first runs the convert_lastfirst = FALSE code and then runs the convert_lastfirst = TRUE code? It seems like that sequential "double-run" would fix my Dante Fowler issue.
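A quick way to sanity-check the double-run idea is to replay it with a stand-in built from the two gsub() calls in the diff. The names clean_name and double_run are hypothetical, and the real function may behave differently if it has steps not shown in this thread.

```r
# Stand-in for the cleaning steps shown in the diff; NOT the full nflreadr function
clean_name <- function(n, convert_lastfirst = TRUE) {
  if (isTRUE(convert_lastfirst)) n <- gsub("^(.+), (.+)$", "\\2 \\1", n)
  gsub(" Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,", "", n)
}

# Hypothetical "double-run": first without the flip, then with it
double_run <- function(n) {
  first_pass <- clean_name(n, convert_lastfirst = FALSE)
  clean_name(first_pass, convert_lastfirst = TRUE)
}

suffix_comma <- double_run("Dante Fowler, Jr.")
last_first   <- double_run("Fowler, Dante")

print(c(suffix_comma = suffix_comma, last_first = last_first))
```

Under these assumptions the suffix-comma case comes out right, but the first pass strips every comma, so a genuine "Last, First" name can no longer be flipped by the second pass.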

tanho63 · 2024-09-04T14:03:57Z

> It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.

I would probably make a PR to the ffanalytics pkg to clean the names at the time of scraping in this scenario. It seems actively maintained enough that it would be accepted.

TheMathNinja · 2024-09-04T14:23:07Z

> > It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.
>
> I would probably make a PR to the ffanalytics pkg to clean the names at the time of scraping in this scenario. It seems actively maintained enough that it would be accepted.

Does this mean you DON'T want to create a new input option called squeaky_clean = TRUE that runs a double-clean of convert_lastfirst = FALSE + convert_lastfirst = TRUE? Cuz I think that sounds fly. Lol.

Fix handling of comma in suffix for player names

f42f6e1

Merge branch 'main' into fix-comma-in-suffix

78777f2

tanho63 reviewed Sep 4, 2024


tanho63 closed this Sep 4, 2024
