Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of comma in suffix for player names #254

Closed
wants to merge 2 commits into from

Conversation

TheMathNinja
Copy link
Contributor

clean_player_names() was handling "Dante Fowler, Jr." -> "Jr Dante Fowler" (this was FFToday's handling of his name; apparently they use a more traditional format w/ a comma for suffix)

I believe this change should fix it. I'm sorry I couldn't test it locally. I was having corruption issues.

@mrcaseb
Copy link
Member

mrcaseb commented Sep 4, 2024

What is your desired output of

nflreadr::clean_player_names("Dante Fowler, Jr.")

@mrcaseb
Copy link
Member

mrcaseb commented Sep 4, 2024

broken Mac tests fixed in #255

@@ -95,7 +95,7 @@ clean_player_names <- function(player_name,
if(isTRUE(convert_lastfirst)) n <- gsub(pattern = "^(.+), (.+)$", replacement = "\\2 \\1", x = n)

# suffix removal
n <- gsub(pattern = " Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,",
n <- gsub(pattern = "(,? Jr\\.?$|,? Sr\\.?$|,? III$|,? II$|,? IV$|,? V$|'|\\.|,)",
Copy link
Member

@tanho63 tanho63 Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not have the desired effect because the convert_lastfirst handling happens before this in L95, which is what is driving the issue you're identifying (automatically converting "Dante Fowler, Jr." to "Jr Dante Fowler")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh crap. I wrongly assumed convert_lastfirst happened later on.

I want clean_player_names(Dante Fowler, Jr.) to return the same thing as clean_player_names(Dante Fowler Jr.).

The most intuitive solution to me, in light of this rare but utilized naming convention would be to handle and clean suffixes first (via something like the code I proposed) and then convert_lastfirst after any “suffix commas” have been removed.

@tanho63
Copy link
Member

tanho63 commented Sep 4, 2024

This seems already sufficiently handled by the user setting convert_lastfirst to FALSE:

nflreadr::clean_player_names("Dante Fowler, Jr.", convert_lastfirst = FALSE)
#> [1] "Dante Fowler"

and am not particularly convinced that we need to make the regex change in question.

Even in the rare case of some awkward usage like:

"Fowler, Jr., Dante" |> nflreadr::clean_player_names()
#> [1] "Dante Fowler"

it is still handled correctly (first flipping the first name around the last name and then cleaning it)

@tanho63 tanho63 closed this Sep 4, 2024
@TheMathNinja
Copy link
Contributor Author

Ok so if I understand correctly, you don’t want this function to automatically handle the use case I found myself in (cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t) because it’s too cumbersome on the code, but in such cases the user should run local tests to make sure that this issue isn’t happening in their data set? Just wanting to make sure I’m tracking here.

@tanho63
Copy link
Member

tanho63 commented Sep 4, 2024

cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t

is there a data source that provides a mix of names that need last-first conversion and some that don't? data cleaning is an interactive/iterative process that should be done on the raw sources themselves imo, so user deciding what specific cleaning needs applied can be adjusted by source

@TheMathNinja
Copy link
Contributor Author

cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t

is there a data source that provides a mix of names that need last-first conversion and some that don't? data cleaning is an interactive/iterative process that should be done on the raw sources themselves imo, so user deciding what specific cleaning needs applied can be adjusted by source

It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.

@TheMathNinja
Copy link
Contributor Author

TheMathNinja commented Sep 4, 2024

Also, I don't know if this is totally dumb from a code standpoint, but could one solution be that clean_player_names, when set to clean_lastfirst = TRUE be that it first runs the clean_lastfirst = FALSE code and then runs the clean_lastfirst = TRUE code? It seems like that sequential "double-run" would fix my Dante Fowler issue?

@tanho63
Copy link
Member

tanho63 commented Sep 4, 2024

It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.

I would probably make a PR to ffanalytics pkg to clean the names at time of scraping in this scenario, it seems actively maintained enough that it would be accepted

@TheMathNinja
Copy link
Contributor Author

TheMathNinja commented Sep 4, 2024

It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the ffanalytics package, which scrapes and compiles projections from multiple fantasy sources into a single df without cleaning/standardizing names. FantasySharks names are given in "Last, First" form and all other sources are "First Last" form. Ideally, I could run the ffanalytics scrape, then run a single clean on the names using nflreadr that works across conventions.

I would probably make a PR to ffanalytics pkg to clean the names at time of scraping in this scenario, it seems actively maintained enough that it would be accepted

Does this mean you DON'T want to create a new input option called squeaky_clean = TRUE that runs a double-clean of convert_lastfirst = FALSE + convert_lastfirst = TRUE? Cuz I think that sounds fly. Lol.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants