-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of comma in suffix for player names #254
Conversation
What is your desired output of nflreadr::clean_player_names("Dante Fowler, Jr.") |
broken Mac tests fixed in #255 |
@@ -95,7 +95,7 @@ clean_player_names <- function(player_name, | |||
if(isTRUE(convert_lastfirst)) n <- gsub(pattern = "^(.+), (.+)$", replacement = "\\2 \\1", x = n) | |||
|
|||
# suffix removal | |||
n <- gsub(pattern = " Jr\\.$| Sr\\.$| III$| II$| IV$| V$|'|\\.|,", | |||
n <- gsub(pattern = "(,? Jr\\.?$|,? Sr\\.?$|,? III$|,? II$|,? IV$|,? V$|'|\\.|,)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will not have the desired effect because the convert_lastfirst
handling happens before this in L95, which is what is driving the issue you're identifying (automatically converting "Dante Fowler, Jr." to "Jr Dante Fowler")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh crap. I wrongly assumed convert_lastfirst
happened later on.
I want clean_player_names(Dante Fowler, Jr.)
to return the same thing as clean_player_names(Dante Fowler Jr.)
.
The most intuitive solution to me, in light of this rare but utilized naming convention would be to handle and clean suffixes first (via something like the code I proposed) and then convert_lastfirst after any “suffix commas” have been removed.
This seems already sufficiently handled by the user setting nflreadr::clean_player_names("Dante Fowler, Jr.", convert_lastfirst = FALSE)
#> [1] "Dante Fowler" and am not particularly convinced that we need to make the regex change in question. Even in the rare case of some awkward usage like: "Fowler, Jr., Dante" |> nflreadr::clean_player_names()
#> [1] "Dante Fowler" it is still handled correctly (first flipping the first name around the last name and then cleaning it) |
Ok so if I understand correctly, you don’t want this function to automatically handle the use case I found myself in (cleaning a large data set of names comprised of some names that need first-last conversion and some that don’t) because it’s too cumbersome on the code, but in such cases the user should run local tests to make sure that this issue isn’t happening in their data set? Just wanting to make sure I’m tracking here. |
is there a data source that provides a mix of names that need last-first conversion and some that don't? data cleaning is an interactive/iterative process that should be done on the raw sources themselves imo, so user deciding what specific cleaning needs applied can be adjusted by source |
It depends on the meaning of "source" I suppose. My mixed df was coming to me via a standard scrape using the |
Also, I don't know if this is totally dumb from a code standpoint, but could one solution be that clean_player_names, when set to clean_lastfirst = TRUE be that it first runs the clean_lastfirst = FALSE code and then runs the clean_lastfirst = TRUE code? It seems like that sequential "double-run" would fix my Dante Fowler issue? |
I would probably make a PR to ffanalytics pkg to clean the names at time of scraping in this scenario, it seems actively maintained enough that it would be accepted |
Does this mean you DON'T want to create a new input option called |
clean_player_names() was handling "Dante Fowler, Jr." -> "Jr Dante Fowler" (this was FFToday's handling of his name; apparently they use a more traditional format w/ a comma for suffix)
I believe this change should fix it. I'm sorry I couldn't test it locally. I was having corruption issues.