Implement data_replicate() #488

Merged
merged 17 commits into from
Mar 23, 2024

Conversation

strengejacke
Member

@strengejacke strengejacke commented Mar 20, 2024

Today, I needed a function that repeats rows based on the values of another column (similar to tidyr's uncount()). Here's a quick implementation, what do you think?

library(datawizard)
data(mtcars)
d <- as.data.frame(head(mtcars))
data_expand(d, "carb")
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  22.8   4  108  93 3.85 2.320 18.61  1  1    4
#> 10 21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 11 18.7   8  360 175 3.15 3.440 17.02  0  0    3
#> 12 18.7   8  360 175 3.15 3.440 17.02  0  0    3
#> 13 18.1   6  225 105 2.76 3.460 20.22  1  0    3

d$mpg[5] <- NA
data_expand(d, "carb")
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  22.8   4  108  93 3.85 2.320 18.61  1  1    4
#> 10 21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 11   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 12   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 13 18.1   6  225 105 2.76 3.460 20.22  1  0    3

d$carb[3] <- NA
data_expand(d, "carb")
#> Error: The column provided in `expand` contains missing values, but `remove_na`
#>   is set to `FALSE`.
#>   Please set `remove_na` to `TRUE` or remove the missing values from the
#>   data frame.

data_expand(d, "carb", remove_na = TRUE)
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 10   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 11   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 12 18.1   6  225 105 2.76 3.460 20.22  1  0    3

Created on 2024-03-20 with reprex v2.1.0
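
For readers who want to see the idea without the package, the behaviour shown above can be approximated in base R. This is only a sketch: `replicate_rows()` and its error message are hypothetical and are not the code added in this PR. It assumes the count column holds non-negative whole numbers.

```r
# Hypothetical helper, NOT the datawizard implementation from this PR.
replicate_rows <- function(data, expand, remove_na = FALSE) {
  times <- data[[expand]]
  if (anyNA(times)) {
    if (!remove_na) {
      stop("Column `", expand, "` contains missing values.", call. = FALSE)
    }
    # Drop rows whose count is missing before replicating.
    data <- data[!is.na(times), , drop = FALSE]
    times <- data[[expand]]
  }
  # Repeat each row index as often as its count, then drop the count column.
  idx <- rep(seq_len(nrow(data)), times = times)
  out <- data[idx, setdiff(names(data), expand), drop = FALSE]
  rownames(out) <- NULL
  out
}

d <- data.frame(a = c("a", "b", "c"), b = 1:3, rep = c(3, 2, 4))
res <- replicate_rows(d, "rep")
nrow(res)
```

Missing values in the other columns are simply carried along with the replicated rows; only missing counts need special handling.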


codecov bot commented Mar 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.73%. Comparing base (26ca506) to head (caf70e1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #488      +/-   ##
==========================================
+ Coverage   90.66%   90.73%   +0.06%     
==========================================
  Files          74       75       +1     
  Lines        5765     5805      +40     
==========================================
+ Hits         5227     5267      +40     
  Misses        538      538              


@strengejacke
Member Author

Alternatively, we could name that function data_replicate()?

@etiennebacher
Member

etiennebacher commented Mar 20, 2024

Thanks, I'll try to review this in more depth tomorrow, but I can already say that I'm not a big fan of data_expand() as a name. It makes me think of expand.grid(), which does something very different, if I understand this new function correctly. I think data_replicate() would be a better name. Another option would be to name it data_uncount().

@strengejacke
Member Author

I'm fine with both data_replicate() and data_uncount(), though data_replicate() feels more appropriate to me (maybe because I'm not a native speaker).

@strengejacke
Member Author

The function really just replicates rows:

library(datawizard)

d <- data.frame(
  a = c("a", "b", "c"),
  b = 1:3,
  rep = c(3, 2, 4)
)
data_tabulate(d$a)
#> d$a <character>
#> # total N=3 valid N=3
#> 
#> Value | N | Raw % | Valid % | Cumulative %
#> ------+---+-------+---------+-------------
#> a     | 1 | 33.33 |   33.33 |        33.33
#> b     | 1 | 33.33 |   33.33 |        66.67
#> c     | 1 | 33.33 |   33.33 |       100.00
#> <NA>  | 0 |  0.00 |    <NA> |         <NA>

data_expand(d, "rep")
#>   a b
#> 1 a 1
#> 2 a 1
#> 3 a 1
#> 4 b 2
#> 5 b 2
#> 6 c 3
#> 7 c 3
#> 8 c 3
#> 9 c 3

data_tabulate(data_expand(d, "rep")$a)
#> data_expand(d, "rep")$a <character>
#> # total N=9 valid N=9
#> 
#> Value | N | Raw % | Valid % | Cumulative %
#> ------+---+-------+---------+-------------
#> a     | 3 | 33.33 |   33.33 |        33.33
#> b     | 2 | 22.22 |   22.22 |        55.56
#> c     | 4 | 44.44 |   44.44 |       100.00
#> <NA>  | 0 |  0.00 |    <NA> |         <NA>

Created on 2024-03-20 with reprex v2.1.0

@strengejacke strengejacke mentioned this pull request Mar 22, 2024
Member

@etiennebacher etiennebacher left a comment

Thanks, LGTM, just some minor docs and code tweaks to make. For the name, I'm OK with data_replicate() instead of data_expand().

@etiennebacher etiennebacher changed the title Draft data_expand() Implement data_replicate() Mar 22, 2024
Member

@etiennebacher etiennebacher left a comment

Actually, can you add some documentation and tests for cases where the "expand" column is not an integer? Here's what I have so far for floats (takes the floor value), characters (dirty error), and factors (uses the underlying integer value):

library(datawizard)

foo <- data.frame(
  float = c(1.1, 1.8),
  char = c("a", "b"),
  factor = factor(c("a", "b"))
)

data_replicate(foo, "float")
#>   char factor
#> 1    a      1
#> 2    b      2

data_replicate(foo, "char")
#> Warning in FUN(X[[i]], ...): NAs introduced by coercion
#> Error in FUN(X[[i]], ...): invalid 'times' value

data_replicate(foo, "factor")
#>   float char
#> 1   1.1    a
#> 2   1.8    b
#> 3   1.8    b
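
One way to make these three cases explicit is to coerce the count column up front. The sketch below is an assumption for illustration only: `as_times()` is a hypothetical helper, not the check that was added to the package. It relies on two base R facts: `rep()` truncates non-integer `times` toward zero, and `as.integer()` on a factor returns its integer codes.

```r
# Hypothetical validator; an assumption, not the check datawizard performs.
as_times <- function(x) {
  if (is.factor(x)) {
    # Factors carry integer codes; as.integer() returns the codes, not the labels.
    return(as.integer(x))
  }
  if (!is.numeric(x)) {
    # Fail early with a readable message instead of a coercion warning.
    stop("The replication column must be numeric.", call. = FALSE)
  }
  # Truncate toward zero, as rep() does for non-integer `times`.
  as.integer(trunc(x))
}

float_times <- as_times(c(1.1, 1.8))          # 1 1
factor_times <- as_times(factor(c("a", "b"))) # 1 2
```

With a check like this, a character column would produce a single clear error rather than the warning plus "invalid 'times' value" pair shown above.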

Member

@etiennebacher etiennebacher left a comment

Thanks, just a minor comment to add

Co-authored-by: Etienne Bacher <[email protected]>
@etiennebacher
Member

One failure is due to some setup issues in the CI, the other is because of a segfault that seems unrelated to this PR.

@etiennebacher etiennebacher merged commit 9c2deb7 into main Mar 23, 2024
25 of 27 checks passed
@etiennebacher etiennebacher deleted the data_expand branch March 23, 2024 13:18