Implement `data_replicate()` #488

strengejacke · 2024-03-20T17:56:45Z

Today I needed a function that repeats each row according to the values in another column (similar to tidyr::uncount()). Here's a quick implementation, wdyt?

library(datawizard)
data(mtcars)
d <- as.data.frame(head(mtcars))
data_expand(d, "carb")
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  22.8   4  108  93 3.85 2.320 18.61  1  1    4
#> 10 21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 11 18.7   8  360 175 3.15 3.440 17.02  0  0    3
#> 12 18.7   8  360 175 3.15 3.440 17.02  0  0    3
#> 13 18.1   6  225 105 2.76 3.460 20.22  1  0    3

d$mpg[5] <- NA
data_expand(d, "carb")
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  22.8   4  108  93 3.85 2.320 18.61  1  1    4
#> 10 21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 11   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 12   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 13 18.1   6  225 105 2.76 3.460 20.22  1  0    3

d$carb[3] <- NA
data_expand(d, "carb")
#> Error: The column provided in `expand` contains missing values, but `remove_na`
#>   is set to `FALSE`.
#>   Please set `remove_na` to `TRUE` or remove the missing values from the
#>   data frame.

data_expand(d, "carb", remove_na = TRUE)
#>     mpg cyl disp  hp drat    wt  qsec vs am gear
#> 1  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 2  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 3  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 4  21.0   6  160 110 3.90 2.620 16.46  0  1    4
#> 5  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 6  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 7  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 8  21.0   6  160 110 3.90 2.875 17.02  0  1    4
#> 9  21.4   6  258 110 3.08 3.215 19.44  1  0    3
#> 10   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 11   NA   8  360 175 3.15 3.440 17.02  0  0    3
#> 12 18.1   6  225 105 2.76 3.460 20.22  1  0    3

Created on 2024-03-20 with reprex v2.1.0
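For context, the core of this behaviour can be approximated in base R by indexing rows with `rep()`. The following is only a sketch under stated assumptions, not the code from this PR: `replicate_rows()` is a hypothetical helper, and its error message and `remove_na` handling merely imitate the output above.

```r
# Hypothetical helper, NOT the datawizard implementation: repeat each row
# of `data` as many times as given in the column `expand`, then drop that column.
replicate_rows <- function(data, expand, remove_na = FALSE) {
  counts <- data[[expand]]
  if (anyNA(counts)) {
    if (!remove_na) {
      stop("`", expand, "` contains missing values.", call. = FALSE)
    }
    # drop rows whose count is missing
    data <- data[!is.na(counts), , drop = FALSE]
    counts <- data[[expand]]
  }
  # build a row index in which row i appears counts[i] times
  idx <- rep(seq_len(nrow(data)), times = counts)
  out <- data[idx, setdiff(names(data), expand), drop = FALSE]
  rownames(out) <- NULL
  out
}

d <- as.data.frame(head(mtcars))
nrow(replicate_rows(d, "carb"))
```

Indexing with a repeated row vector replicates all columns in one step, and the count column is dropped to match the output shown in the reprex.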

codecov · 2024-03-20T18:03:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.73%. Comparing base (26ca506) to head (caf70e1).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #488      +/-   ##
==========================================
+ Coverage   90.66%   90.73%   +0.06%     
==========================================
  Files          74       75       +1     
  Lines        5765     5805      +40     
==========================================
+ Hits         5227     5267      +40     
  Misses        538      538

strengejacke · 2024-03-20T19:15:27Z

Alternatively, we could name that function data_replicate()?

etiennebacher · 2024-03-20T21:05:40Z

Thanks, I'll try to review this in more depth tomorrow, but I can already say that I'm not a big fan of data_expand() as a name. It makes me think of expand.grid(), which does something very different, if I understand this new function correctly. I think data_replicate() would be a better name. Another alternative would be to just name it data_uncount()?
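To make the naming concern concrete: base R's `expand.grid()` builds every combination of its inputs, whereas the function proposed here repeats rows that already exist. A small base-R illustration follows; the data frame `d` and its count column `n` are made up for this example, and the `rep()` line only approximates the new function.

```r
# Toy data, invented for illustration only.
d <- data.frame(a = c("x", "y"), n = c(2, 3))

# expand.grid() crosses its inputs: every value of `a` with every value of `n`.
grid <- expand.grid(a = d$a, n = d$n)
nrow(grid) # 2 * 2 = 4 combinations, including ("x", 3) and ("y", 2)

# Row replication repeats existing rows by the count in `n`.
replicated <- d[rep(seq_len(nrow(d)), times = d$n), "a", drop = FALSE]
nrow(replicated) # 2 + 3 = 5 rows, all copies of rows already in `d`
```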

strengejacke · 2024-03-20T21:15:10Z

I'm fine with both data_replicate() and data_uncount(), though - maybe because I'm no native speaker - data_replicate() feels more appropriate.

strengejacke · 2024-03-20T21:18:01Z

The function really just replicates rows:

library(datawizard)

d <- data.frame(
  a = c("a", "b", "c"),
  b = 1:3,
  rep = c(3, 2, 4)
)
data_tabulate(d$a)
#> d$a <character>
#> # total N=3 valid N=3
#> 
#> Value | N | Raw % | Valid % | Cumulative %
#> ------+---+-------+---------+-------------
#> a     | 1 | 33.33 |   33.33 |        33.33
#> b     | 1 | 33.33 |   33.33 |        66.67
#> c     | 1 | 33.33 |   33.33 |       100.00
#> <NA>  | 0 |  0.00 |    <NA> |         <NA>

data_expand(d, "rep")
#>   a b
#> 1 a 1
#> 2 a 1
#> 3 a 1
#> 4 b 2
#> 5 b 2
#> 6 c 3
#> 7 c 3
#> 8 c 3
#> 9 c 3

data_tabulate(data_expand(d, "rep")$a)
#> data_expand(d, "rep")$a <character>
#> # total N=9 valid N=9
#> 
#> Value | N | Raw % | Valid % | Cumulative %
#> ------+---+-------+---------+-------------
#> a     | 3 | 33.33 |   33.33 |        33.33
#> b     | 2 | 22.22 |   22.22 |        55.56
#> c     | 4 | 44.44 |   44.44 |       100.00
#> <NA>  | 0 |  0.00 |    <NA> |         <NA>

Created on 2024-03-20 with reprex v2.1.0

etiennebacher

Thanks, LGTM, just some minor docs and code tweaks to make. For the name, I'm OK with data_replicate() instead of data_expand().

R/data_expand.R

tests/testthat/test-data_expand.R

Co-authored-by: Etienne Bacher <[email protected]>

etiennebacher

Actually, can you add some documentation and tests for cases where the "expand" column is not an integer? Here's what I have so far for floats (takes the floor value), character (dirty error), and factor (uses the underlying integer codes):

library(datawizard)

foo <- data.frame(
  float = c(1.1, 1.8),
  char = c("a", "b"),
  factor = factor(c("a", "b"))
)

data_replicate(foo, "float")
#>   char factor
#> 1    a      1
#> 2    b      2

data_replicate(foo, "char")
#> Warning in FUN(X[[i]], ...): NAs introduced by coercion
#> Error in FUN(X[[i]], ...): invalid 'times' value

data_replicate(foo, "factor")
#>   float char
#> 1   1.1    a
#> 2   1.8    b
#> 3   1.8    b
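The three cases above suggest validating the column before replicating. A minimal sketch of such a check in base R is given here; `check_counts()` is a hypothetical name, and rejecting non-whole numbers (rather than flooring them) is an assumption, not necessarily what the merged code does.

```r
# Hypothetical validator, not part of datawizard: accept only numeric,
# non-negative whole-number counts and fail early with a readable message.
check_counts <- function(x, name = "expand") {
  if (is.factor(x) || !is.numeric(x)) {
    stop("`", name, "` must be numeric, not ", class(x)[1], ".", call. = FALSE)
  }
  ok <- !is.na(x)
  if (any(x[ok] < 0) || any(x[ok] != round(x[ok]))) {
    stop("`", name, "` must contain non-negative whole numbers.", call. = FALSE)
  }
  invisible(TRUE)
}

foo <- data.frame(
  float = c(1.1, 1.8),
  char = c("a", "b"),
  factor = factor(c("a", "b"))
)

check_counts(c(3, 2, 4))                # passes silently
try(check_counts(foo$float, "float"))   # error: not whole numbers
try(check_counts(foo$char, "char"))     # error: character input
try(check_counts(foo$factor, "factor")) # error: factor input
```

Failing early gives a clearer message than the `invalid 'times' value` error from `rep()`, and it avoids silently using factor codes as counts.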

etiennebacher

Thanks, just a minor comment to add

R/data_replicate.R

Co-authored-by: Etienne Bacher <[email protected]>

etiennebacher · 2024-03-23T13:18:24Z

One failure due to some setup issues in the CI, one because of a segfault that seems unrelated to this.

strengejacke added 4 commits March 20, 2024 18:56

Draft data_expand()

c967b6c

Merge branch 'main' into data_expand

e48f3e8

fix

1ccfd42

add tests

ef5e1e6

strengejacke requested a review from etiennebacher March 20, 2024 18:33

news, desc

b83e2d8

strengejacke mentioned this pull request Mar 22, 2024

CRAN submission? #489

Closed

etiennebacher approved these changes Mar 22, 2024

etiennebacher changed the title ~~Draft data_expand()~~ Implement data_replicate() Mar 22, 2024

strengejacke and others added 8 commits March 22, 2024 15:58

Update R/data_expand.R

49b4527

Co-authored-by: Etienne Bacher <[email protected]>

Update R/data_expand.R

500bdb9

Co-authored-by: Etienne Bacher <[email protected]>

Update R/data_expand.R

bb14d43

Co-authored-by: Etienne Bacher <[email protected]>

Update tests/testthat/test-data_expand.R

623b626

Co-authored-by: Etienne Bacher <[email protected]>

Update tests/testthat/test-data_expand.R

ec881c4

Co-authored-by: Etienne Bacher <[email protected]>

rename

a9f0ec6

update namespace and RD

afc090f

make styler happy [skip ci]

074c807

etiennebacher requested changes Mar 23, 2024

strengejacke added 3 commits March 23, 2024 11:09

check for integer

5de51d9

fix

c2ccd12

Update data_replicate.R

0f87577

etiennebacher approved these changes Mar 23, 2024

Update R/data_replicate.R

caf70e1

Co-authored-by: Etienne Bacher <[email protected]>

etiennebacher merged commit 9c2deb7 into main Mar 23, 2024
25 of 27 checks passed

etiennebacher deleted the data_expand branch March 23, 2024 13:18
