Extending tibble: A case for `tibble_reconstruct()` #890

DavisVaughan · 2021-05-17T15:02:58Z

@jennybc and I have both extended tibble when creating custom subclasses (see googledrive, tune, rsample, and workflowsets). Our focus was on extending tibble to work with dplyr and vctrs, which generally follows the advice outlined in this rough help document.

Jenny mentioned that tibble could provide a hook to make it easier for package authors to extend tibble. This would work in a way that is very similar to dplyr_reconstruct(), and tibble would automatically call this S3 generic hook at the end of functions that might invalidate the invariants of a tibble subclass. It would look like:

tibble_reconstruct <- function(x, to) {
  UseMethod("tibble_reconstruct", to)
}

For example, if my subclass requires a special index column, and [.tbl_df drops that index column, then ideally the result would be a bare tibble, and would no longer inherit from my subclass. If tibble called tibble_reconstruct(x = out, to = x) at the end of [.tbl_df, then tibble_reconstruct.mysubclass(x, to) could contain all of the logic required to decide if x could be reconstructed to the class of to, or if it should fall back to returning a bare tibble.

This is related to and probably supersedes @hadley's issue in #275.

This would simplify the advice given by dplyr, which currently suggests that [ and names<- methods are required to be compatible with dplyr, along with a dplyr_reconstruct() method. Instead, it could advise that if you are extending tibble, you only need a tibble_reconstruct() method (we would still keep the current advice for the case where you are only extending data.frame).

This also aligns perfectly with the conventions that are already arising for adding support for vctrs and dplyr. We always start by creating mysubclass_maybe_reconstruct() and mysubclass_is_reconstructable() helpers which have all the logic for either reconstructing to a mysubclass or returning a bare tibble:
https://github.com/tidyverse/googledrive/blob/f2f090156236187b803c5ae26afb159bd4f78580/R/compat-dplyr.R#L1-L10

This gets used:

It would make sense for this to be the guts of the tibble_reconstruct() method, then we could remove the [ and names<- methods.

The following methods would need to call tibble_reconstruct():

[.tbl_df
[<-.tbl_df
$<-.tbl_df
[[<-.tbl_df
names<-.tbl_df

There may be others, but I think this would get us pretty far.

If all of this gets added, I could see two tibble vignettes coming out of this:

Extending tibble
- Describing tibble_reconstruct(), what it is used for, and how to add a method
- Encourage adding a standalone mysubclass_reconstruct() that gets used in the tibble_reconstruct() method
Adding tibble subclass compat for dplyr and vctrs
- Assume that users read the above vignette and have mysubclass_reconstruct() ready to go
- Talk about implementing vec_restore(), vec_ptype2(), and vec_cast() methods
- Talk about implementing a dplyr_reconstruct() method

The text was updated successfully, but these errors were encountered:

jennybc · 2021-05-17T15:14:19Z

Thanks @DavisVaughan for opening this. We discussed this, as he coached me through googledrive's recent refresh of how I do "dplyr compatibility". The previous implementation pre-dated vctrs and basically most of our current technology for this. And, of course, now it's not just "dplyr compatibility" but also vctrs and they more I thought about what I was doing ... I wondered why isn't some of this wired in at the tibble level? Which is how we got here.

As one more concrete detail, what really made this need clear for me is that I think many users are much more likely to remove a variable with:

dat$drive_resource <- NULL

than via

dat <- dplyr::select(dat, -drive_resource)

and, in both cases, for a dribble, this should force fallback to a bare tibble, because drive_resource is a required column. But given our current emphasis on implementing dplyr and vctrs methods, this feels like a big hole in the implementation of a tibble subclass.

krlmlr · 2021-05-21T04:52:50Z

Thanks. I agree that we should add an extension point here.

Can you please remind me why we can't implement vctrs::vec_restore() for tibbles?

CC @lionel-.

lionel- · 2021-05-21T07:08:18Z

@krlmlr proxy + restore for data frame subclasses is very challenging. I have explored how to use these operations to encode various data frame semantics (dynamically and statically grouped tibbles) and got quite far, but the extra complexity in vctrs was daunting. The dplyr generic API is an additional layer that aims to make it sufficiently flexible and simple to create these subclasses.

@DavisVaughan What would happen if tibble called dplyr_reconstruct() if available? Or if dplyr_reconstruct() was exported from tibble? In other words, could there be only one generic?

DavisVaughan · 2021-05-21T12:44:16Z

Looking at how dplyr_reconstruct() is used in dplyr, it seems like it would be possible to have a single generic, tibble_reconstruct(), that dplyr uses instead of dplyr_reconstruct(). I'm not sure what happens to dplyr_row_slice() and dplyr_col_modify(). Maybe those also move to tibble, and tibble would use something like tibble_row_slice() when subsetting rows with [.tbl_df.

Also worth thinking about whether these should just be called df_reconstruct(), since in dplyr you can subclass a "data.frame" and provide a dplyr_reconstruct() method and everything would still work. It doesn't have to be a tibble. That could imply that the df_*() generics could live in vctrs alongside the df_*.data.frame() methods, and then tibble and dplyr would use them, with dplyr also providing methods for grouped_df and rowwise_df.

lionel- · 2021-05-21T12:50:11Z

I think that makes sense in the long term, but is the dplyr API sufficiently stable to graduate it to vctrs? These generics could also live in another package, that might be better for documentation and visibility.

Perhaps the first step is to use dplyr_reconstruct() (if available) from tibble methods? This would allow us to further experiment with the dplyr API.

krlmlr · 2021-05-24T02:09:40Z

@DavisVaughan: Is this a case for brushing up https://github.com/DavisVaughan/standardize and implementing the generic there? Happy to spend time on that package.

How would we use dplyr_reconstruct() only if available? If tibble always loads dplyr if installed, this might have unintended consequences; if it's only used when dplyr is loaded, behavior depends on the loaded packages. None of these options appeals to me, I might be missing something though.

lionel- · 2021-05-25T06:26:07Z

How would we use dplyr_reconstruct() only if available? If tibble always loads dplyr if installed, this might have unintended consequences; if it's only used when dplyr is loaded, behavior depends on the loaded packages. None of these options appeals to me, I might be missing something though.

The implementors of dplyr::dplyr_reconstruct() are importing dplyr so there would be no hygiene issue here.

krlmlr · 2021-07-29T05:10:14Z

dplyr_reconstruct() and friends are used a lot already. We could:

Implement baseline versions for dplyr_reconstruct(), dplyr_row_slice() and dplyr_col_modify() that are used if dplyr is not loaded
Use on_package_load() from https://github.com/r-lib/rlang/blob/master/R/aaa.R to overwrite these baseline version with forwarders to the dplyr methods

I'm a bit worried that this will impact performance if users use [ in loops. I guess we'll see, we can't have both flexibility and super-fast code here I suspect.

- `x[j] <- list(name = value)` uses name repair when new columns are created (#890).

krlmlr · 2021-10-24T05:05:09Z

I have a WIP pull request, #937, that shows a substantial slowdown (3x to 5x) if we add this. The results are old, but a recent comparison tells the same story: https://rpubs.com/krlmlr/tibble-reconstruct-benchmark .

Can we afford this?

krlmlr added this to the 3.1.4 milestone Jul 29, 2021

krlmlr added a commit that referenced this issue Jul 31, 2021

Merge pull request #929 from tidyverse/f-890-refactor

45bccc8

- `x[j] <- list(name = value)` uses name repair when new columns are created (#890).

This was referenced Jul 31, 2021

Refactorings #930

Merged

Provide rbind() and cbind() methods #909

Draft

Call tibble_reconstruct() after add_row() #931

Open

Refactor #934

Merged

Refactor #935

Merged

jennybc mentioned this issue Jul 31, 2021

Name repair after column assignment #933

Closed

krlmlr mentioned this issue Jul 31, 2021

WIP: tibble_reconstruct(), tibble_row_slice(), tibble_col_modify() #937

Open

krlmlr modified the milestones: 3.1.4, 3.1.5 Aug 7, 2021

krlmlr modified the milestones: 3.1.5, 3.1.6 Sep 26, 2021

krlmlr mentioned this issue Oct 25, 2021

Clarify how to extend tibbles #275

Closed

krlmlr modified the milestones: 3.1.6, 3.1.7 Oct 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extending tibble: A case for `tibble_reconstruct()` #890

Extending tibble: A case for `tibble_reconstruct()` #890

DavisVaughan commented May 17, 2021

jennybc commented May 17, 2021

krlmlr commented May 21, 2021

lionel- commented May 21, 2021

DavisVaughan commented May 21, 2021

lionel- commented May 21, 2021

krlmlr commented May 24, 2021

lionel- commented May 25, 2021

krlmlr commented Jul 29, 2021

krlmlr commented Oct 24, 2021

Extending tibble: A case for tibble_reconstruct() #890

Extending tibble: A case for tibble_reconstruct() #890

Comments

DavisVaughan commented May 17, 2021

jennybc commented May 17, 2021

krlmlr commented May 21, 2021

lionel- commented May 21, 2021

DavisVaughan commented May 21, 2021

lionel- commented May 21, 2021

krlmlr commented May 24, 2021

lionel- commented May 25, 2021

krlmlr commented Jul 29, 2021

krlmlr commented Oct 24, 2021

Extending tibble: A case for `tibble_reconstruct()` #890

Extending tibble: A case for `tibble_reconstruct()` #890