Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extending tibble: A case for tibble_reconstruct() #890

Open
DavisVaughan opened this issue May 17, 2021 · 9 comments
Open

Extending tibble: A case for tibble_reconstruct() #890

DavisVaughan opened this issue May 17, 2021 · 9 comments
Milestone

Comments

@DavisVaughan
Copy link
Member

@jennybc and I have both extended tibble when creating custom subclasses (see googledrive, tune, rsample, and workflowsets). Our focus was on extending tibble to work with dplyr and vctrs, which generally follows the advice outlined in this rough help document.

Jenny mentioned that tibble could provide a hook to make it easier for package authors to extend tibble. This would work in a way that is very similar to dplyr_reconstruct(), and tibble would automatically call this S3 generic hook at the end of functions that might invalidate the invariants of a tibble subclass. It would look like:

tibble_reconstruct <- function(x, to) {
  UseMethod("tibble_reconstruct", to)
}

For example, if my subclass requires a special index column, and [.tbl_df drops that index column, then ideally the result would be a bare tibble, and would no longer inherit from my subclass. If tibble called tibble_reconstruct(x = out, to = x) at the end of [.tbl_df, then tibble_reconstruct.mysubclass(x, to) could contain all of the logic required to decide if x could be reconstructed to the class of to, or if it should fall back to returning a bare tibble.

This is related to and probably supersedes @hadley's issue in #275.


This would simplify the advice given by dplyr, which currently suggests that [ and names<- methods are required to be compatible with dplyr, along with a dplyr_reconstruct() method. Instead, it could advise that if you are extending tibble, you only need a tibble_reconstruct() method (we would still keep the current advice for the case where you are only extending data.frame).


This also aligns perfectly with the conventions that are already arising for adding support for vctrs and dplyr. We always start by creating mysubclass_maybe_reconstruct() and mysubclass_is_reconstructable() helpers which have all the logic for either reconstructing to a mysubclass or returning a bare tibble:
https://github.com/tidyverse/googledrive/blob/f2f090156236187b803c5ae26afb159bd4f78580/R/compat-dplyr.R#L1-L10

This gets used:

It would make sense for this to be the guts of the tibble_reconstruct() method, then we could remove the [ and names<- methods.


The following methods would need to call tibble_reconstruct():

  • [.tbl_df
  • [<-.tbl_df
  • $<-.tbl_df
  • [[<-.tbl_df
  • names<-.tbl_df

There may be others, but I think this would get us pretty far.

If all of this gets added, I could see two tibble vignettes coming out of this:

  • Extending tibble
    • Describing tibble_reconstruct(), what it is used for, and how to add a method
    • Encourage adding a standalone mysubclass_reconstruct() that gets used in the tibble_reconstruct() method
  • Adding tibble subclass compat for dplyr and vctrs
    • Assume that users read the above vignette and have mysubclass_reconstruct() ready to go
    • Talk about implementing vec_restore(), vec_ptype2(), and vec_cast() methods
    • Talk about implementing a dplyr_reconstruct() method
@jennybc
Copy link
Member

jennybc commented May 17, 2021

Thanks @DavisVaughan for opening this. We discussed this, as he coached me through googledrive's recent refresh of how I do "dplyr compatibility". The previous implementation pre-dated vctrs and basically most of our current technology for this. And, of course, now it's not just "dplyr compatibility" but also vctrs and they more I thought about what I was doing ... I wondered why isn't some of this wired in at the tibble level? Which is how we got here.

As one more concrete detail, what really made this need clear for me is that I think many users are much more likely to remove a variable with:

dat$drive_resource <- NULL

than via

dat <- dplyr::select(dat, -drive_resource)

and, in both cases, for a dribble, this should force fallback to a bare tibble, because drive_resource is a required column. But given our current emphasis on implementing dplyr and vctrs methods, this feels like a big hole in the implementation of a tibble subclass.

@krlmlr
Copy link
Member

krlmlr commented May 21, 2021

Thanks. I agree that we should add an extension point here.

Can you please remind me why we can't implement vctrs::vec_restore() for tibbles?

CC @lionel-.

@lionel-
Copy link
Member

lionel- commented May 21, 2021

@krlmlr proxy + restore for data frame subclasses is very challenging. I have explored how to use these operations to encode various data frame semantics (dynamically and statically grouped tibbles) and got quite far, but the extra complexity in vctrs was daunting. The dplyr generic API is an additional layer that aims to make it sufficiently flexible and simple to create these subclasses.

@DavisVaughan What would happen if tibble called dplyr_reconstruct() if available? Or if dplyr_reconstruct() was exported from tibble? In other words, could there be only one generic?

@DavisVaughan
Copy link
Member Author

Looking at how dplyr_reconstruct() is used in dplyr, it seems like it would be possible to have a single generic, tibble_reconstruct(), that dplyr uses instead of dplyr_reconstruct(). I'm not sure what happens to dplyr_row_slice() and dplyr_col_modify(). Maybe those also move to tibble, and tibble would use something like tibble_row_slice() when subsetting rows with [.tbl_df.

Also worth thinking about whether these should just be called df_reconstruct(), since in dplyr you can subclass a "data.frame" and provide a dplyr_reconstruct() method and everything would still work. It doesn't have to be a tibble. That could imply that the df_*() generics could live in vctrs alongside the df_*.data.frame() methods, and then tibble and dplyr would use them, with dplyr also providing methods for grouped_df and rowwise_df.

@lionel-
Copy link
Member

lionel- commented May 21, 2021

I think that makes sense in the long term, but is the dplyr API sufficiently stable to graduate it to vctrs? These generics could also live in another package, that might be better for documentation and visibility.

Perhaps the first step is to use dplyr_reconstruct() (if available) from tibble methods? This would allow us to further experiment with the dplyr API.

@krlmlr
Copy link
Member

krlmlr commented May 24, 2021

@DavisVaughan: Is this a case for brushing up https://github.com/DavisVaughan/standardize and implementing the generic there? Happy to spend time on that package.

How would we use dplyr_reconstruct() only if available? If tibble always loads dplyr if installed, this might have unintended consequences; if it's only used when dplyr is loaded, behavior depends on the loaded packages. None of these options appeals to me, I might be missing something though.

@lionel-
Copy link
Member

lionel- commented May 25, 2021

How would we use dplyr_reconstruct() only if available? If tibble always loads dplyr if installed, this might have unintended consequences; if it's only used when dplyr is loaded, behavior depends on the loaded packages. None of these options appeals to me, I might be missing something though.

The implementors of dplyr::dplyr_reconstruct() are importing dplyr so there would be no hygiene issue here.

@krlmlr
Copy link
Member

krlmlr commented Jul 29, 2021

dplyr_reconstruct() and friends are used a lot already. We could:

  • Implement baseline versions for dplyr_reconstruct(), dplyr_row_slice() and dplyr_col_modify() that are used if dplyr is not loaded
  • Use on_package_load() from https://github.com/r-lib/rlang/blob/master/R/aaa.R to overwrite these baseline version with forwarders to the dplyr methods

I'm a bit worried that this will impact performance if users use [ in loops. I guess we'll see, we can't have both flexibility and super-fast code here I suspect.

@krlmlr krlmlr added this to the 3.1.4 milestone Jul 29, 2021
krlmlr added a commit that referenced this issue Jul 31, 2021
- `x[j] <- list(name = value)` uses name repair when new columns are created (#890).
@krlmlr krlmlr modified the milestones: 3.1.4, 3.1.5 Aug 7, 2021
@krlmlr krlmlr modified the milestones: 3.1.5, 3.1.6 Sep 26, 2021
@krlmlr
Copy link
Member

krlmlr commented Oct 24, 2021

I have a WIP pull request, #937, that shows a substantial slowdown (3x to 5x) if we add this. The results are old, but a recent comparison tells the same story: https://rpubs.com/krlmlr/tibble-reconstruct-benchmark .

Can we afford this?

@krlmlr krlmlr modified the milestones: 3.1.6, 3.1.7 Oct 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants