-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extending tibble: A case for tibble_reconstruct()
#890
Comments
Thanks @DavisVaughan for opening this. We discussed this, as he coached me through googledrive's recent refresh of how I do "dplyr compatibility". The previous implementation pre-dated vctrs and basically most of our current technology for this. And, of course, now it's not just "dplyr compatibility" but also vctrs and they more I thought about what I was doing ... I wondered why isn't some of this wired in at the tibble level? Which is how we got here. As one more concrete detail, what really made this need clear for me is that I think many users are much more likely to remove a variable with: dat$drive_resource <- NULL than via dat <- dplyr::select(dat, -drive_resource) and, in both cases, for a |
Thanks. I agree that we should add an extension point here. Can you please remind me why we can't implement CC @lionel-. |
@krlmlr proxy + restore for data frame subclasses is very challenging. I have explored how to use these operations to encode various data frame semantics (dynamically and statically grouped tibbles) and got quite far, but the extra complexity in vctrs was daunting. The dplyr generic API is an additional layer that aims to make it sufficiently flexible and simple to create these subclasses. @DavisVaughan What would happen if tibble called |
Looking at how Also worth thinking about whether these should just be called |
I think that makes sense in the long term, but is the dplyr API sufficiently stable to graduate it to vctrs? These generics could also live in another package, that might be better for documentation and visibility. Perhaps the first step is to use |
@DavisVaughan: Is this a case for brushing up https://github.com/DavisVaughan/standardize and implementing the generic there? Happy to spend time on that package. How would we use |
The implementors of |
I'm a bit worried that this will impact performance if users use |
- `x[j] <- list(name = value)` uses name repair when new columns are created (#890).
I have a WIP pull request, #937, that shows a substantial slowdown (3x to 5x) if we add this. The results are old, but a recent comparison tells the same story: https://rpubs.com/krlmlr/tibble-reconstruct-benchmark . Can we afford this? |
@jennybc and I have both extended tibble when creating custom subclasses (see googledrive, tune, rsample, and workflowsets). Our focus was on extending tibble to work with dplyr and vctrs, which generally follows the advice outlined in this rough help document.
Jenny mentioned that tibble could provide a hook to make it easier for package authors to extend tibble. This would work in a way that is very similar to
dplyr_reconstruct()
, and tibble would automatically call this S3 generic hook at the end of functions that might invalidate the invariants of a tibble subclass. It would look like:For example, if my subclass requires a special
index
column, and[.tbl_df
drops that index column, then ideally the result would be a bare tibble, and would no longer inherit from my subclass. If tibble calledtibble_reconstruct(x = out, to = x)
at the end of[.tbl_df
, thentibble_reconstruct.mysubclass(x, to)
could contain all of the logic required to decide ifx
could be reconstructed to the class ofto
, or if it should fall back to returning a bare tibble.This is related to and probably supersedes @hadley's issue in #275.
This would simplify the advice given by dplyr, which currently suggests that
[
andnames<-
methods are required to be compatible with dplyr, along with adplyr_reconstruct()
method. Instead, it could advise that if you are extending tibble, you only need atibble_reconstruct()
method (we would still keep the current advice for the case where you are only extending data.frame).This also aligns perfectly with the conventions that are already arising for adding support for vctrs and dplyr. We always start by creating
mysubclass_maybe_reconstruct()
andmysubclass_is_reconstructable()
helpers which have all the logic for either reconstructing to a mysubclass or returning a bare tibble:https://github.com/tidyverse/googledrive/blob/f2f090156236187b803c5ae26afb159bd4f78580/R/compat-dplyr.R#L1-L10
This gets used:
vec_restore()
methodvec_ptype2()
andvec_cast()
methodsdplyr_reconstruct()
method[
andnames<-
It would make sense for this to be the guts of the
tibble_reconstruct()
method, then we could remove the[
andnames<-
methods.The following methods would need to call
tibble_reconstruct()
:[.tbl_df
[<-.tbl_df
$<-.tbl_df
[[<-.tbl_df
names<-.tbl_df
There may be others, but I think this would get us pretty far.
If all of this gets added, I could see two tibble vignettes coming out of this:
tibble_reconstruct()
, what it is used for, and how to add a methodmysubclass_reconstruct()
that gets used in thetibble_reconstruct()
methodmysubclass_reconstruct()
ready to govec_restore()
,vec_ptype2()
, andvec_cast()
methodsdplyr_reconstruct()
methodThe text was updated successfully, but these errors were encountered: