Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching reference objects #2176

Open
3 tasks done
rhijmans opened this issue Sep 27, 2022 · 11 comments · May be fixed by #2340
Open
3 tasks done

Caching reference objects #2176

rhijmans opened this issue Sep 27, 2022 · 11 comments · May be fixed by #2340
Assignees

Comments

@rhijmans
Copy link

rhijmans commented Sep 27, 2022

Is there a way, or could support be added, to tell knitr how to serialize objects of a particular class before caching, and how to restore them after caching? See this and this questions on stackoverflow for more context

The "terra" package uses S4 objects with one slot that has a reference to a C++ object (via a Rcpp module). These S4 objects can be serialized and unserialized with very little effort. For example, one can do

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
x <- serialize(r, NULL) 
# recreate the object 
s <- terra::rast( unserialize(x) )  
s
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

But can one use that mechanism with knitr caching? I suppose that it could be possible, in principle, to specify that objects of class "SpatRaster" should be passed to serialize before caching, and that terra::rast should be used when restoring the object. Whether this is easy to do in practice, I do not know.


By filing an issue to this repo, I promise that

  • I have fully read the issue guide at https://yihui.org/issue/.
  • I have provided the necessary information about my issue.
    • If I'm asking a question, I have already asked it on Stack Overflow or RStudio Community, waited for at least 24 hours, and included a link to my question there.
    • If I'm filing a bug report, I have included a minimal, self-contained, and reproducible example, and have also included xfun::session_info('knitr'). I have upgraded all my packages to their latest versions (e.g., R, RStudio, and R packages), and also tried the development version: remotes::install_github('yihui/knitr').
    • If I have posted the same issue elsewhere, I have also mentioned it in this issue.
  • I have learned the Github Markdown syntax, and formatted my issue correctly.

I understand that my issue may be closed if I don't fulfill my promises.

@yihui

This comment was marked as outdated.

@yihui
Copy link
Owner

yihui commented Sep 27, 2022

Okay, I have a better idea on how to implement this now. Basically, as you hinted in the SO posts, neither save() + load() or tools:::makeLazyLoadDB() + lazyLoad() works for these raster objects, but saveRDS() + readRDS() works. That means I can cache to .rds files instead of .RData.

However, again, as you mentioned, users would need to call terra::rast() on the value from readRDS(). Although I could do that automatically in knitr, I would prefer not to deal with this type of special cases unless I must. I wonder if it's possible for the terra package to automatically deal with packed objects. I'm asking since I just realized who you are 20 minutes ago :)

@rhijmans
Copy link
Author

rhijmans commented Sep 27, 2022

I agree that you should not have to deal with special cases. Here are some ideas for "terra" and some new things that I have implemented. Feedback appreciated.

  1. It would be possible, I think, to automagically restore the "terra" objects if (a) instead of creating separate packed objects like PackedSpatRaster I would put the data in a slot in the original object (SpatRaster) and (b) any time a SpatRaster is used, a check would be done if it needs to be unpacked. So that means that I would have to add that check to all functions that take a SpatRaster (there are several 100s). A bit of a nightmare but doable. There could also be a huge performance penalty because the unpacking would need to be done every time the object is used (unless it is possible to safely overwrite the object in the global environment). If only there were something like an "onReadRDS" event that could trigger unwrapping...

  2. I have added a method unwrap that could be run after a cached block. It will restore the original object if it is a PackedSpatRaster or PackedSpatVector, and let all other objects pass through. That is much better than using rast because it works in all cases. Whereas if you run rast on a SpatRaster it returns a different object (a template with no cell values). But this would mean that a user would have to remember to call for all terra objects used after a cached block. Not pretty.

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
p <- wrap(r)
unwrap(p)
  1. Allow for setting an unwrap method that is called after restoring the cache. Something in the spirit of
cacherestore : terra::unwrap 

As all objects could pass through, and a user could provide more specialized functions. That would still be a bit of additional work, but probably a nicer user experience than having to use unwrap on all terra objects in blocks after a cached block.

  1. This may be the best and the worst solution. I have made readRDS and unserialize generic functions and now you get the terra version in you load the package (unless another package does the same and it loaded later). I make them do:
base::readRDS("file.rds")  |> unwrap()
# and
base::unserialzie("file.rds")  |> unwrap()

And now I see this:

x <- serialize(r, NULL) 
y <- unserialize(x)
(y <- unserialize(x))
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

And

frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

This could work well with knitr caching if knitr can use saveRDS. However, while saveRDS has a specific signature for the object to be saved (say, a SpatRaster), the signature for readRDS is "character" (the filename). So this seems a bit fragile as other packages could overwrite it. While method overwriting is not uncommon, this one would be hard to spot for a user and they could not fix it with terra::readRDS as they do not call it.

Also, you have to load the package for this to work. It does not work like this in a clean session:

f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)
frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)

This works

terra::saveRDS(r, frds) 
terra::readRDS(frds)

Perhaps an ideal long-term solution would be for base to have a generic unwrap method that is called in readRDS such that packages could implement it for different types of wrapped objects.

@yihui
Copy link
Owner

yihui commented Sep 28, 2022

Thanks for these ideas! For now, I feel the best way might be that I provide an S3 generic function in knitr to process cached objects. Then other package authors like you can register the methods in their own packages. The default method of this function will be simply the identity function, i.e., function(x) x.

@rhijmans
Copy link
Author

Would the downside of that solution be that these packages then would need to depend on knitr?

The same would be needed when creating the cache, unless you can use saveRDS. Using saveRDS would be great, I think because that is very general.

@yihui
Copy link
Owner

yihui commented Sep 29, 2022

I'll use saveRDS().

You don't need to depend on knitr. Here is the trick we've been using to register S3 methods dynamically on load: https://github.com/rstudio/htmltools/blob/5fa01e7197143844be141dcc0cca85096059497d/R/tags.R#L22-L56

@gavril0

This comment was marked as duplicate.

@atusy atusy linked a pull request Apr 26, 2024 that will close this issue
@atusy
Copy link
Collaborator

atusy commented Apr 26, 2024

@rhijmans @yihui
I opened #2340 to solve this issue, although I am not sure if I fully understand the discussion.

This PR adds knit_cache_preprocess and knit_cache_postprocess who can modify objects being saved and loaded, respectively.
Does it meet @rhijmans 's request?

registerS3method("knit_cache_preprocess", "terra", function(x) {
  # do some modifications at here
}, envir = asNamespace("knitr"))
registerS3method("knit_cache_postprocess", "terra", function(x) {
  # do some modifications at here  
}, envir = asNamespace("knitr"))

@cderv cderv moved this from To discuss / To plan to Next / Ready for Dev in R Markdown Team Projects Apr 29, 2024
@cderv
Copy link
Collaborator

cderv commented Aug 26, 2024

@atusy can you update the example above on how the new PR logic with hooks could solve this issue ?

thanks!

@yihui yihui linked a pull request Aug 26, 2024 that will close this issue
@atusy

This comment was marked as outdated.

@cderv

This comment was marked as outdated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants