Add new board_gcs() for Google Cloud Storage #695
Changes from all commits
9cf74ba
a87bd7d
f8fe5ba
70b0866
835952e
6a509c4
76973ae
a919975
8afce68
3677308
a89c865
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,3 +8,4 @@ packages | |
docs | ||
inst/doc | ||
.Renviron | ||
google-pins.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,257 @@ | ||
#' Use a Google Cloud Storage bucket as a board | ||
#' | ||
#' Pin data to a Google Cloud Storage bucket using the googleCloudStorageR | ||
#' package. | ||
#' | ||
#' # Authentication | ||
#' | ||
#' `board_gcs()` is powered by the googleCloudStorageR package which provides | ||
#' several authentication options, as documented in its | ||
#' [main vignette](https://code.markedmondson.me/googleCloudStorageR/articles/googleCloudStorageR.html). | ||
#' The two main options are to create a service account key (a JSON file) or an | ||
#' authentication token; you can manage either using the [gargle](https://gargle.r-lib.org/) package. | ||
#' | ||
#' # Details | ||
#' | ||
#' * The functions in pins do not create a new bucket. You can create | ||
#' a new bucket from R with [googleCloudStorageR::gcs_create_bucket()]. | ||
#' * You can pass arguments for [googleCloudStorageR::gcs_upload] such as | ||
#' `predefinedAcl` and `upload_type` through the dots of `pin_write()`. | ||
#' * `board_gcs()` is powered by the googleCloudStorageR package, which is a | ||
#' suggested dependency of pins (not required for pins in general). If | ||
#' you run into errors when deploying content to a server like | ||
#' <https://shinyapps.io> or [Connect](https://posit.co/products/enterprise/connect/), | ||
#' add `requireNamespace("googleCloudStorageR")` to your app or document for [automatic | |
#' dependency discovery](https://support.posit.co/hc/en-us/articles/229998627-Why-does-my-app-work-locally-but-not-on-my-RStudio-Connect-server). | ||
#' | ||
#' @inheritParams new_board | ||
#' @param bucket Bucket name. You can only write to an existing bucket, and you | ||
#' can use [googleCloudStorageR::gcs_get_global_bucket()] here. | ||
#' @param prefix Prefix within this bucket that this board will occupy. | ||
#' You can use this to maintain multiple independent pin boards within | ||
#' a single GCS bucket. Will typically end with `/` to take advantage of | ||
#' Google Cloud Storage's directory-like handling. | ||
#' @export | ||
#' @examples | ||
#' \dontrun{ | ||
#' board <- board_gcs("pins-testing") | |
#' board %>% pin_write(mtcars) | ||
#' board %>% pin_read("mtcars") | ||
#' | ||
#' # A prefix allows you to have multiple independent boards in the same bucket. | |
#' board_sales <- board_gcs("company-pins", prefix = "sales/") | ||
#' board_marketing <- board_gcs("company-pins", prefix = "marketing/") | ||
#' # You can make the hierarchy arbitrarily deep. | ||
#' | ||
#' # Pass arguments like `predefinedAcl` through the dots of `pin_write`: | ||
#' board %>% pin_write(mtcars, predefinedAcl = "publicRead") | ||
#' } | ||
board_gcs <- function(bucket, | ||
prefix = NULL, | ||
versioned = TRUE, | ||
cache = NULL) { | ||
|
||
check_installed("googleCloudStorageR") | ||
|
||
# Check that we have access to the bucket | |
googleCloudStorageR::gcs_get_bucket(bucket) | ||
|
||
cache <- cache %||% board_cache_path(paste0("gcs-", bucket)) | ||
new_board_v1( | ||
"pins_board_gcs", | ||
name = "gcs", | ||
bucket = bucket, | ||
prefix = prefix, | ||
cache = cache, | ||
versioned = versioned | ||
) | ||
} | ||
|
||
board_gcs_test <- function(...) { | ||
|
||
skip_if_missing_envvars( | ||
tests = "board_gcs()", | ||
envvars = c("PINS_GCS_PASSWORD") | ||
) | ||
|
||
path_to_encrypted_json <- fs::path_package("pins", "secret", "pins-gcs-testing.json") | ||
raw <- readBin(path_to_encrypted_json, "raw", file.size(path_to_encrypted_json)) | ||
pw <- Sys.getenv("PINS_GCS_PASSWORD", "") | ||
json <- sodium::data_decrypt( | ||
bin = raw, | ||
key = sodium::sha256(charToRaw(pw)), | ||
nonce = secret_nonce() | ||
) | ||
googleCloudStorageR::gcs_auth(json_file = rawToChar(json)) | ||
|
||
board_gcs("pins-dev", cache = tempfile(), ...) | ||
} | ||
|
||
## for decrypting JSON for service account: | ||
secret_nonce <- function() { | ||
sodium::hex2bin("cb36bab652dec6ae9b1827c684a7b6d21d2ea31cd9f766ac") | ||
} | ||
|
||
#' @export | ||
pin_list.pins_board_gcs <- function(board, ...) { | ||
NA | ||
} | ||
|
||
#' @export | ||
pin_exists.pins_board_gcs <- function(board, name, ...) { | ||
withr::local_options(list(googleAuthR.verbose = 4)) | ||
gcs_file_exists(board, name) | ||
} | ||
|
||
#' @export | ||
pin_delete.pins_board_gcs <- function(board, names, ...) { | ||
for (name in names) { | ||
check_pin_exists(board, name) | ||
gcs_delete_dir(board, name) | ||
} | ||
invisible(board) | ||
} | ||
|
||
#' @export | ||
pin_versions.pins_board_gcs <- function(board, name, ...) { | ||
check_pin_exists(board, name) | ||
resp <- googleCloudStorageR::gcs_list_objects( | ||
bucket = board$bucket, | ||
prefix = paste0(board$prefix, name) | ||
) | ||
paths <- fs::path_split(unique(fs::path_dir(resp$name))) | ||
version_from_path(map_chr(paths, ~ .x[[length(.x)]])) | ||
Comment on lines +122 to +123
This is what I did to deal with the flat namespace situation; suggestions welcome for a nicer option!
There is the concept of object versioning in GCS, does this match what pin_versions does? https://cloud.google.com/storage/docs/object-versioning
No, one of the ideas of pins is to provide versioning, even if the data practitioner is not allowed to turn on GCS versioning because of, say, the bucket retention policy.
||
} | ||
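The path handling in `pin_versions()` can be tried offline. The following is a minimal sketch, assuming a flat listing of object names laid out as `prefix/name/version/file`. The object names are made up for illustration, and base R `dirname()` and `basename()` stand in for the `fs::path_dir()` and `fs::path_split()` calls used in the diff.

```r
# Hypothetical object names, shaped like a flat bucket listing.
object_names <- c(
  "sales/mtcars/20230112T105027Z-a1b2c/data.txt",
  "sales/mtcars/20230112T105027Z-a1b2c/mtcars.rds",
  "sales/mtcars/20230113T090000Z-d4e5f/data.txt",
  "sales/mtcars/20230113T090000Z-d4e5f/mtcars.rds"
)

# Drop the file name and de-duplicate to get one entry per version directory.
version_dirs <- unique(dirname(object_names))

# The last path component of each directory is the version identifier.
versions <- basename(version_dirs)
versions
```

Each version directory holds several objects, so the `unique()` step is what collapses the listing to one row per version.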
|
||
#' @export | ||
pin_version_delete.pins_board_gcs <- function(board, name, version, ...) { | ||
gcs_delete_dir(board, fs::path(name, version)) | ||
} | ||
|
||
#' @export | ||
pin_meta.pins_board_gcs <- function(board, name, version = NULL, ...) { | ||
withr::local_options(list(googleAuthR.verbose = 4)) | ||
check_pin_exists(board, name) | ||
version <- check_pin_version(board, name, version) | ||
metadata_blob <- fs::path(name, version, "data.txt") | ||
|
||
if (!gcs_file_exists(board, metadata_blob)) { | ||
abort_pin_version_missing(version) | ||
} | ||
|
||
path_version <- fs::path(board$cache, name, version) | ||
fs::dir_create(path_version) | ||
gcs_download(board, metadata_blob) | ||
local_meta( | ||
read_meta(fs::path(board$cache, name, version)), | ||
name = name, | ||
dir = path_version, | ||
version = version | ||
) | ||
} | ||
|
||
#' @export | ||
pin_fetch.pins_board_gcs <- function(board, name, version = NULL, ...) { | ||
withr::local_options(list(googleAuthR.verbose = 4)) | ||
meta <- pin_meta(board, name, version = version) | ||
cache_touch(board, meta) | ||
|
||
for (file in meta$file) { | ||
key <- fs::path(name, meta$local$version, file) | ||
gcs_download(board, key) | ||
} | ||
|
||
meta | ||
} | ||
|
||
#' @export | ||
pin_store.pins_board_gcs <- function(board, name, paths, metadata, | ||
versioned = NULL, x = NULL, ...) { | ||
withr::local_options(list(googleAuthR.verbose = 4)) | ||
ellipsis::check_dots_used() | ||
check_name(name) | ||
version <- version_setup(board, name, version_name(metadata), versioned = versioned) | ||
version_dir <- fs::path(name, version) | ||
gcs_upload_yaml( | ||
board, | ||
fs::path(paste0(board$prefix, version_dir), "data.txt"), | ||
metadata | ||
) | ||
|
||
for (path in paths) { | ||
googleCloudStorageR::gcs_upload( | ||
file = path, | ||
bucket = board$bucket, | ||
name = fs::path(paste0(board$prefix, version_dir), fs::path_file(path)), | ||
... | ||
) | ||
} | ||
|
||
name | ||
} | ||
|
||
#' @rdname board_deparse | ||
#' @export | ||
board_deparse.pins_board_gcs <- function(board, ...) { | ||
bucket <- check_board_deparse(board, "bucket") | ||
expr(board_gcs(!!bucket, prefix = !!board$prefix)) | ||
} | ||
|
||
#' @rdname required_pkgs.pins_board | ||
#' @export | ||
required_pkgs.pins_board_gcs <- function(x, ...) { | ||
ellipsis::check_dots_empty() | ||
"googleCloudStorageR" | ||
} | ||
|
||
# Helpers ----------------------------------------------------------------- | ||
|
||
gcs_delete_dir <- function(board, dir) { | ||
resp <- googleCloudStorageR::gcs_list_objects( | ||
bucket = board$bucket, | ||
prefix = paste0(board$prefix, dir, "/") | ||
) | ||
|
||
if (nrow(resp) == 0) { | ||
I think this is probably not needed unless when
Yep, that is the situation:
library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-testing")
resp <- googleCloudStorageR::gcs_list_objects(
bucket = b$bucket,
prefix = paste0(b$prefix, "cats-and-dogs", "/")
)
#> ℹ 2023-01-12 10:50:27 > No objects found
resp
#> data frame with 0 columns and 0 rows
Created on 2023-01-12 with reprex v2.0.2
||
return(invisible()) | ||
} | ||
|
||
for (path in resp$name) { | ||
googleCloudStorageR::gcs_delete_object(path, bucket = board$bucket) | ||
} | ||
|
||
invisible() | ||
} | ||
|
||
gcs_upload_yaml <- function(board, key, yaml, ...) { | ||
temp_file <- withr::local_tempfile() | ||
yaml::write_yaml(yaml, file = temp_file) | ||
googleCloudStorageR::gcs_upload( | ||
file = temp_file, | ||
bucket = board$bucket, | ||
type = "text/yaml", | ||
name = key, | ||
... | ||
) | ||
} | ||
|
||
gcs_download <- function(board, key) { | ||
path <- fs::path(board$cache, key) | ||
if (!fs::file_exists(path)) { | ||
suppressMessages(googleCloudStorageR::gcs_get_object( | ||
This one function doesn't seem to respect
I guess its the use of the cli:: message bar - yes please open an issue as should be a quick fix
Just opened here: cloudyr/googleCloudStorageR#170
||
object_name = paste0(board$prefix, key), | ||
bucket = board$bucket, | ||
saveToDisk = path | ||
)) | ||
fs::file_chmod(path, "u=r") | ||
} | ||
path | ||
} | ||
|
||
gcs_file_exists <- function(board, name) { | ||
I would do this via
An example here for a pull request I did for targets a while ago https://github.com/MarkEdmondson1234/targets/blob/8204e4268553a2680f5d5ff3b0fc006b0c40d45a/R/utils_gcp.R#L7-22
Ah, I see what you're saying! Hmmm... It's possible this function could have a better name, but it's not always looking for a specific object; sometimes it is looking for a directory-like thing. Given the information at hand when calling, for example,
||
resp <- googleCloudStorageR::gcs_list_objects( | ||
bucket = board$bucket, | ||
prefix = paste0(board$prefix, name) | ||
) | ||
nrow(resp) > 0 | ||
} |
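One property of prefix-based existence checks is worth keeping in mind: a prefix matches every object name that starts with it, so a prefix without a trailing `/` also matches sibling names. The sketch below imitates that matching offline with `startsWith()`. The helper `has_prefix()` and the object names are hypothetical, and whether this matters in practice depends on how pin names are chosen.

```r
# Hypothetical listing: the bucket holds a pin called "mtcars2" but no "mtcars".
object_names <- c(
  "mtcars2/20230112T110000Z-f6a7b/data.txt",
  "mtcars2/20230112T110000Z-f6a7b/mtcars2.rds"
)

# Hypothetical stand-in for "does listing with this prefix return any rows?"
has_prefix <- function(names, prefix) {
  any(startsWith(names, prefix))
}

# Without a trailing slash, the prefix also matches the sibling pin.
loose <- has_prefix(object_names, "mtcars")

# With a trailing slash, only objects under that directory-like name match.
strict <- has_prefix(object_names, "mtcars/")

c(loose = loose, strict = strict)
```

In the diff, `gcs_delete_dir()` appends `/` to its prefix while `gcs_file_exists()` and `pin_versions()` do not, which is the difference this sketch illustrates.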
Like I discussed earlier in this PR, no `pin_list()` for the time being because we don't have good tooling for handling the flat namespace in a bucket. This is different from AWS and Azure. We can try coming back to this later if it is a high priority issue for folks.

I guess to enable it this would need to download a list of all objects, then parse through it locally for those ending with /. It's doable, but could get slow with a very large (10k+) number of objects.

You can see how we handle that for AWS and Azure. Both of those rely on underlying functionality already available in paws.storage and AzureStor (ways to identify directory-like structure, common prefixes, etc). I don't think I want to move forward with writing that parsing code here in pins for now; maybe we can collaborate in the future on adding features like this to googleCloudStorageR if it is a high priority for users.

Why can't we use `gcs_list_objects()` plus a prefix? Because it returns all objects "recursively"?

Yes, that's right:
Created on 2023-01-12 with reprex v2.0.2
In the packages for AWS and Azure, there is existing support for identifying directory-like structures or to find "common prefixes" but googleCloudStorageR doesn't have that as of today.
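To make the "list everything, then parse locally" idea from this thread concrete, here is a minimal sketch of recovering directory-like names from a flat listing. It is not part of pins or googleCloudStorageR: `top_level_dirs()` is a hypothetical helper, the object names are made up, and it works on a plain character vector rather than a live bucket. As noted above, this approach needs the full listing, so it could be slow for very large buckets.

```r
# Hypothetical helper: find the first path component below `prefix`
# for every object that sits inside a directory-like structure.
top_level_dirs <- function(object_names, prefix = "") {
  # Keep only objects under the prefix, then strip the prefix itself.
  inside <- object_names[startsWith(object_names, prefix)]
  rest <- substring(inside, nchar(prefix) + 1)

  # Objects with no "/" left are plain files directly under the prefix.
  has_dir <- grepl("/", rest, fixed = TRUE)

  # Everything before the first "/" is the directory-like name.
  unique(sub("/.*$", "", rest[has_dir]))
}

# Hypothetical flat listing covering two boards that share one bucket.
object_names <- c(
  "sales/mtcars/20230112T105027Z-a1b2c/data.txt",
  "sales/mtcars/20230112T105027Z-a1b2c/mtcars.rds",
  "sales/penguins/20230113T090000Z-d4e5f/data.txt",
  "marketing/leads/20230113T090000Z-0a1b2/data.txt"
)

top_level_dirs(object_names, prefix = "sales/")
```

Calling the helper with the board's prefix would give the candidate pin names for that board; calling it with an empty prefix gives the top-level "directories" of the bucket.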