Add new board_gcs() for Google Cloud Storage #695
Conversation
As I am working through this, I have a few questions that I hope you can help me with @MarkEdmondson1234. The first is about the logging/messaging that googleCloudStorageR does:

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
b %>% pin_write(1:10, "small-numbers", type = "rds")
#> Creating new version '20221220T011855Z-4d1eb'
#> ℹ 2022-12-19 18:18:55 > File size detected as 189 bytes
#>
#> ℹ 2022-12-19 18:18:55 > File size detected as 61 bytes
#>
#> Writing to pin 'small-numbers'
b %>% pin_write(100:110, "big-numbers", type = "json")
#> Creating new version '20221220T011856Z-b89e8'
#> ℹ 2022-12-19 18:18:56 > File size detected as 187 bytes
#>
#> ℹ 2022-12-19 18:18:56 > File size detected as 46 bytes
#>
#> Writing to pin 'big-numbers'
b %>% pin_write(1:10, "small-numbers", type = "json")
#> Creating new version '20221220T011856Z-c3943'
#> ℹ 2022-12-19 18:18:57 > File size detected as 191 bytes
#>
#> ℹ 2022-12-19 18:18:57 > File size detected as 23 bytes
#>
#> Writing to pin 'small-numbers'

Created on 2022-12-19 with reprex v2.0.2

Is there a straightforward way for me to suppress this messaging? An option? I looked around but didn't find it quickly. I think it is getting too noisy, with the pins messages and the googleCloudStorageR messages together.

The second is about how I am supposed to use …

library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
googleCloudStorageR::gcs_list_objects(
bucket = b$bucket,
prefix = "small-numbers"
)
#> name size
#> 1 small-numbers/20221220T011807Z-4d1eb/data.txt 189 bytes
#> 2 small-numbers/20221220T011807Z-4d1eb/small-numbers.rds 61 bytes
#> 3 small-numbers/20221220T011808Z-c3943/data.txt 191 bytes
#> 4 small-numbers/20221220T011808Z-c3943/small-numbers.json 23 bytes
#> 5 small-numbers/20221220T011855Z-4d1eb/data.txt 189 bytes
#> 6 small-numbers/20221220T011855Z-4d1eb/small-numbers.rds 61 bytes
#> 7 small-numbers/20221220T011856Z-c3943/data.txt 191 bytes
#> 8 small-numbers/20221220T011856Z-c3943/small-numbers.json 23 bytes
#> updated
#> 1 2022-12-20 01:18:08
#> 2 2022-12-20 01:18:08
#> 3 2022-12-20 01:18:09
#> 4 2022-12-20 01:18:09
#> 5 2022-12-20 01:18:55
#> 6 2022-12-20 01:18:56
#> 7 2022-12-20 01:18:57
#> 8 2022-12-20 01:18:57 Created on 2022-12-19 with reprex v2.0.2 However, when I try to use library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-dev")
googleCloudStorageR::gcs_list_objects(
bucket = b$bucket,
prefix = "small-numbers",
delimiter = "/"
)
#> ℹ 2022-12-19 18:22:56 > No objects found
#> data frame with 0 columns and 0 rows

Created on 2022-12-19 with reprex v2.0.2

I want to get something more like a list of directories. I can work around when I know the pin name, but I would like to get …
suppressMessages() should do it for the upload feedback. If you need more fine-grained control, it can be made conditional on options(googleAuthR.verbose = 4): the default is 3, and 4 will suppress all messages. The objects aren't really in folders, just names with / in them, so listing folders isn't directly supported. If folder structure is important, perhaps make new buckets? prefix only filters the results coming back to those that begin with x, so you can do a pseudo "list a folder" with "everything that starts with folder/".
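In code, those two approaches would look roughly like this (a sketch reusing the board b from the reprex above; the option name and value come straight from the comment):

# Option 1: silence everything, including the pins messages
suppressMessages(
  b %>% pin_write(1:10, "small-numbers", type = "rds")
)

# Option 2: raise the googleAuthR verbosity threshold so only the
# googleAuthR/googleCloudStorageR messages are silenced
options(googleAuthR.verbose = 4)
b %>% pin_write(1:10, "small-numbers", type = "rds")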
@MarkEdmondson1234 it looks like gcs_get_object() doesn't respect googleAuthR.verbose:

withr::with_options(
list(googleAuthR.verbose = 4),
{
googleCloudStorageR::gcs_auth("~/google-pins.json")
googleCloudStorageR::gcs_get_object(
object_name = "big-numbers/20221220T011808Z-b89e8/big-numbers.json",
bucket = "pins-dev",
saveToDisk = tempfile()
)
}
)
#> ℹ Downloading big-numbers/20221220T011808Z-b89e8/big-numbers.json
#> ✔ Saved big-numbers/20221220T011808Z-b89e8/big-numbers.json to /var/folders/hv/…
#>
#> [1] TRUE

Created on 2022-12-20 with reprex v2.0.2

I believe other functions like …
I'm going to keep …
@MarkEdmondson1234 do you have the bandwidth to do a review of this PR and see if you have any suggestions for better use of the GCS API?
R/board_gcs.R (outdated)
rlang::check_installed("sodium")
path_to_encrypted_json <- fs::path_package("pins", "secret", "pins-gcs-testing.json")
raw <- readBin(path_to_encrypted_json, "raw", file.size(path_to_encrypted_json))
pw <- Sys.getenv("PINS_GCS_PASSWORD", "")
json <- sodium::data_decrypt(
  bin = raw,
  key = sodium::sha256(charToRaw(pw)),
  nonce = secret_nonce()
)
googleCloudStorageR::gcs_auth(json_file = rawToChar(json))
I ended up taking the approach outlined in the gargle vignette "Managing tokens securely", and it seems to be working great!
Nice solution for local testing
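For context, the encrypted file would have been produced by the mirror image of that decryption code. A rough sketch, assuming the same PINS_GCS_PASSWORD key derivation and the secret_nonce() helper used in the diff; the input path and output location are illustrative only:

# Hypothetical encryption step (not part of this PR's diff): encrypt the
# service-account JSON with the same key and nonce the decryption code expects.
plain <- readBin("~/google-pins.json", "raw", file.size("~/google-pins.json"))
pw <- Sys.getenv("PINS_GCS_PASSWORD", "")
enc <- sodium::data_encrypt(
  msg = plain,
  key = sodium::sha256(charToRaw(pw)),
  nonce = secret_nonce()
)
writeBin(enc, "inst/secret/pins-gcs-testing.json")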
pin_list.pins_board_gcs <- function(board, ...) {
  NA
}
Like I discussed earlier in this PR, no pin_list() for the time being, because we don't have good tooling for handling the flat namespace in a bucket. This is different from AWS and Azure. We can try coming back to this later if it is a high-priority issue for folks.
I guess to enable it, this would need to download a list of all objects and then parse through it locally for those ending with /. It's doable, but could get slow with a very large (10k+) number of objects.
You can see how we handle that for AWS and Azure. Both of those rely on underlying functionality already available in paws.storage and AzureStor (ways to identify directory-like structure, common prefixes, etc). I don't think I want to move forward with writing that parsing code here in pins for now; maybe we can collaborate in the future on adding features like this to googleCloudStorageR if it is a high priority for users.
Why can't we use gcs_list_objects() plus a prefix? Because it returns all objects "recursively"?
Yes, that's right:
library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-testing")
googleCloudStorageR::gcs_list_objects(bucket = b$bucket)
#> name size
#> 1 big-numbers/20230110T204627Z-b89e8/big-numbers.json 46 bytes
#> 2 big-numbers/20230110T204627Z-b89e8/data.txt 187 bytes
#> 3 small-numbers/20230110T204626Z-4d1eb/data.txt 189 bytes
#> 4 small-numbers/20230110T204626Z-4d1eb/small-numbers.rds 61 bytes
#> 5 small-numbers/20230110T204628Z-c3943/data.txt 191 bytes
#> 6 small-numbers/20230110T204628Z-c3943/small-numbers.json 23 bytes
#> 7 test-kcH7P3OJjX/20230110T205041Z-6e4eb/data.txt 192 bytes
#> 8 test-kcH7P3OJjX/20230110T205041Z-6e4eb/test-kcH7P3OJjX.rds 44 bytes
#> updated
#> 1 2023-01-10 20:46:28
#> 2 2023-01-10 20:46:28
#> 3 2023-01-10 20:46:26
#> 4 2023-01-10 20:46:27
#> 5 2023-01-10 20:46:29
#> 6 2023-01-10 20:46:29
#> 7 2023-01-10 20:50:41
#> 8 2023-01-10 20:50:41
googleCloudStorageR::gcs_list_objects(bucket = b$bucket, prefix = "big-numbers")
#> name size
#> 1 big-numbers/20230110T204627Z-b89e8/big-numbers.json 46 bytes
#> 2 big-numbers/20230110T204627Z-b89e8/data.txt 187 bytes
#> updated
#> 1 2023-01-10 20:46:28
#> 2 2023-01-10 20:46:28
Created on 2023-01-12 with reprex v2.0.2
In the packages for AWS and Azure, there is existing support for identifying directory-like structures or finding "common prefixes", but googleCloudStorageR doesn't have that as of today.
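Until that exists, a pin_list() method would have to do the parsing locally. A rough sketch of that idea against the listing above (variable names are illustrative, not pins internals):

resp <- googleCloudStorageR::gcs_list_objects(bucket = b$bucket)
# Treat the first path component of every object name as the pin name
pin_names <- vapply(fs::path_split(resp$name), function(x) x[[1]], character(1))
unique(pin_names)
# e.g. for the bucket above: "big-numbers" "small-numbers" "test-kcH7P3OJjX"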
paths <- fs::path_split(unique(fs::path_dir(resp$name)))
version_from_path(map_chr(paths, ~ .x[[length(.x)]]))
This is what I did to deal with the flat namespace situation; suggestions welcome for a nicer option!
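To make those two lines concrete, here is roughly what they do for a couple of object names from the listing above (map_chr swapped for vapply so the snippet is self-contained):

obj_names <- c(
  "big-numbers/20230110T204627Z-b89e8/big-numbers.json",
  "big-numbers/20230110T204627Z-b89e8/data.txt"
)
# Drop the file names, keeping one entry per version directory
dirs <- unique(fs::path_dir(obj_names))
# -> "big-numbers/20230110T204627Z-b89e8"
# Keep only the last path component, which is the version string
vapply(fs::path_split(dirs), function(x) x[[length(x)]], character(1))
# -> "20230110T204627Z-b89e8", which is what version_from_path() receives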
There is the concept of object versioning in GCS; does this match what pin_versions() does? https://cloud.google.com/storage/docs/object-versioning
No, one of the ideas of pins is to provide versioning, even if the data practitioner is not allowed to turn on GCS versioning because of, say, the bucket retention policy.
gcs_download <- function(board, key) {
  path <- fs::path(board$cache, key)
  if (!fs::file_exists(path)) {
    suppressMessages(googleCloudStorageR::gcs_get_object(
This one function doesn't seem to respect googleAuthR.verbose, so for now we'll use suppressMessages().
I guess it's the use of the cli:: message bar. Yes, please open an issue, as it should be a quick fix.
Just opened here: cloudyr/googleCloudStorageR#170
  path
}

gcs_file_exists <- function(board, name) {
I would do this via gcs_get_object("obj_name", meta = TRUE) and then tryCatch(http_404 = FALSE). It will be much quicker and more direct, only inspecting HTTP headers and not content. Custom error types are supported in the HTTP responses to help this workflow. Listing objects could be slow for a large number of pins.
Here's an example from a pull request I did for targets a while ago: https://github.com/MarkEdmondson1234/targets/blob/8204e4268553a2680f5d5ff3b0fc006b0c40d45a/R/utils_gcp.R#L7-22
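Adapted from that targets helper, a sketch of the metadata-only check being suggested; the function name and exact handler are assumptions, and http_404 is the custom condition class mentioned above:

gcs_object_exists <- function(object_name, bucket) {
  tryCatch(
    {
      # Request only the object's metadata; no content is downloaded
      suppressMessages(
        googleCloudStorageR::gcs_get_object(object_name, bucket = bucket, meta = TRUE)
      )
      TRUE
    },
    # A missing object signals an http_404 condition, i.e. it does not exist
    http_404 = function(cnd) FALSE
  )
}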
Ah, I see what you're saying! Hmmm...
It's possible this function could have a better name, but it's not always looking for a specific object; sometimes it is looking for a directory-like thing. Given the information at hand when calling, for example, pin_exists(), I think we do need to list the objects in the bucket rather than get a specific object.
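For contrast, a listing-based check that covers both a single object and a directory-like prefix might look roughly like this (a sketch, not the actual pins internals):

gcs_prefix_exists <- function(board, prefix) {
  # Anything under the prefix counts: a pin, a version, or a single file
  resp <- googleCloudStorageR::gcs_list_objects(
    bucket = board$bucket,
    prefix = prefix
  )
  nrow(resp) > 0
}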
LGTM!
    prefix = paste0(board$prefix, dir, "/")
  )

  if (nrow(resp) == 0) {
I think this is probably not needed, unless when nrow() is 0 there's no name column.
Yep, that is the situation:
library(pins)
googleCloudStorageR::gcs_auth("~/google-pins.json")
b <- board_gcs("pins-testing")
resp <- googleCloudStorageR::gcs_list_objects(
bucket = b$bucket,
prefix = paste0(b$prefix, "cats-and-dogs", "/")
)
#> ℹ 2023-01-12 10:50:27 > No objects found
resp
#> data frame with 0 columns and 0 rows
Created on 2023-01-12 with reprex v2.0.2
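A quick illustration of why the guard stays: with zero rows there is no name column at all, so code that reaches for it would get NULL (a sketch, not pins code):

resp <- data.frame()   # what gcs_list_objects() effectively returns here
resp$name
#> NULL
if (nrow(resp) == 0) character(0) else resp$name
#> character(0)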
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
Closes #572