Super fast checking of cloud targets #1181

Merged
merged 50 commits into main from 1172 on Nov 13, 2023

Conversation

wlandau (Member) commented Nov 13, 2023

Related GitHub issues and pull requests: #1172

Summary

Before this PR, targets was slow to check the status of cloud targets: each target needed its own individual HEAD request to determine whether the object in the bucket was up to date. As of this PR, targets uses a LIST request to get all the hashes under the prefix ahead of time, then looks up each target's hash in that list.
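
For a concrete picture of the difference, here is a rough sketch using the paws.storage S3 client. This is not the actual targets internals, and the bucket name is a placeholder:

library(paws.storage)
s3 <- s3()

# Old approach: one HEAD request per target just to read its hash.
head_etag <- function(key) {
  s3$head_object(Bucket = "my-bucket", Key = key)$ETag
}

# New approach: one paginated LIST request per prefix downloads every
# ETag up front, so each target's status check becomes a local lookup.
list_etags <- function(prefix) {
  etags <- character(0)
  token <- NULL
  repeat {
    page <- s3$list_objects_v2(
      Bucket = "my-bucket",
      Prefix = prefix,
      ContinuationToken = token
    )
    keys <- vapply(page$Contents, function(x) x$Key, character(1))
    etags[keys] <- vapply(page$Contents, function(x) x$ETag, character(1))
    if (!isTRUE(page$IsTruncated)) break
    token <- page$NextContinuationToken
  }
  etags
}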

The benchmark results are exciting! I tested performance in a pipeline with 1000 file targets and about 1000 regular targets:

# _targets.R file:
library(targets)

tar_option_set(
  repository = "aws",
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "CENSORED",
      prefix = "_targets"
    )
  )
)

write_file <- function() {
  path <- tempfile()
  writeLines("line", path)
  path
} 

list(
  tar_target(index, seq_len(1000)),
  tar_target(object, "object", pattern = map(index)),
  tar_target(file, write_file(), pattern = map(index), format = "file")
)

When this pipeline is up to date, tar_outdated() should take almost no time at all. Previously, it took about 4 minutes and 6.775 seconds. With this PR, it took only 3.805 seconds: a 64-fold speedup! For comparison, the unwise shortcut of cue = tar_cue(file = FALSE) took 1.162 seconds, which is not much faster.
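
The exact benchmarking code is not shown in this PR, but a timing like the one above amounts to something along these lines:

# After tar_make() completes, every target is up to date, so
# tar_outdated() spends its time purely on status checks.
system.time(tar_outdated())

# The shortcut comparison: skip the file/object checks entirely.
tar_option_set(cue = tar_cue(file = FALSE))
system.time(tar_outdated())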

Everything is ready as far as AWS is concerned. I just need to finish benchmarking on Google Cloud.

FYI @noamross

wlandau self-assigned this Nov 13, 2023
wlandau (Member, Author) commented Nov 13, 2023

This PR does have a couple of side effects that may require adjustment:

  1. To get the hashes of cloud targets efficiently, I needed to switch from local xxhash64 hashes to the ETags and MD5 hashes that the cloud services provide in LIST requests. Unfortunately, this will invalidate existing targets in existing pipelines. But the upcoming release also has similarly disruptive changes with regard to RNG seeds, so I think this change to hashes is coming at the right time. (See the sketch after this list.)
  2. If you roll back the version of the metadata file to access historical versions of existing targets, functions like tar_read() and tar_destroy() will still use the version ID in the local metadata. However, unless you use tar_cue(file = FALSE), tar_make() and tar_outdated() now disregard the version ID when they check whether each target is up to date. I needed to do this because LIST requests do not provide version IDs, so users will need to adjust their expectations. However, I think this is a positive change from a philosophical perspective: for a target to be up to date, the local record of the data should agree with the current latest version in the bucket.
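
To illustrate point 1: the old local hash and the new cloud-provided hash are different algorithms entirely, so no existing metadata entry can match the new hashes. A hypothetical comparison, assuming the digest package for the old xxhash64 computation:

writeLines("line", "data.txt")

# Old: local xxhash64 hash of the file.
digest::digest("data.txt", algo = "xxhash64", file = TRUE)

# New: for a single-part upload, the S3 ETag is the MD5 of the content.
tools::md5sum("data.txt")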

noamross commented Nov 13, 2023

Hmm, so it sounds like my in-development tar_read_version() will work fine, but something like tar_make_version(), which would be pretty similar to withr::with_dir(REPOSITORY_CHECKED_OUT_TO_OLD_VERSION, tar_make()), would no longer work, or at least would not be able to draw on old versions of the targets in the cloud repository. Is that correct?

wlandau merged commit 4bc6db4 into main Nov 13, 2023
13 checks passed
wlandau deleted the 1172 branch November 13, 2023 18:46
wlandau (Member, Author) commented Nov 13, 2023

Hmm, so it sounds like my in-development tar_read_version() will work fine...

Yeah, the tar_read_version() you describe at #1175 should definitely still work.

but something like tar_make_version(), which would be pretty similar to withr::with_dir(REPOSITORY_CHECKED_OUT_TO_OLD_VERSION, tar_make()), would no longer work, or at least would not be able to draw on old versions of the targets in the cloud repository. Is that correct?

If a tar_make_version() function assumed old AWS version IDs could be up to date, then yes, it would no longer work. Does something like this already exist? After reflecting on #1172, I'm not sure I would recommend that approach: tar_make()-like functions are supposed to bring the pipeline up to date in a general sense, and an archived version of an AWS S3 object is outdated by definition, regardless of anything to do with targets.
