Plug S3 plugin changes into multi-export S3 origins #1045

Merged: 12 commits from update-s3-exports into PelicanPlatform:main, Apr 5, 2024

Conversation

@jhiemstrawisc (Member) commented Apr 3, 2024

This PR plugs upstream changes in the S3 plugin for XRootD into Pelican. Under the new setup, here are a few example ways to configure an S3 origin:

# Set up an origin using the full exports block
Origin:
  StorageType: "s3"
  # Some S3 config options remain top-level
  S3UrlStyle: <path or virtual>
  S3Region: us-east-1
  S3ServiceUrl: https://my-s3-url.com  # e.g. https://s3dev.chtc.wisc.edu
  Exports:
    # Other S3 configuration moves into the exports block
    - S3Bucket: first-bucket
      # This bucket requires auth
      S3AccessKeyfile: /path/to/access.key
      S3SecretKeyfile: /path/to/secret.key
      FederationPrefix: /first/prefix
      Capabilities: [ some capabilities here ]
    - S3Bucket: second-bucket
      FederationPrefix: /second/prefix
      Capabilities: [ some capabilities here ]
# Set up an origin using the old top-level config options
Origin:
  StorageType: "s3"
  S3UrlStyle: <path or virtual>
  S3Region: us-east-1
  S3ServiceUrl: https://my-s3-url.com  # e.g. https://s3dev.chtc.wisc.edu
  S3Bucket: my-bucket
  # This bucket requires auth
  S3AccessKeyfile: /path/to/access.key
  S3SecretKeyfile: /path/to/secret.key
  FederationPrefix: /my/prefix
  # Various capabilities omitted for brevity
# Please don't do it this way, but in theory it's possible. ExportVolumes is only really meant for CLI/env var compat.
Origin:
  StorageType: "s3"
  S3UrlStyle: <path or virtual>
  S3Region: us-east-1
  S3ServiceUrl: https://my-s3-url.com  # e.g. https://s3dev.chtc.wisc.edu
  # These buckets require auth
  S3AccessKeyfile: /path/to/access.key
  S3SecretKeyfile: /path/to/secret.key
  ExportVolumes:
    - "my-bucket:/first/prefix"
    - "different-bucket:/second/prefix"
  # Various capabilities omitted for brevity

And finally, from the command line:

pelican origin serve -m s3 -v my-bucket:/my/prefix --service-url https://s3.us-east-1.amazonaws.com --region us-east-1 --bucket-access-keyfile /path/to/access.key --bucket-secret-keyfile /path/to/secret.key

Here's something funky I did (open to negative feedback on it):
Right now we have an origin that exports all of AWS public data. To make something similar possible under this new setup, we need a way to tell Pelican to export an entire S3 endpoint, potentially without knowing all the buckets (many thousands in the case of AWS open data). I handled this by deciding that when no bucket is provided, the entire endpoint is exported and ALL buckets at the endpoint are assumed to be public. For example, this config will export all of AWS open data:

Origin:
  StorageType: "s3"
  S3UrlStyle: "path"
  S3Region: "us-east-1"
  S3ServiceUrl: https://s3.us-east-1.amazonaws.com
  Exports:
    - FederationPrefix: /aws-open-data
      Capabilities: ["PublicReads", "Writes", "Listings", "DirectReads"]

The peculiarity in this setup is that, unlike other S3 exports where the bucket name is abstracted away from the user by FederationPrefix, here objects are accessed via /aws-open-data/<bucket>/<object>.

For example, my usual test file is the MD5SUMS file from the noaa-wod-pds bucket. In this setup, it comes from /aws-open-data/noaa-wod-pds/MD5SUMS.
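
With the Pelican client, fetching that file through a federation might look like the following (a sketch; the federation discovery host here is hypothetical):

pelican object copy pelican://federation.example.org/aws-open-data/noaa-wod-pds/MD5SUMS ./MD5SUMS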

@jhiemstrawisc jhiemstrawisc requested a review from turetske April 3, 2024 21:50
@jhiemstrawisc (Member, Author)

@brianhlin This is the PR I indicated I'd like you to glance over with an eye toward "how do people actually configure S3 origins". The PR description should have the various setup bits, and I think you know how all of that gets converted to env vars. Let me know if you see any issues!
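
For reference, assuming Pelican's usual mapping of config keys to environment variables (PELICAN_ prefix, nested keys joined with underscores; the mapping details are my assumption here, not something this PR defines), the top-level options from the second example would come out roughly as:

export PELICAN_ORIGIN_STORAGETYPE="s3"
export PELICAN_ORIGIN_S3REGION="us-east-1"
export PELICAN_ORIGIN_S3SERVICEURL="https://my-s3-url.com"
export PELICAN_ORIGIN_S3BUCKET="my-bucket"
export PELICAN_ORIGIN_S3ACCESSKEYFILE="/path/to/access.key"
export PELICAN_ORIGIN_S3SECRETKEYFILE="/path/to/secret.key"
export PELICAN_ORIGIN_FEDERATIONPREFIX="/my/prefix"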

@brianhlin (Contributor)

Nothing jumps out at me immediately but I think I need a little bit to digest the configs.

> I achieved this by deciding that to export an entire S3 endpoint, no bucket should be provided, and we assume ALL buckets at the endpoint are public

Seems potentially dangerous. Maybe we should have folks opt into this with a special config option?

@turetske (Collaborator) commented Apr 4, 2024

> I achieved this by deciding that to export an entire S3 endpoint, no bucket should be provided, and we assume ALL buckets at the endpoint are public.

What happens if this is done and a bucket at the endpoint isn't public?

@jhiemstrawisc (Member, Author)

> > I achieved this by deciding that to export an entire S3 endpoint, no bucket should be provided, and we assume ALL buckets at the endpoint are public.
>
> What happens if this is done and a bucket at the endpoint isn't public?

There are no associated S3 credentials, so everything is prevented by lack of authentication.

@brianhlin (Contributor)

This config makes it look like S3 and POSIX origins are going to be mutually exclusive. Is that intentional?

> There are no associated S3 credentials, so everything is prevented by lack of authentication.

Even so, it feels like we're entering footgun territory here.

@jhiemstrawisc (Member, Author)

> This config makes it look like S3 and POSIX origins are going to be mutually exclusive. Is that intentional?

Yep, that's always been the case. We can operate in either S3 mode or POSIX mode.
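
For contrast, a POSIX origin uses the same shape of config with a local path instead of a bucket. A minimal sketch (StoragePrefix as the key name is my assumption here, not something defined in this PR):

Origin:
  StorageType: "posix"
  Exports:
    - StoragePrefix: /path/on/local/disk
      FederationPrefix: /my/prefix
      Capabilities: ["PublicReads", "Listings"]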

@turetske (Collaborator) left a comment

Is there any way to get an end-to-end S3 test going with both a configuration-file setup and a command-line setup? Or is that really not available because we don't have a good way of setting up a test S3 origin?

Review thread on server_utils/origin_test.go (resolved)
@jhiemstrawisc (Member, Author)

> Is there any way to get an end-to-end S3 test going with both a configuration-file setup and a command-line setup? Or is that really not available because we don't have a good way of setting up a test S3 origin?

I'm happy to write one, but I don't know the best way to set up an S3 endpoint in the process. I can take a peek at programmatically creating a bucket and S3 credentials in Minio, but wasn't able to get that working when I tried previously. One alternative is to spin up the origin and point it at an AWS open data bucket, but that sounds like a test that's asking for trouble. If you think it's still worth the risk, I'll set it up.
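
For the record, the bucket-creation half of that might look like the following with the minio-go client (a sketch against a local Minio instance; the endpoint and credentials are placeholders, and generating scoped per-bucket credentials would additionally need Minio's admin API, which I leave aside):

package main

import (
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	ctx := context.Background()
	// Connect to a local Minio instance with its root credentials (placeholders).
	client, err := minio.New("localhost:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("minioadmin", "minioadmin", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}
	// Create a test bucket, tolerating the case where it already exists.
	if err := client.MakeBucket(ctx, "test-bucket", minio.MakeBucketOptions{Region: "us-east-1"}); err != nil {
		if exists, errCheck := client.BucketExists(ctx, "test-bucket"); errCheck != nil || !exists {
			log.Fatal(err)
		}
	}
	log.Println("test bucket ready")
}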

@turetske (Collaborator) commented Apr 5, 2024

> > Is there any way to get an end-to-end S3 test going with both a configuration-file setup and a command-line setup? Or is that really not available because we don't have a good way of setting up a test S3 origin?
>
> I'm happy to write one, but I don't know the best way to set up an S3 endpoint in the process. I can take a peek at programmatically creating a bucket and S3 credentials in Minio, but wasn't able to get that working when I tried previously. One alternative is to spin up the origin and point it at an AWS open data bucket, but that sounds like a test that's asking for trouble. If you think it's still worth the risk, I'll set it up.

That's fair. I guess I would like confirmation from you that you've tested the following locally (sketched just below the list):

  1. Both the command-line and configuration-file setups for S3, with the bucket name provided and not provided
  2. Pulling the data via a cache that has access to an S3 origin (one with the bucket name provided, one without)
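
Concretely, those local checks might look something like this (a sketch; the config-file path is a placeholder and I'm assuming the global --config flag, so treat the exact invocation as illustrative):

# 1) configuration-file setup (bucket name provided or omitted in the YAML)
pelican --config ./pelican.yaml origin serve

# 2) command-line setup with a bucket name provided
pelican origin serve -m s3 -v my-bucket:/my/prefix --service-url https://s3.us-east-1.amazonaws.com --region us-east-1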

@jhiemstrawisc (Member, Author)

Okay, I got rid of the Minio dependency and pointed the origin tests at an AWS endpoint serving historical data (which, being historical, should never change). That allowed me to test various configuration setups and make sure the file pulled had the correct contents.
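
Stripped to its essence, that check is download-and-compare. A minimal sketch of the verification step (the direct S3 URL is a stand-in for whatever endpoint the test actually hits):

package main

import (
	"crypto/md5"
	"fmt"
	"io"
	"log"
	"net/http"
)

// fetchMD5 downloads a URL and returns the hex MD5 of its contents.
func fetchMD5(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	h := md5.New()
	if _, err := io.Copy(h, resp.Body); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	// Fetch the historical test file and print its checksum for comparison.
	sum, err := fetchMD5("https://noaa-wod-pds.s3.amazonaws.com/MD5SUMS")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md5:", sum)
}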

@turetske (Collaborator) left a comment

Mostly questions I want clarified.

Review thread on cmd/origin.go (resolved)
Review thread on server_utils/origin.go (resolved)
Review thread on server_utils/origin.go (resolved)
@jhiemstrawisc jhiemstrawisc force-pushed the update-s3-exports branch 2 times, most recently from 0262eaa to 312dc9f, April 5, 2024 19:19
@jhiemstrawisc jhiemstrawisc requested a review from turetske April 5, 2024 19:19
@turetske (Collaborator) left a comment

Very tentatively approving, assuming all the local tests pass.

If everything fails after merging even with the dev-container changes, revert the PR.

@jhiemstrawisc (Member, Author)

Just adding breadcrumbs here in case we end up needing them -- we're merging even though CI isn't passing. These tests don't actually run against the container changes I've also set up in this PR to install the new upstream S3 plugin version, and there's no way to make them do so unless we split the container changes into a separate PR and merge that first.

The problem is that the upstream changes are breaking (even though we keep the configuration in Pelican the same), so merging the updated container on its own would break lots of other stuff. All the tests for this PR currently pass when I run them locally. Famous last words!

@jhiemstrawisc jhiemstrawisc merged commit c0a2083 into PelicanPlatform:main Apr 5, 2024
18 of 19 checks passed