Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support to explicitly specify a CPU architecture for the to-be-copied image instead of relying on auto-detection #961

Merged
merged 3 commits into from
Dec 23, 2024

Conversation

xiongbo-sjtu
Copy link
Collaborator

@xiongbo-sjtu xiongbo-sjtu commented Nov 29, 2024

What's the problem?

It's observed that when copying a multi-arch source image from location A to B in ECR, we always end up with amd64 in the destination image at location B. The hypothesis is that the underlying lambda function used to auto-detect the architecture is backed up by Dockerfile with a hard-coded runtime.GOARCH.

Proposed Solution

Instead of solely relying on an auto-detection mechanism, this PR is meant to add a feature to allow users to explicitly specify a CPU architecture for the to-be-copied image, by triggering the underlying logic here. This will unblock the usage of Graviton instances for Amazon SageMaker and other computing platforms.

As a side note, I've run npm run release on my dev box to refresh API.md

@xiongbo-sjtu xiongbo-sjtu changed the title Specify CPU architecture for both image and Lambda feat: Add support to specify CPU architecture for both image and Lambda Nov 29, 2024
auto-merge was automatically disabled November 29, 2024 04:28

Head branch was pushed to by a user without write access

@xiongbo-sjtu xiongbo-sjtu changed the title feat: Add support to specify CPU architecture for both image and Lambda feat: Add support to specify CPU architecture for the to-be-copied image Nov 29, 2024
lambda/main_test.go Outdated Show resolved Hide resolved
lambda/main_test.go Outdated Show resolved Hide resolved
src/index.ts Outdated
/**
* The image architecture to be copied.
*
* @default - ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please explain what this default means in the doc string

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

src/index.ts Outdated
*
* @default - ""
*/
readonly imageArch?: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably lean towards this. What do you think?

Suggested change
readonly imageArch?: string;
readonly architecture?: string;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imageArch is intentionally used to avoid any potential confusion with the architecture field referenced by lambda.SingletonFunction in the ECRDeployment class.

Currently, our Dockerfile uses a hard-coded architecture GOARCH=amd64. If we'd like to allow customers to control what's used for the lambda function, we can uplift such a constraint and then expose lambdaArch as a sibling field of imageArch.

Hope that makes sense.

Copy link
Contributor

@rix0rrr rix0rrr Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't really explain the problem and the solution in the PR body, so I have to ask about it here to properly understand the motivation and the solution. I'll summarize to see if I understand correctly:

  • This Lambda is copying and image from location A to location B.
  • I know that an image can contain multiple architectures.
  • In this PR, we are making it possible to copy only one of the architectures in that image over.

Here are my questions:

  • imageArch?: string makes it so we can only copy 1 architecture over. Why not multiple? (imageArch?: string[])
  • What is the motivation? It seems like the most logical behavior of this construct would be to copy all architectures over. So either that is not the behavior, or this PR is intended to save space and time for the copies, by only copying a subset out of potentially dozens of images.
    • If the behavior is NOT that all architectures are copied, then wouldn't that be a more logical and safer fix than having people specify their image architecture in Yet Another Place? (Potentially still with the subsetting behavior if we want to save on space and time).

Again, it would have helped me a lot if the line of thought had been spelled out more clearly in the PR body, because I'm left guessing a bit here.

Copy link
Collaborator Author

@xiongbo-sjtu xiongbo-sjtu Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great comment! I've updated the PR description (copied below) to provide more clarity.

Regarding why we don't use string[], the reason is that the underlying containers/image lib doesn't give us such an option, as shown here.

Note that the current system behavior is that the destination image is always amd64 even if it's copied from a multi-arch source image. For example, if the multi-arch source image supports both amd64 and arm64, this PR will give users the flexibility to choose between those two, instead of having to live with amd64.

===========================================================================

What's the problem?

It's observed that when copying a multi-arch source image from location A to B in ECR, we always end up with amd64 in the destination image at location B. The hypothesis is that the underlying lambda function used to auto-detect the architecture is backed up by Dockerfile with a hard-coded runtime.GOARCH.

Proposed Solution

Instead of solely relying on an auto-detection mechanism, this PR is meant to add a feature to allow users to explicitly specify a CPU architecture for the to-be-copied image, by triggering the underlying logic here. This will unblock the usage of Graviton instances for Amazon SageMaker and other computing platforms.

Copy link
Contributor

@rix0rrr rix0rrr Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of view of a user, this documentation string is at worst lying, and at best unhelpful:

 @default - the underlying lambda auto-detects the relevant architecture (e.g., amd64, arm64)
   * https://github.com/containers/image/blob/main/internal/pkg/platform/platform_matcher.go#L161

This description is true for the library that we use. But from the point of view of a CDK user, they don't control the platform of the Lambda that runs the library. That library will always run on amd64, because we chose what platform to build and run the Lambda for, and we always pick amd64 because that's just who we are 😉 . So the library will helpfully autodetect amd64 in all cases, and there's nothing that CDK user can do about it.

We should just state that the default if unspecified is amd64.

Question: what happens if the target image only has an arm64 variant and not an amd64 viarant, and we leave the default off? Will the library then copy the one variant that's there, or will it fail because amd64 doesn't exist?

Copy link
Contributor

@rix0rrr rix0rrr Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding why we don't use string[], the reason is that the underlying containers/image lib doesn't give us such an option, as shown here.

I am looking at the documentation of this package, and I can at least see a type ImageListSelection that seems to indicate it's possible to copy a list of images, and a GitHub issue saying it's currently possible to copy all architectures.

https://pkg.go.dev/github.com/containers/image/v5/copy#Image

Do these things not actually work? (I'm not up on all the intricacies of Docker but I hope you are)

Copy link
Contributor

@mrgrain mrgrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline comments.

@xiongbo-sjtu xiongbo-sjtu force-pushed the main branch 4 times, most recently from d5cf620 to 3643177 Compare November 29, 2024 20:02
@xiongbo-sjtu xiongbo-sjtu requested a review from mrgrain November 29, 2024 20:26
@jiayiwang7 jiayiwang7 assigned mrgrain and rix0rrr and unassigned mrgrain Dec 16, 2024
src/index.ts Outdated Show resolved Hide resolved
src/index.ts Outdated
*
* @default - ""
*/
readonly imageArch?: string;
Copy link
Contributor

@rix0rrr rix0rrr Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't really explain the problem and the solution in the PR body, so I have to ask about it here to properly understand the motivation and the solution. I'll summarize to see if I understand correctly:

  • This Lambda is copying and image from location A to location B.
  • I know that an image can contain multiple architectures.
  • In this PR, we are making it possible to copy only one of the architectures in that image over.

Here are my questions:

  • imageArch?: string makes it so we can only copy 1 architecture over. Why not multiple? (imageArch?: string[])
  • What is the motivation? It seems like the most logical behavior of this construct would be to copy all architectures over. So either that is not the behavior, or this PR is intended to save space and time for the copies, by only copying a subset out of potentially dozens of images.
    • If the behavior is NOT that all architectures are copied, then wouldn't that be a more logical and safer fix than having people specify their image architecture in Yet Another Place? (Potentially still with the subsetting behavior if we want to save on space and time).

Again, it would have helped me a lot if the line of thought had been spelled out more clearly in the PR body, because I'm left guessing a bit here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the assertions in the tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike other typescript files under the test folder, example.ecr-deployment.ts doesn't have a single line of expect statement for assertion. My understanding is that it's a smoke test to check if anything chokes up when we interact with the ECRDeployment class with different flavors of arguments.

@xiongbo-sjtu xiongbo-sjtu force-pushed the main branch 3 times, most recently from 501271a to f97986a Compare December 18, 2024 20:18
@xiongbo-sjtu xiongbo-sjtu changed the title feat: Add support to specify CPU architecture for the to-be-copied image feat: Add support to explicitly specify a CPU architecture for the to-be-copied image instead of relying on auto-detection Dec 18, 2024
…-be-copied image instead of relying on auto-detection.
@rix0rrr rix0rrr added this pull request to the merge queue Dec 23, 2024
Merged via the queue into cdklabs:main with commit fbab401 Dec 23, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants