[New ephemeral]: aws_eks_cluster_auth should be turned into an ephemeral resource #40343

Open
erpel opened this issue Nov 28, 2024 · 18 comments · May be fixed by #40660
Labels
new-ephemeral-resource Introduces a new ephemeral resource. service/eks Issues and PRs that pertain to the eks service.

Comments

@erpel
Contributor

erpel commented Nov 28, 2024

Description

The data source aws_eks_cluster_auth causes the token to be saved in the plan, where it can expire before the apply. This is discussed in #13189. The new ephemeral resources in Terraform 1.10 should address this perfectly:

Requested Resource(s) and/or Data Source(s)

  • ephemeral aws_eks_cluster_auth

Potential Terraform Configuration

data "aws_eks_cluster" "example" {
  name = "example"
}

ephemeral "aws_eks_cluster_auth" "example" {
  name = "example"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.example.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.example.certificate_authority[0].data)
  token                  = ephemeral.aws_eks_cluster_auth.example.token
}

References

Original issue: #13189.

Documentation:

Example of other ephemeral resources in terraform-provider-aws:

Would you like to implement a fix?

None


Community Note

Voting for Prioritization

  • Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
  • Please see our prioritization guide for information on how we prioritize.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

  • If you are interested in working on this issue, please leave a comment.
  • If this would be your first contribution, please review the contribution guide.

@github-actions github-actions bot added service/eks Issues and PRs that pertain to the eks service. needs-triage Waiting for first response or review from a maintainer. labels Nov 28, 2024
@johnsonaj johnsonaj added new-ephemeral-resource Introduces a new ephemeral resource. and removed needs-triage Waiting for first response or review from a maintainer. labels Dec 2, 2024
@bryantbiggs
Contributor

Shouldn't you use the exec() method of the respective provider (kubernetes, helm, etc.)? That solves two problems: no credentials persisted to the state file, and no stale credentials.

@erpel
Contributor Author

erpel commented Dec 5, 2024

Using the exec parameter does solve the issue. I imagine that this is what many do today, including my organization. It does add reliance on an additional executable, though, and therefore reduces portability. The executable is another component that needs to be maintained in CI environments where it might otherwise not be needed.
I believe implementing this using ephemeral resources would be worthwhile.

@bryantbiggs
Contributor

bryantbiggs commented Dec 5, 2024

Do users (humans) access your clusters today (i.e., via kubectl)? And what do you use for CI today?

@TechIsCool

At a different company, but we have a thin wrapper around kubectl and use both GitHub Actions and Jenkins to manage our clusters. We would love to see this supported, as we have had tokens expire multiple times on the first build of a heavy EKS cluster that takes hours to provision.

@bryantbiggs
Contributor

If you're using kubectl, are you using aws eks update-kubeconfig --name xxx as well?

@gamunu

gamunu commented Dec 11, 2024

@bryantbiggs Running the AWS CLI is not possible in managed services like Terraform Cloud.

@erpel
Contributor Author

erpel commented Dec 11, 2024

Users do have access using kubectl and whatever else they like to use that works with the provided kubeconfig files. We're not using the AWS CLI to manage kubeconfig files, though not for technical reasons, IIRC.

CI is mainly Atlantis plus some pipelines running on GitLab CI. Both can be made to work with additional tooling, but the simpler solution is preferable to me.

@bschaatsbergen
Member

bschaatsbergen commented Dec 20, 2024

Hi!

Thank you for reporting this issue. Terraform 1.10 supports referencing ephemeral resource attributes directly in provider configurations. Having an ephemeral variant of aws_eks_cluster_auth would improve the security posture of Terraform users working with Amazon EKS and the Kubernetes or Helm provider, as the temporarily obtained IAM token would no longer be persisted to state.

If I recall correctly, both the Helm and Kubernetes providers offer an exec block that allows you to obtain a token by invoking a binary at runtime, and I believe this is considered a good practice nowadays too (I'm no Amazon EKS expert, unlike @bryantbiggs).

The exec block typically looks like this:

exec {
  api_version = "client.authentication.k8s.io/v1beta1"
  args        = ["eks", "get-token", "--cluster-name", data.aws_eks_cluster.cluster.name]
  command     = "aws"
}

However, this isn’t very native to Terraform, as CI/CD environments must manage an additional binary and ensure it’s available at runtime—one more thing to keep track of!

I’m not a subject matter expert on the eks command-line utility, but based on its implementation https://github.com/aws/aws-cli/blob/develop/awscli/customizations/eks/get_token.py#L129, it seems quite similar to ours. Do you know if there are any differences, @bryantbiggs? I’d be happy to dive deeper into this as we should weigh these things carefully to provide the most secure and robust authentication mechanism possible.

FWIW, using an ephemeral resource:

ephemeral "aws_eks_cluster_auth" "example" {
  name = data.aws_eks_cluster.example.id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.example.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.example.certificate_authority[0].data)
  token                  = ephemeral.aws_eks_cluster_auth.example.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.example.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.example.certificate_authority[0].data)
    token                  = ephemeral.aws_eks_cluster_auth.example.token
  }
}

Over in #40660 I'm prototyping this, and will provide an update soon. Thanks again!

@bryantbiggs
Contributor

In the respective providers, the exec() method is documented as the preferred approach for managed offerings with short token expirations: https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs#exec-plugins

The implementation looks similar to what is done in the awscli, but the risk is that the implementation changes while the interface doesn't. The contract to users is that aws eks update-kubeconfig ... is stable, but what happens in the background is an implementation detail that is free to change, and most likely will across our offerings of EKS, local EKS clusters on Outposts, etc.

@lorengordon
Contributor

lorengordon commented Dec 20, 2024

The exec method is also irritating in that it creates a tight coupling between the environment's shell setup for aws profiles and the kubernetes provider. I often have several aws providers and lots of profiles, and a different aws credential in my shell than what is being used by the desired aws provider. So, no. I don't care that shelling out with exec is the "preferred approach". It's wrong, and a bad option. The real problem is that there isn't yet a mechanism for renewing the credential via the aws_eks_cluster_auth data source and the kubernetes and helm providers. At least making the data source ephemeral buys us a little time, since the value would update between plan-time vs apply-time.

@bryantbiggs
Contributor

That just sounds like you need to rethink your configs and maybe refactor a bit

@lorengordon
Contributor

lorengordon commented Dec 20, 2024

That just sounds like you need to rethink your configs and maybe refactor a bit

Negative. The configs are fine. It's just a lot of environments and customers. Please don't deflect.

@bryantbiggs
Contributor

So why are profiles causing an issue - you can specify a profile in the AWS provider and you can pass "--profile", "foo" to the exec()

I'm not seeing the issue

@bryantbiggs
Contributor

At least making the data source ephemeral buys us a little time, since the value would update between plan-time vs apply-time.

This doesn't make sense.

The difference between the exec() in the provider versus the data source is that with the data source, the GET request is made ASAP, which may be long before it's needed. With the provider's exec(), the GET request isn't made until the first resource for that provider is encountered in the graph.

To give a more concrete example: cluster upgrades. If you use the static token route, you are almost guaranteed to hit the expired-token issue, because the control plane takes some time to update (8-10 minutes in the case of EKS), plus some additional time for node groups/etc., before you get to the point where the token is actually used for Kubernetes/Helm resources. If you use the exec() method, that token isn't requested until after the control plane has upgraded, at the point where it is first needed.
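
To make that timing concrete, here is a minimal sketch of the upgrade scenario (the cluster name, version, variables, and the trailing config map are illustrative only, not taken from this thread):

variable "cluster_role_arn" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_eks_cluster" "example" {
  name     = "example"
  role_arn = var.cluster_role_arn
  version  = "1.31" # bumping this triggers a control-plane upgrade that takes roughly 8-10 minutes

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# With the plain data source, this token is fetched as soon as the data source is read,
# typically long before any Kubernetes resource is applied.
data "aws_eks_cluster_auth" "example" {
  name = aws_eks_cluster.example.name
}

provider "kubernetes" {
  host                   = aws_eks_cluster.example.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.example.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.example.token
}

# Only reached after the cluster upgrade (and any node group updates) finish; a token
# obtained up front, valid for roughly 15 minutes, has likely expired by this point.
resource "kubernetes_config_map" "app" {
  metadata {
    name = "app"
  }

  data = {
    version = "v1"
  }
}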

@lorengordon
Contributor

lorengordon commented Dec 20, 2024

So why are profiles causing an issue - you can specify a profile in the AWS provider and you can pass "--profile", "foo" to the exec()

That's the exact definition of the tight coupling I'm talking about. Everyone (and every CI service) that runs the config then needs that exact named profile defined in their shell config (and of course, as everyone else is complaining about, also needs the aws cli installed). That doesn't work for teams larger than a single person, and it's all in place of simply chaining from the aws provider config, which already has loads of options for resolving an AWS credential.

At least making the data source ephemeral buys us a little time, since the value would update between plan-time vs apply-time.

This doesn't make sense.

It does make sense. The ephemeral data source resolves one credential at plan time, then executes again at apply time and resolves another credential with a new expiration period. For execution models that save the plan and then apply that saved plan at a future time, this is a very big deal.

@bryantbiggs
Contributor

So you said:

I often have several aws providers and lots of profiles, and a different aws credential in my shell than what is being used by the desired aws provider.

So you are already "tightly coupled" between your AWS profiles and the AWS provider, no? Because you have to specify the profile you want to use on the respective provider, such as:

provider "aws" {
  profile = "foo"
  region  = "us-east-1"
}

With an exec provider that looks like:

provider "kubernetes" {
  host                   = yyy
  cluster_ca_certificate = base64decode(xxx)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", "example", "--profile", "foo"]
    command     = "aws"
  }
}

Or if you are managing your profiles out of band (i.e., loading the respective creds into the environment using the profile), it's just:

provider "aws" {
  region  = "us-east-1"
}

Which means your exec is:

provider "kubernetes" {
  host                   = yyy
  cluster_ca_certificate = base64decode(xxx)
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    args        = ["eks", "get-token", "--cluster-name", "example"]
    command     = "aws"
  }
}

So I don't see how the exec() method creates a "tighter coupling" than what is already present

@lorengordon
Contributor

That again requires the aws cli to be present (simply setting the profile in the aws provider does not require the aws cli), and specific named profiles to be configured on every system that executes that config. That's not reasonable. More typically, the resolution of the credential is up to the user or the executing environment. The "profile" is just an input variable, not hardcoded. Maybe it's null, maybe it's not. It is up to the user how to resolve the credential for the target account and role. Or perhaps the aws provider is configured to support assume_role blocks and chained roles. The AWS credential chain can be very complicated (as can other cloud providers'), and simply saying "shell out to an external command" significantly restricts that credential chain, as well as imposing weird external requirements on the execution environment.
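
As a rough, hypothetical sketch of the kind of configuration being described (the role ARN variable and region are placeholders): the AWS provider below resolves its credentials entirely through an assume_role block, while an exec block in the Kubernetes provider would shell out to the AWS CLI, which resolves credentials on its own and never sees that session.

variable "workload_role_arn" {
  type = string # hypothetical role assumed by the provider
}

# Credentials are resolved inside the provider: ambient credentials
# (env vars, SSO, instance profile, ...) assume the workload role.
provider "aws" {
  region = "us-east-1"

  assume_role {
    role_arn     = var.workload_role_arn
    session_name = "terraform"
  }
}

# An exec block in the kubernetes provider cannot reuse the session resolved above;
# the same role chain would have to be reproduced out of band for the CLI,
# e.g. via profiles or environment variables.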

I get the hesitation and push back on supporting options native to the cloud providers... The Kubernetes provider doesn't want to have to bundle SDKs for every cloud provider and maintain those code paths, and the Terraform community doesn't want to introduce cross-provider dependencies. But the limitation just makes the user experience kinda awful and limits the use cases.

Regardless, that's all off topic. An ephemeral data source for aws_eks_cluster_auth does have value, since it would return a different token at plan time and apply time. That definitely addresses a couple use cases that currently fall over when configuring the Kubernetes provider using aws_eks_cluster_auth.
