
STS Client doesn't refresh IAM Permissions #2332

Closed
AlexVulaj opened this issue Oct 23, 2023 · 10 comments
@AlexVulaj

Describe the bug

I have an sts.Client configured with credentials for a given IAM Role. If the IAM permissions of the Role are updated elsewhere (e.g. via the web console) the sts.Client never seems to pick up those changes no matter how long I retry. I can recreate my sts.Client in the event of this error and manually retry, but I'd expect the client to handle this on its own.

Expected Behavior

I would expect the IAM permission changes to affect my Go client within a few seconds. To demonstrate this using the AWS CLI, I did the following:

  1. Get STS credentials for an IAM role via the CLI.
  2. Export those credentials in my terminal.
  3. Run aws ec2 describe-instances; this gives me an UnauthorizedOperation error, as expected.
  4. Log into the AWS console for that account and add IAM permissions to list all EC2 instances.
  5. Without changing the credentials in my terminal, run aws ec2 describe-instances again. This fails at first, but after a few seconds it successfully returns results.

I would expect this same behavior from the sts.Client in the Go SDK.

Current Behavior

I can retry for multiple minutes, but my sts.Client never picks up the new permissions. I know the permissions have propagated, because if I add a short time.Sleep before creating my client, everything works fine.

Reproduction Steps

It's hard to provide an exact code snippet because this appears to be a race condition, but here is my best approximation:

cfg, _ := config.LoadDefaultConfig(context.TODO(),
	config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(
		creds.AccessKeyID, creds.SecretAccessKey, creds.SessionToken)))
...
// external service adds a new IAM Policy for the above-provided credentials,
// then immediately this next line runs...
assumeRoleProvider := stscreds.NewAssumeRoleProvider(stsClient, roleArn)
result, err := assumeRoleProvider.Retrieve(context.TODO())

I've also tried configuring a retryer for the client that explicitly lists 403 HTTP responses as a retryable error:

...
config.WithRetryer(func() aws.Retryer {
	return retry.NewStandard(func(options *retry.StandardOptions) {
		options.Retryables = append(options.Retryables, retry.RetryableHTTPStatusCode{
			Codes: map[int]struct{}{403: {}},
		})
		options.Backoff = retry.BackoffDelayerFunc(func(attempt int, err error) (time.Duration, error) {
			fmt.Println("Retrying...")
			return 5 * time.Second, nil
		})
		options.MaxAttempts = 50
	})
})
...

No matter how long a backoff I add or how many attempts I allow, the request continues to fail.

Possible Solution

The standard retry mechanism in sts.Client should be able to detect this error automatically and refresh the IAM permissions.

Additional Information/Context

My specific use case involves making AssumeRole calls in quick succession across multiple AWS accounts. An external service updates the trust policies/IAM permissions needed for this chain of AssumeRole calls to be executed. There appears to be a race condition where I have to wait for the IAM changes to propagate before I can even create my sts.Client. I'm not sure how to accurately tell when those changes are recognized by AWS, so I'm left to retry by recreating my sts.Client over and over.

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.21.2
github.com/aws/aws-sdk-go-v2/config v1.18.45
github.com/aws/aws-sdk-go-v2/credentials v1.13.43
github.com/aws/aws-sdk-go-v2/service/sts v1.23.2

Compiler and Version used

go version go1.20.4 darwin/arm64

Operating System and version

macOS Ventura 13.6

@AlexVulaj AlexVulaj added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Oct 23, 2023
@RanVaknin RanVaknin self-assigned this Oct 23, 2023
@lucix-aws
Contributor

Are you sure your STS provider isn't being wrapped in a cache? All credentials providers loaded from the "chain" (so LoadDefaultConfig, at a minimum) are wrapped in a caching mechanism, which can hold onto effectively stale credentials after a role's permissions have changed.
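To make that wrapping concrete, here's a small illustration (not from the thread) that inspects the resolved provider; it assumes default config resolution:

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
)

// Illustration only: after LoadDefaultConfig, the resolved provider is
// wrapped in an *aws.CredentialsCache, which serves cached credentials
// until they expire, regardless of permission changes on the role.
func inspectCache() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	if _, ok := cfg.Credentials.(*aws.CredentialsCache); ok {
		fmt.Println("credentials provider is wrapped in a cache")
	}
}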

I'm not seeing this behavior with an explicitly instantiated assumerole provider, though. If I create a client like so:

s3c := s3.NewFromConfig(cfg, func(o *s3.Options) {
    // the root credentials that power AssumeRole calls are static from config
    o.Credentials = stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleARN)
})

put that client in a ListBuckets loop, and then update the role's permissions by adding and removing S3 access, I see the change take effect in the loop in real time.
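A rough sketch of that loop (the s3c client is the one built above; the poll interval is arbitrary):

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// Poll ListBuckets so role-permission changes show up as the loop runs.
func pollBuckets(ctx context.Context, s3c *s3.Client) {
	for {
		out, err := s3c.ListBuckets(ctx, &s3.ListBucketsInput{})
		if err != nil {
			fmt.Println("denied:", err)
		} else {
			fmt.Printf("allowed: %d buckets\n", len(out.Buckets))
		}
		time.Sleep(2 * time.Second) // arbitrary poll interval
	}
}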

The cache-versus-no-cache difference also explains why the CLI works as you expect: your two invocations there are entirely separate processes, so there's no cache to speak of (the CLI doesn't persist credentials like these to disk, to my knowledge).

@AlexVulaj
Author

Hey @lucix-aws, thanks for the quick and thorough response! I'm going to take a deep dive into this to see what I can do.

In the meantime - the place where I'm seeing this just so happens to be in a public repo that I can share for additional context.

The call to stscreds.NewAssumeRoleProvider: https://github.com/AlexVulaj/backplane-cli/blob/backplane-assume-implement-retry/pkg/awsutil/sts.go#L91

The place in the same file where I've written my custom retry: https://github.com/AlexVulaj/backplane-cli/blob/backplane-assume-implement-retry/pkg/awsutil/sts.go#L123

So I think your theory about my provider being wrapped in a cache is correct. What I've written now "works"; however, it would be nice if I could rewrite it so it doesn't recreate the STS client each time.

@AlexVulaj
Author

Is there a recommended way to flush the cache or force it to refresh? I tried the following and unfortunately had no luck either:

credsProvider := credentials.NewStaticCredentialsProvider(creds.AccessKeyID, creds.SecretAccessKey, creds.SessionToken)
credsProvider.Value.CanExpire = true
credsProvider.Value.Expires = time.Now().Add(5 * time.Second)
...
config.WithCredentialsProvider(credsProvider)

@lucix-aws
Contributor

Those are the static credentials used to make the AssumeRole call, though, if I understand correctly. The credentials that need to not be cached are the ones retrieved from stscreds.AssumeRoleProvider.Retrieve(). The credentials in your snippet, if they're actually being used to call AssumeRole, can stay static, assuming they have permission to make that call in the first place.
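For reference, if the goal from the previous comment is to flush cached credentials on demand, the SDK's aws.CredentialsCache type can wrap a provider explicitly, and its Invalidate method forces the next Retrieve to call the underlying provider again. A minimal sketch, with roleARN as a placeholder:

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

// Wrap the assume-role provider in an explicit cache so it can be flushed.
func newFlushableProvider(cfg aws.Config, roleARN string) *aws.CredentialsCache {
	return aws.NewCredentialsCache(
		stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleARN))
}

func retrieveOrFlush(ctx context.Context, cache *aws.CredentialsCache) {
	if _, err := cache.Retrieve(ctx); err != nil {
		// On an access-denied error, drop any cached credentials so the
		// next Retrieve calls the underlying provider again.
		cache.Invalidate()
		log.Println("invalidated cached credentials:", err)
	}
}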

@lucix-aws lucix-aws added guidance Question that needs advice or information. and removed bug This issue is a bug. labels Oct 24, 2023
@AlexVulaj
Author

AlexVulaj commented Oct 24, 2023

Thanks for your continued help on this - it's really appreciated.

What you're saying makes sense. And yes, that's correct: the credentials retrieved from stscreds.AssumeRoleProvider.Retrieve() are what will be used to make the next call to stscreds.AssumeRoleProvider.Retrieve(), and so on.

The flow here is a bit complicated, but I'll do my best to shed some more light here...

I have three roles (1, 2, and 3) that assume into each other in order:
Role 1 -> Role 2 -> Role 3

Each role is in a separate AWS account. Role 1 and Role 2 have static policies that allow Role 1 to assume Role 2. Role 2's permissions are dynamically updated by an external service to be able to assume Role 3. Role 3 has a trust relationship that always allows Role 2 to assume it.

My flow looks like:

1. Have the external service update IAM permissions for Role 2 so it can assume Role 3.
2. Use initial credentials to assume Role 1 (the equivalent of assume-role-with-web-identity), which gives me new credentials.
3. Use the new credentials from step 2 to create a new STS client (now acting as "Role 1").
4. Make a call to assume into Role 2; this works successfully and gives me back credentials.
5. Use the new credentials from step 4 to create a new STS client (now acting as "Role 2").
6. Attempt to make a call to assume into Role 3.

I believe the problem I'm running into is that at the time of step 5, the update from step 1 hasn't yet propagated. Because of that, and the caching you mentioned earlier, the client created in step 5 will always fail its next call, no matter how long I retry. I've gotten around that by repeating steps 5 and 6 until my call works, but at this point I'm wondering if there's a better way than recreating the client repeatedly.
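For concreteness, here's a minimal sketch of that chain as described, with the role ARNs as placeholders and error handling abbreviated; each hop builds a fresh config from the credentials the previous hop returned:

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/credentials"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

// assumeChain walks Role 1 -> Role 2 -> Role 3, returning a config that
// holds the final role's credentials.
func assumeChain(ctx context.Context, cfg aws.Config, roleARNs ...string) (aws.Config, error) {
	for _, arn := range roleARNs {
		provider := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), arn)
		creds, err := provider.Retrieve(ctx)
		if err != nil {
			return aws.Config{}, fmt.Errorf("assuming %s: %w", arn, err)
		}
		// Build the next hop's config from the credentials just returned.
		next := cfg.Copy()
		next.Credentials = credentials.NewStaticCredentialsProvider(
			creds.AccessKeyID, creds.SecretAccessKey, creds.SessionToken)
		cfg = next
	}
	return cfg, nil
}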

@RanVaknin RanVaknin removed the needs-triage This issue or PR still needs to be triaged. label Oct 24, 2023
@RanVaknin
Contributor

Hi @AlexVulaj,

This is quite a convoluted setup, but I have an idea of what is happening here.

When you say:

// external service adds a new IAM Policy for the above provided credentials, then immediately this next line runs...

I assume this external service operates asynchronously, while the rest of the Go code you have there is synchronous.

The reason you are not running into this in the CLI is that you take a manual step there that is synchronous:

Log into the AWS console for that account and add IAM permissions to list all EC2 instances.

This also explains why sleeping the main thread ensures enough time has passed for the asynchronous update to complete before the client is initialized, and why retrying doesn't work: the SDK caches the credentials before the role policy is updated.

I'm not sure why you are updating the role policy dynamically, but from a security standpoint this raises some red flags for me.
Since we don't really know what your product does or why, it's hard to come up with a viable workaround.
If removing that dynamic update altogether is not an option, what I would do is create a waiter that sleeps the thread, listens for a change in state (something like a policy creation), and only then releases the thread. You can borrow code from the existing waiters we have generated for the IAM client.
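A minimal sketch of that polling idea, under the assumption that retrying AssumeRole with a fresh (uncached) provider each attempt is an acceptable stand-in for a generated waiter; roleARN and the timings are placeholders:

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

// waitForAssumeRole polls AssumeRole until the propagated permissions allow
// it, creating a fresh provider each attempt so no stale state is cached.
func waitForAssumeRole(ctx context.Context, cfg aws.Config, roleARN string) (aws.Credentials, error) {
	deadline := time.Now().Add(2 * time.Minute) // placeholder overall timeout
	for {
		provider := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleARN)
		creds, err := provider.Retrieve(ctx)
		if err == nil {
			return creds, nil
		}
		if time.Now().After(deadline) {
			return aws.Credentials{}, fmt.Errorf("permissions never propagated: %w", err)
		}
		time.Sleep(5 * time.Second) // placeholder poll interval
	}
}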

Thanks,
Ran~

@RanVaknin RanVaknin added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 24, 2023
@AlexVulaj
Author

AlexVulaj commented Oct 24, 2023

Thanks for the response @RanVaknin

The code I'm working on is for a CLI. It makes a call to a backend API that updates the IAM permissions and then returns control to the CLI.

To summarize the context of the product and our use case: this relates to how we (Red Hat) manage access to AWS accounts for managed ROSA customers. We go through this process to ensure that permission is granted only at the time it's requested, and only to the individual requesting it. I'll note that managing access this way was a requirement from AWS.

That all said, it sounds like the solution I'm going with currently fits our needs best. I appreciate both of your time!

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Oct 25, 2023
@lucix-aws
Contributor

Closing for now since we know there's no behavioral issue SDK-side.

@lucix-aws lucix-aws closed this as not planned Oct 25, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@RanVaknin RanVaknin reopened this Oct 25, 2023
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
