Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] sts get caller identity does not work immediately #2517

Closed
hjkatz opened this issue Feb 26, 2024 · 14 comments
Closed

[bug] sts get caller identity does not work immediately #2517

hjkatz opened this issue Feb 26, 2024 · 14 comments
Assignees
Labels
guidance Question that needs advice or information.

Comments

@hjkatz
Copy link

hjkatz commented Feb 26, 2024

Describe the bug

See: #2093

Expected Behavior

I expect the call to sts.GetCallerIdentity() to succeed immediately after authenticating/receiving new sts credentials.

Current Behavior

The calls to sts.GetCallerIdentity() seem to be inconsistently failing.

Reproduction Steps

See: #2093

Possible Solution

See: #2093

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

Compiler and Version used

go version go1.21.6 linux/amd64

Operating System and version

ubuntu

@hjkatz hjkatz added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 26, 2024
@lucix-aws
Copy link
Contributor

lucix-aws commented Mar 5, 2024

@hjkatz --

This is ultimately a fault in server-side behavior. The IAM state change associated with retrieving credentials (via AssumeRole or whatever else) does not propagate immediately after credentials are returned.

@lucix-aws lucix-aws added service-api This issue is due to a problem in a service API, not the SDK implementation. and removed needs-triage This issue or PR still needs to be triaged. labels Mar 5, 2024
@RanVaknin
Copy link
Contributor

Hi @hjkatz ,

Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just a nature of a distributed system.

Thanks,
Ran~

@RanVaknin RanVaknin closed this as not planned Won't fix, can't repro, duplicate, stale Mar 5, 2024
@RanVaknin RanVaknin self-assigned this Mar 5, 2024
Copy link

github-actions bot commented Mar 5, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.

@hjkatz
Copy link
Author

hjkatz commented Mar 5, 2024

Hi @hjkatz ,

Just to chime in as well. This is not an SDK issue and is not unique to SSO. What you are experiencing is a propagation delay that is solved with a retry. This is not an issue with the SDK itself but just a nature of a distributed system.

Thanks,
Ran~

@RanVaknin Kindly review the linked discussion. Retrying 5 times with a gradual backoff from 500ms to 2s still reproduces the issue. (I can provide a mvp example if desired too.)

I understand that a distributed system will require time to propogate the new token. So I'll start with my goal instead: I want to verify that the SSO session via the SDK is valid. How can I test that reliably?

@RanVaknin RanVaknin reopened this Mar 5, 2024
@hjkatz
Copy link
Author

hjkatz commented Mar 5, 2024

@RanVaknin I realize that the discussion doesn't have as clear of an example as I'm suggesting. I'll upload a clear example later today.

@RanVaknin RanVaknin added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Mar 5, 2024
@hjkatz
Copy link
Author

hjkatz commented Mar 6, 2024

// Wrapper for sts.GetCallerIdentity() that supports retries
//
// Use the context to set a maximum time that retries can be performed.
// When the context is canceled then the final response will be returned.
//
// See: https://github.com/aws/aws-sdk-go-v2/discussions/2093#discussioncomment-8455830
func StsGetCallerIdentity(ctx context.Context, client *sts.Client) (result *sts.GetCallerIdentityOutput, err error) {
    if client == nil {
        return nil, errs.New("cannot call StsGetCallerIdentity with nil client!")
    }

    attempt := 0
    retries := 5
    // internal lib that implements a backoff between start -> end, without jitter (false)
    backoff := reliable.NewBackoff(100*time.Millisecond, 1*time.Second, false)

    for attempt < retries {
        attempt++

        select {
        case <-ctx.Done():
            // ran out of time, return whatever we got last
            return
        default:
            // continue below
        }

        result, err = client.GetCallerIdentity(ctx, &sts.GetCallerIdentityInput{})
        if err == nil {
            // success
            return // final values
        }

        // See: https://github.com/aws/aws-sdk-go-v2/discussions/2093
        if !strings.Contains(err.Error(), "api error InvalidClientTokenId") {
            // non-retryable error
            return // last result + error
        }

        // error, backoff and try again
        be := backoff.Wait(ctx) // Wait() == time.Sleep(backoff.NextDuration())
        if be != nil {
            // context canceled
            return // last values
        }
    }

    return // last values
}

I hope this helps reproduce what we're seeing.

I'm also hopeful we can find a way via the SDK to verify the session credentials returned by the SSO credentials provider are valid. Some thoughts to get at this information:

  • sts get-caller-identity (as above)
  • reading a profile from ~/.aws/config that depends on sso_session = my-session and see if it errors or not (relying on internal aws CLI's boto implementation)
  • reading refresh token from the SSO provider directly? (not yet tried) see: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/credentials/ssocreds
  • Some new implementation in ssocreds that provides this ability

Happy to discuss alternatives too!

@lucix-aws
Copy link
Contributor

lucix-aws commented Mar 6, 2024

@hjkatz On average, how long does it take for GetCallerIdentity to recognize the token?

This really seems like it should be something that sts models as @waitable. That would enable automatic code generation of a waiter API, in every SDK, that polls GetCallerIdentity until a success response is returned. In the absence of that trait, you are forced to re-solve that problem, as you've basically implemented your own waiter above.

In case you're not familiar with waiters -- https://aws.github.io/aws-sdk-go-v2/docs/making-requests/#using-waiters.

The most ubiquitous example in my mind would be dynamodb.TableExistsWaiter, since ddb table creates are technically asynchronous.

tl;dr waiting for async state changes is a problem we've solved at large, if I understand correctly it's just a question of pushing for sts to add some additional modeling to solve this specific case

@lucix-aws
Copy link
Contributor

Followup - what is the delay between you provisioning the token (looks like through sso) and first calling GetCallerIdentity?

If it's on the order of seconds, then what I said above generally stands.

If this is like minutes or hours, that doesn't seem at all like acceptable behavior in the IAM sense and would warrant further investigation.

@lucix-aws lucix-aws added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Mar 6, 2024
@hjkatz
Copy link
Author

hjkatz commented Mar 6, 2024

Followup - what is the delay between you provisioning the token (looks like through sso) and first calling GetCallerIdentity?

It's on the order of milliseconds.

For context we have a shared developer CLI that everyone uses for various commands/tools/utilities. In that CLI we annotate some commands as needing SSO to work correctly. Many of our commands are annotated and interact with AWS in some required way.

Our goal was to warn the user that they have not started their SSO session for the running command. To do this we need to check if the session is valid (not sure how to do this), and we came up with generating a token then trying to see if it works.

If it's on the order of seconds, then what I said above generally stands.

For our use case I think the order of seconds is too long. It feels like a delay for our users interacting with a CLI. I would prefer milliseconds or some approach that does the bare minimum for testing that the SSO session is valid for generating an STS token or something like that.

@lucix-aws
Copy link
Contributor

Sorry, my last question there was incomplete in wording. I'm trying to understand what the actual delay is you're observing between provisioning the token and then getting a successful call to GetCallerIdentity.

@hjkatz
Copy link
Author

hjkatz commented Mar 6, 2024

I gotcha. I was writing up a test case to get some real data.

Summary:
1ms: 143323
100us: 856294
10ms: 374
100ms: 9

Here's the summary of 1 million attempts. It's looking much better today than when I originally opened the ticket. For whatever reason I'm not seeing anything take longer than ~100ms but in the past I would feel the delay more in the ~1-2s range, so maybe something's improved. (It could be my network as I'm at my parents' place atm)

How about I run this test again daily and get back to you with more data?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Mar 7, 2024
@hjkatz
Copy link
Author

hjkatz commented Mar 7, 2024

Today's summary also seems fine:

Summary:
10ms: 359
1ms: 141759
100us: 857875
100ms: 7

I'm suspecting that the way I'm using a backoff after getting the credentials should be a naive loop with context.WithTimeout() rather than some arbitrary number of attempts. Like maybe it takes 10ms, but that's 20 attempts, yet I'm exiting after 5 attempts. I'll keep tracking this.

@hjkatz
Copy link
Author

hjkatz commented Mar 11, 2024

Today seems the same too.

Summary:
10ms: 455
1ms: 160674
100us: 838859
100ms: 12

I'm going to add additional logging into our CLI and see if I can reproduce any inconsistency today, but my suspicion is that 5 attempts isn't enough and we just need to try more times.

@lucix-aws
Copy link
Contributor

My recommendation in general there would not be to fix the number of attempts and instead write your "waiting" construct to accept a timeout.

This is how modeled waiters are written (well, generated, but obviously we wrote the code to do that). The waiter then just retries "infinitely" (with an increasing backoff) until it either hits the success case or exceeds the caller-provided deadline.

@lucix-aws lucix-aws added guidance Question that needs advice or information. and removed bug This issue is a bug. investigating This issue is being investigated and/or work is in progress to resolve the issue. service-api This issue is due to a problem in a service API, not the SDK implementation. labels Mar 11, 2024
@aws aws locked and limited conversation to collaborators Mar 11, 2024
@lucix-aws lucix-aws converted this issue into discussion #2543 Mar 11, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
guidance Question that needs advice or information.
Projects
None yet
Development

No branches or pull requests

3 participants