-
Notifications
You must be signed in to change notification settings - Fork 848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate opaque OAuth2 access tokens #5897
Conversation
f51d91d
to
1261ccc
Compare
Ran some benchmarks using On master, these are the results:
On this branch:
So it's certainly slower |
After adding caching, I'm seeing much better performance:
Not sure if the increased requests/sec is indicative of better performance, or if my CPU was just a bit busier when running the previous benchmarks - but either way, at least it doesn't seem to negatively impact API performance, so I'm happy |
@@ -0,0 +1,3 @@ | |||
BEGIN; | |||
DROP TABLE IF EXISTS access_tokens; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't remember if this worked on Postgres 9 so I deployed and then ran migrate --migrate-db-to-version=1591737494
and it worked fine.
I remember now that it's DROP COLUMN IF EXISTS
that only exists in certain versions of Postgres.
Small note that this will break community resources that use the undocumented api: https://github.com/alphagov/terraform-provider-concourse/blob/5ff605de4afef50551bf97ed56ba0c9f4c2730aa/pkg/client/client.go#L34-L37 I guess since it's undocumented we can get away with a minor release? |
@chenbh huh, good find. Looks like they already need to play catch up with concourse versions (that |
@aoldershaw @chenbh Yeah sounds good. 👍 |
} | ||
body = bytes.NewReader(newResp) | ||
}) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not clear end-to-end on what happens if I happen to unilaterally POST to /sky/issuer/token
with a body of my choosing - perhaps containing all sorts of claims. I haven't tested this PR but it looks like as written, there is no signature checking for the claims in the passed IDToken - we just go on to stuff the extracted data into the DB, under the newly chosen random string, and return that string. Later auth attempts using the token should succeed, because they look it up in the DB and retrieve the claims that were given. So I could use this to grant myself access to anything.
It is also likely that I haven't understood the entire dex interaction, and that there is some place where the authenticity of the dex payload is checked, or that there's a nuance I'm missing about how this handler is exposed.
Hoping that you can enlighten me :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The /sky/issuer/token
endpoint (and the OAuth2 token endpoint in general) supports several different flows, some of which require previous "hops" to get there (e.g. logging in via GitHub using the authorization_code
grant type). For each of the possible flows, Dex validates that the flow was "successful" (whatever that means for the particular grant type - e.g. correct password for the password
grant, valid auth code for the authorization_code
grant, etc). None of the grant types, however, support providing your own claims - Dex tells you what your claims are.
My point being, Dex is still doing all the validation to determine whether the request to /sky/issuer/token
is valid, and won't grant a token otherwise.
This new middleware delegates to Dex, parses and stores the claims without validation, and sends back a modified response. I didn't see any value in validating the claims against Dex's signing keys for the following reasons:
- We aren't delegating to an unknown third-party - this is a piece of code that's bundled into the Concourse binary itself
- We aren't hitting a network interface that could be tampered with - the Dex handlers and request/response handoff is all done in memory
That said, if you think there's still a vulnerability here, I'd be happy to hear you out. I'm far from a security expert, so I could be wrong on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I think I did not understand the "delegation to dex" part - that's the httptest.NewRecorder
business where it has the inner ServeHTTP
? So this ends up just tweaking the response from Dex, and Dex still does whatever it does to authenticate the total flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, exactly
#5851 Since ID tokens grow along with the claims of the user, we've run into issues where users in hundreds of groups have ID tokens that, when sent in a header, cause problems at load balancers that reject large headers. Rather than using ID tokens as access tokens, Concourse can generate an access token (random bytes) and store a mapping of that access token to the claims on the ID token from dex. On each request, we'd query the database to find the claims for that given access token. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 ...in order to store/fetch tokens in the database. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 Let me start by acknowledging that this feels really hacky, but I couldn't think of a better way without modifying dex. All auth flows go through the `/sky/issuer/token` endpoint of dex. Rather than trying to write our own OAuth2 token endpoint, the approach I took was to forward the request directly to dex, and to intercept and modify the response before sending it to the client (p.s. all OIDC token endpoints return the same response body, so it's not tied directly to dex). On a successful authentication, we do the following: 1. Parse the claims from the ID token. I opted to not perform verification - if this is a bad idea, let me know, but my thinking was that we can't possibly be reaching out to an unknown server (or a server on some network interface at all, since it's all in memory) - so if dex approved the request, it must be legit, and we should trust the id token that dex gave back 2. Generate a new access token 3. Store the access token/claims in the database 4. Remarshal the response and forward it to the client, along with any headers that were set from the initial response Note that with this change, auth still works (despite us never fetching the access token from the DB in the API). This is because we choose to replace the access token with the ID token, which the API currently expects. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 The auth cookies are now going to be a constant < 50 bytes, so there's no need to keep this feature Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 This commit has the skymarshal and TSA use the access token from the /token response rather than using the ID token as an access token. This breaks things - we're still expecting to receive the ID token in our API handlers - but is a necessary first step to migrate over Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 ...rather than just a select few. This'll make it easier to validate the claims in the API. Also, don't keep track of "Extra" claims - keep track of all claims in a raw form. There's some redundancy between the fields in `jwt.Claims`, but eh, it's more convenient to work with. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 Previously, the token verifier parsed the JWT in the access token and verified it against the public keyset. Now that it's just an opaque string, fetch the claims from the DB and validate the timestamp. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 Adds a GC component that removed expired access tokens from the database. Supports a leeway parameter that says "if the access token is expired by less than the leeway, don't remove it yet". This is in the oauth spec, and indeed in our claims validation - we allow tokens expired by < 1 minute. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 Initially, I followed the typical pattern of database objects being interfaces - but it's kind of an awkward pattern to represent data as an interface. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 Based on some benchmarking, the performance of the API dropped fairly substantially (at least for the /api/v1/user endpoint) - ~40% fewer requests per second. After caching was added, the benchmarks claim we're actually getting better performance - this may just be that my CPU was a bit busier before, but at least it doesn't seem to have a noticable performance degradation. The cache is an LRU cache with a 1MiB capacity for claims. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 This info could have been extracted from the raw claims, but this makes it easier to work with. Signed-off-by: Aidan Oldershaw <[email protected]>
...rather than on every request. #5851 Now that we're back to creating our own access tokens, we can simplify the user tracking behaviour. With the auth changes to 6.1, in order to track user activity, we updated the "users.last_login" field in the DB on every request. Since that was found to be pretty slow, we added an optimization where we keep track of user updates in memory, and every 15 seconds (or if the number of users exceeds some limit), dump that to the database. In an effort to remove complexity from auth, just track the "last login" time when users actually login Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 No longer needed since the user batcher was removed Signed-off-by: Aidan Oldershaw <[email protected]>
They broke after I removed the UserTracker Signed-off-by: Aidan Oldershaw <[email protected]>
#5912 Compute teamRoles and isAdmin just once on initialization Signed-off-by: Aidan Oldershaw <[email protected]>
#5912 Now that the `IsAdmin` call is really cheap, this isn't necessary - but as was pointed out in the linked issue, this is completely unnecessary. The accessor takes into account whether the user is an admin, so this is a redundant call. Signed-off-by: Aidan Oldershaw <[email protected]>
#5851 As @jamieklassen pointed out in PR review, it was odd that the AccessHandler had dependencies on a TokenVerifier and TeamsFetcher since it just passed on the computed results to AccessFactory. While I'm making changes here anyway, might as well make the dependencies a bit more reasonable by having AccessFactory delegate to TokenVerifier and TeamsFetcher directly. Now, AccessFactory is a bit less dumb than just being a wrapper over the NewAccessor constructor. Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of anything else. Maybe we should just start dogfooding this!
#5897 In the PR linked above, I had missed some changes to fly that expect the token to be a JWT. This commit fixes the status command Signed-off-by: Aidan Oldershaw <[email protected]>
It's unfortunate, but in order to maintain parity with earlier versions of `fly targets` (that report on the expiry of the access token), we can't treat the access token as opaque. This breaks the `Bearer` token contract for clients. However, `fly` is sort of a "special" client anyway, in that it's bundled with Concourse itself, and is versioned alongside it. In #5897, we encoded the expiry into the access token, making this feature easy to implement. This commit also adds a helper for generating a `.flyrc` that can be useful in other `fly` integration tests as well. Signed-off-by: Aidan Oldershaw <[email protected]>
#5897 In the PR linked above, I had missed some changes to fly that expect the token to be a JWT. This commit fixes the status command Signed-off-by: Aidan Oldershaw <[email protected]>
It's unfortunate, but in order to maintain parity with earlier versions of `fly targets` (that report on the expiry of the access token), we can't treat the access token as opaque. This breaks the `Bearer` token contract for clients. However, `fly` is sort of a "special" client anyway, in that it's bundled with Concourse itself, and is versioned alongside it. In #5897, we encoded the expiry into the access token, making this feature easy to implement. This commit also adds a helper for generating a `.flyrc` that can be useful in other `fly` integration tests as well. Signed-off-by: Aidan Oldershaw <[email protected]>
Broken after #5897 Signed-off-by: Aidan Oldershaw <[email protected]>
I believe this may have introduced a race during concurrent access/auth 🤔 #6019 |
What does this PR accomplish?
Bug Fix | Feature | Documentation
closes #5851 .
closes #5912 .
Changes proposed by this PR:
Previously, we were using the ID tokens from dex as access tokens. While this allows for stateless validation of requests, we ran into several issues where the ID token was massive for users that were in many groups, which caused several problems in practice (e.g. load balancers rejecting requests with such large headers).
The approach this PR takes is:
/sky/issuer/token
response from dex before it goes back to the clientaccess_tokens
fly login
, that shouldn't revoke your web UI token.It also changes the user tracking behaviour - rather than updating a user's last login on every authenticated request, only update it on login. This affords us some simplification in the API auth flow (in particular, we no longer need to batch user update requests) - and #5915 indicates this approach has some serious performance implications.
Release Note
fly active-users
Notes to reviewer:
TODO list:
This may be scope creep, but maybe provide a mechanism for revoking tokens? Previously, operators could rotate the dex signing keys, and all existing sessions would be invalidated. With this change, we no longer validate against the JWT signature (since the token isn't a JWT), so rotating the signing keys will have no effect. With this change, we could introduce afly revoke-token (--all) (--user=connector:username) (--sub=sub)
without too much effort - I wonder if this is something people will want/needI left instructions for repro-ing the issue (on master) using an HAProxy load balancer here: #5851 (comment). Note that you may need to mess around with the
tune.bufsize
param so that it just barely allows unauthenticated requests to succeed, but fails when there's a long JWT as the access token. I found 2800 bytes was the sweet spot for me, but it may vary.Contributor Checklist
Reviewer Checklist
BOSH and
Helm packaging; otherwise, ignored for
the integration
tests
(for example, if they are Garden configs that are not displayed in the
--help
text).