Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate opaque OAuth2 access tokens #5897

Merged
merged 22 commits into from
Aug 4, 2020
Merged

Generate opaque OAuth2 access tokens #5897

merged 22 commits into from
Aug 4, 2020

Conversation

aoldershaw
Copy link
Contributor

@aoldershaw aoldershaw commented Jul 20, 2020

What does this PR accomplish?

Bug Fix | Feature | Documentation

closes #5851 .
closes #5912 .

Changes proposed by this PR:

Previously, we were using the ID tokens from dex as access tokens. While this allows for stateless validation of requests, we ran into several issues where the ID token was massive for users that were in many groups, which caused several problems in practice (e.g. load balancers rejecting requests with such large headers).

The approach this PR takes is:

  • Intercept the /sky/issuer/token response from dex before it goes back to the client
  • On each API call, fetch the claims for the corresponding access token from the database
    • Cache this on the ATC so we don't need to always hit the DB (up to 1MiB of claims is stored in an LRU cache)
  • GC component that clears out expired access tokens from the DB
    • Side note: a single user can have multiple valid access tokens if they log in multiple times, and indeed they'll have multiple rows in the database. This is consistent with dex, and makes sense - if you fly login, that shouldn't revoke your web UI token.

It also changes the user tracking behaviour - rather than updating a user's last login on every authenticated request, only update it on login. This affords us some simplification in the API auth flow (in particular, we no longer need to batch user update requests) - and #5915 indicates this approach has some serious performance implications.

Release Note

  • There were several issues that users encountered (particularly after v6.1.0) as a result of long access tokens. Concourse now generates much shorter access tokens rather than using the raw user data.
  • Users' last activity is now tracked on login rather than on every request. Updating the last activity on every request caused database problems at scale. Note: last activity is only relevant to fly active-users
  • This is only a breaking change for any custom automation built around Concourse that authenticates with the Concourse API

Notes to reviewer:

TODO list:

  • caching for access token lookup
  • GC to clear expired access tokens from DB
  • This may be scope creep, but maybe provide a mechanism for revoking tokens? Previously, operators could rotate the dex signing keys, and all existing sessions would be invalidated. With this change, we no longer validate against the JWT signature (since the token isn't a JWT), so rotating the signing keys will have no effect. With this change, we could introduce a fly revoke-token (--all) (--user=connector:username) (--sub=sub) without too much effort - I wonder if this is something people will want/need
    • we can add this as an enhancement later if people want it

I left instructions for repro-ing the issue (on master) using an HAProxy load balancer here: #5851 (comment). Note that you may need to mess around with the tune.bufsize param so that it just barely allows unauthenticated requests to succeed, but fails when there's a long JWT as the access token. I found 2800 bytes was the sweet spot for me, but it may vary.

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the
    BOSH and
    Helm packaging; otherwise, ignored for
    the integration
    tests

    (for example, if they are Garden configs that are not displayed in the
    --help text).

@aoldershaw aoldershaw added the bug label Jul 20, 2020
@aoldershaw aoldershaw force-pushed the issue/5851 branch 2 times, most recently from f51d91d to 1261ccc Compare July 21, 2020 02:08
@aoldershaw aoldershaw changed the title concourse generates opaque access tokens generate opaque access tokens Jul 21, 2020
@aoldershaw
Copy link
Contributor Author

aoldershaw commented Jul 22, 2020

Ran some benchmarks using wrk on the /api/v1/user endpoint to determine whether caching would be necessary.

On master, these are the results:

$ wrk -d5s -H 'Authorization: bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjgxNTY3ODRiZGVjODY1NWNjMmYwODljMjJjMzhlYjEyNmRiNDcwMGYifQ.eyJpc3MiOiJodHRwOi8vbG9jYWxob3N0OjgwODAvc2t5L2lzc3V
lciIsInN1YiI6IkNnUjBaWE4wRWdWc2IyTmhiQSIsImF1ZCI6ImZseSIsImV4cCI6MTU5NTQ2NTc1OSwiaWF0IjoxNTk1Mzc5MzU5LCJhdF9oYXNoIjoiaHNSeUdVaGZTdVk0NnNOS1FQQjFYQSIsImVtYWlsIjoidGVzdCIsImVtYWlsX3Zl
cmlmaWVkIjp0cnVlLCJmZWRlcmF0ZWRfY2xhaW1zIjp7ImNvbm5lY3Rvcl9pZCI6ImxvY2FsIiwidXNlcl9pZCI6InRlc3QiLCJ1c2VyX25hbWUiOiJ0ZXN0In19.WENjhqcTdNxl0lU_99RON3DY1inS3hCu_dPBYzIMptquhW_d0YGCKwuZ
Xh2xhFx2070BvumFMBF8K7ucx5xnIooV2QDnCDNk1Oz-0eZFFg-TbI4r5BPf2Pk6wW49oKd_OwWItbSpDuf5T0Q3K7eXKJgEgAQzmxKvKYbhJS3GtYA' http://docker.for.mac.localhost:8080/api/v1/user
Running 5s test @ http://docker.for.mac.localhost:8080/api/v1/user
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    17.94ms    7.83ms  76.28ms   73.87%
    Req/Sec   282.35     49.34   434.00     70.00%
  2829 requests in 5.06s, 1.23MB read
  Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec:    558.90
Transfer/sec:    248.34KB

On this branch:

$ wrk -d5s -H 'Authorization: bearer x' http://docker.for.mac.localhost:8080/api/v1/user
Running 5s test @ http://docker.for.mac.localhost:8080/api/v1/user
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.87ms   21.40ms 163.88ms   83.78%
    Req/Sec   183.26     45.68   280.00     67.00%
  1834 requests in 5.04s, 814.91KB read
Requests/sec:    364.20
Transfer/sec:    161.83KB

So it's certainly slower

@aoldershaw
Copy link
Contributor Author

After adding caching, I'm seeing much better performance:

$ wrk -d5s -H 'Authorization: bearer mepTHqP0WPpFEbEuUtdOj2oM/dZn3hhfAAAAAA' http://docker.for.mac.localhost:8080/api/v1/user
Running 5s test @ http://docker.for.mac.localhost:8080/api/v1/user
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.67ms    5.40ms  93.56ms   75.67%
    Req/Sec   341.00     42.46   450.00     72.00%
  3404 requests in 5.02s, 1.48MB read
Requests/sec:    677.89
Transfer/sec:    301.21KB

Not sure if the increased requests/sec is indicative of better performance, or if my CPU was just a bit busier when running the previous benchmarks - but either way, at least it doesn't seem to negatively impact API performance, so I'm happy

@aoldershaw aoldershaw marked this pull request as ready for review July 22, 2020 03:07
@@ -0,0 +1,3 @@
BEGIN;
DROP TABLE IF EXISTS access_tokens;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't remember if this worked on Postgres 9 so I deployed and then ran migrate --migrate-db-to-version=1591737494 and it worked fine.

I remember now that it's DROP COLUMN IF EXISTS that only exists in certain versions of Postgres.

atc/atccmd/command.go Outdated Show resolved Hide resolved
@chenbh
Copy link
Contributor

chenbh commented Jul 23, 2020

Small note that this will break community resources that use the undocumented api: https://github.com/alphagov/terraform-provider-concourse/blob/5ff605de4afef50551bf97ed56ba0c9f4c2730aa/pkg/client/client.go#L34-L37

I guess since it's undocumented we can get away with a minor release?

@aoldershaw
Copy link
Contributor Author

@chenbh huh, good find. Looks like they already need to play catch up with concourse versions (that token.NewTokenSource was added in alphagov/terraform-provider-concourse@43515b0 to support 6.1.0 changes). I'm inclined to say that it warrants a minor release for the reason you bring up, but we can probably mark it as breaking in the release notes. @vito do you agree?

@vito
Copy link
Member

vito commented Jul 23, 2020

@aoldershaw @chenbh Yeah sounds good. 👍

}
body = bytes.NewReader(newResp)
})
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not clear end-to-end on what happens if I happen to unilaterally POST to /sky/issuer/token with a body of my choosing - perhaps containing all sorts of claims. I haven't tested this PR but it looks like as written, there is no signature checking for the claims in the passed IDToken - we just go on to stuff the extracted data into the DB, under the newly chosen random string, and return that string. Later auth attempts using the token should succeed, because they look it up in the DB and retrieve the claims that were given. So I could use this to grant myself access to anything.

It is also likely that I haven't understood the entire dex interaction, and that there is some place where the authenticity of the dex payload is checked, or that there's a nuance I'm missing about how this handler is exposed.

Hoping that you can enlighten me :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /sky/issuer/token endpoint (and the OAuth2 token endpoint in general) supports several different flows, some of which require previous "hops" to get there (e.g. logging in via GitHub using the authorization_code grant type). For each of the possible flows, Dex validates that the flow was "successful" (whatever that means for the particular grant type - e.g. correct password for the password grant, valid auth code for the authorization_code grant, etc). None of the grant types, however, support providing your own claims - Dex tells you what your claims are.

My point being, Dex is still doing all the validation to determine whether the request to /sky/issuer/token is valid, and won't grant a token otherwise.

This new middleware delegates to Dex, parses and stores the claims without validation, and sends back a modified response. I didn't see any value in validating the claims against Dex's signing keys for the following reasons:

  1. We aren't delegating to an unknown third-party - this is a piece of code that's bundled into the Concourse binary itself
  2. We aren't hitting a network interface that could be tampered with - the Dex handlers and request/response handoff is all done in memory

That said, if you think there's still a vulnerability here, I'd be happy to hear you out. I'm far from a security expert, so I could be wrong on this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I think I did not understand the "delegation to dex" part - that's the httptest.NewRecorder business where it has the inner ServeHTTP? So this ends up just tweaking the response from Dex, and Dex still does whatever it does to authenticate the total flow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, exactly

@aoldershaw aoldershaw changed the title generate opaque access tokens Generate opaque access tokens Jul 28, 2020
#5851

Since ID tokens grow along with the claims of the user, we've run into
issues where users in hundreds of groups have ID tokens that, when sent
in a header, cause problems at load balancers that reject large headers.
Rather than using ID tokens as access tokens, Concourse can generate an
access token (random bytes) and store a mapping of that access token to
the claims on the ID token from dex. On each request, we'd query the
database to find the claims for that given access token.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

...in order to store/fetch tokens in the database.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

Let me start by acknowledging that this feels really hacky, but I
couldn't think of a better way without modifying dex.

All auth flows go through the `/sky/issuer/token` endpoint of dex.
Rather than trying to write our own OAuth2 token endpoint, the approach
I took was to forward the request directly to dex, and to intercept and
modify the response before sending it to the client (p.s. all OIDC token
endpoints return the same response body, so it's not tied directly to
dex). On a successful authentication, we do the following:

1. Parse the claims from the ID token. I opted to not perform
   verification - if this is a bad idea, let me know, but my thinking was
   that we can't possibly be reaching out to an unknown server (or a server
   on some network interface at all, since it's all in memory) - so if dex
   approved the request, it must be legit, and we should trust the id
   token that dex gave back
2. Generate a new access token
3. Store the access token/claims in the database
4. Remarshal the response and forward it to the client, along with any
   headers that were set from the initial response

Note that with this change, auth still works (despite us never fetching
the access token from the DB in the API). This is because we choose to
replace the access token with the ID token, which the API currently
expects.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

The auth cookies are now going to be a constant < 50 bytes, so there's
no need to keep this feature

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

This commit has the skymarshal and TSA use the access token from the
/token response rather than using the ID token as an access token. This
breaks things - we're still expecting to receive the ID token in our API
handlers - but is a necessary first step to migrate over

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

...rather than just a select few. This'll make it easier to validate the
claims in the API. Also, don't keep track of "Extra" claims - keep track
of all claims in a raw form. There's some redundancy between the fields
in `jwt.Claims`, but eh, it's more convenient to work with.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

Previously, the token verifier parsed the JWT in the access token and
verified it against the public keyset. Now that it's just an opaque
string, fetch the claims from the DB and validate the timestamp.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

Adds a GC component that removed expired access tokens from the
database. Supports a leeway parameter that says "if the access token is
expired by less than the leeway, don't remove it yet". This is in the
oauth spec, and indeed in our claims validation - we allow tokens
expired by < 1 minute.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

Initially, I followed the typical pattern of database objects being
interfaces - but it's kind of an awkward pattern to represent data as an
interface.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

Based on some benchmarking, the performance of the API dropped fairly
substantially (at least for the /api/v1/user endpoint) - ~40% fewer
requests per second.

After caching was added, the benchmarks claim we're actually getting
better performance - this may just be that my CPU was a bit busier
before, but at least it doesn't seem to have a noticable performance
degradation.

The cache is an LRU cache with a 1MiB capacity for claims.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

This info could have been extracted from the raw claims, but this makes
it easier to work with.

Signed-off-by: Aidan Oldershaw <[email protected]>
...rather than on every request.

#5851

Now that we're back to creating our own access tokens, we can simplify
the user tracking behaviour. With the auth changes to 6.1, in order to
track user activity, we updated the "users.last_login" field in the DB
on every request. Since that was found to be pretty slow, we added an
optimization where we keep track of user updates in memory, and every 15
seconds (or if the number of users exceeds some limit), dump that to the
database.

In an effort to remove complexity from auth, just track the "last login"
time when users actually login

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

No longer needed since the user batcher was removed

Signed-off-by: Aidan Oldershaw <[email protected]>
They broke after I removed the UserTracker

Signed-off-by: Aidan Oldershaw <[email protected]>
#5912

Compute teamRoles and isAdmin just once on initialization

Signed-off-by: Aidan Oldershaw <[email protected]>
#5912

Now that the `IsAdmin` call is really cheap, this isn't necessary - but
as was pointed out in the linked issue, this is completely unnecessary.
The accessor takes into account whether the user is an admin, so this is
a redundant call.

Signed-off-by: Aidan Oldershaw <[email protected]>
#5851

As @jamieklassen pointed out in PR review, it was odd that the
AccessHandler had dependencies on a TokenVerifier and TeamsFetcher since
it just passed on the computed results to AccessFactory. While I'm
making changes here anyway, might as well make the dependencies a bit
more reasonable by having AccessFactory delegate to TokenVerifier and
TeamsFetcher directly. Now, AccessFactory is a bit less dumb than just
being a wrapper over the NewAccessor constructor.

Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
Signed-off-by: Aidan Oldershaw <[email protected]>
Copy link
Member

@jamieklassen jamieklassen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of anything else. Maybe we should just start dogfooding this!

@aoldershaw aoldershaw merged commit 6a58961 into master Aug 4, 2020
@aoldershaw aoldershaw deleted the issue/5851 branch August 4, 2020 18:18
aoldershaw added a commit that referenced this pull request Aug 5, 2020
#5897

In the PR linked above, I had missed some changes to fly that expect the
token to be a JWT. This commit fixes the status command

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit that referenced this pull request Aug 5, 2020
It's unfortunate, but in order to maintain parity with earlier versions
of `fly targets` (that report on the expiry of the access token), we
can't treat the access token as opaque. This breaks the `Bearer` token
contract for clients. However, `fly` is sort of a "special" client
anyway, in that it's bundled with Concourse itself, and is versioned
alongside it.

In #5897, we encoded the expiry into the access token, making this
feature easy to implement.

This commit also adds a helper for generating a `.flyrc` that can be
useful in other `fly` integration tests as well.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit that referenced this pull request Aug 5, 2020
#5897

In the PR linked above, I had missed some changes to fly that expect the
token to be a JWT. This commit fixes the status command

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit that referenced this pull request Aug 5, 2020
It's unfortunate, but in order to maintain parity with earlier versions
of `fly targets` (that report on the expiry of the access token), we
can't treat the access token as opaque. This breaks the `Bearer` token
contract for clients. However, `fly` is sort of a "special" client
anyway, in that it's bundled with Concourse itself, and is versioned
alongside it.

In #5897, we encoded the expiry into the access token, making this
feature easy to implement.

This commit also adds a helper for generating a `.flyrc` that can be
useful in other `fly` integration tests as well.

Signed-off-by: Aidan Oldershaw <[email protected]>
aoldershaw added a commit that referenced this pull request Aug 7, 2020
Broken after #5897

Signed-off-by: Aidan Oldershaw <[email protected]>
@aoldershaw aoldershaw added the release/undocumented This didn't warrant being documented or put in release notes. label Aug 18, 2020
@aoldershaw aoldershaw changed the title Generate opaque access tokens Generate opaque OAuth2 access tokens Aug 18, 2020
@chrisfarms
Copy link
Contributor

I believe this may have introduced a race during concurrent access/auth 🤔 #6019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking bug release/undocumented This didn't warrant being documented or put in release notes.
Projects
None yet
8 participants