
Upload authentication #92

Closed
wants to merge 3 commits into from

Conversation

chelseatroy
Contributor

@chelseatroy chelseatroy commented Jan 29, 2020

Before the holiday break, I found myself attempting to boot and run theia for the purpose of getting a pretty skeletal pipeline to run start to finish. This would accomplish 2 goals:

  1. Determine how to use the app once it's running (now described in the wiki: https://github.com/zooniverse/theia/wiki/How-to-Use-the-App-Once-It's-Running)
  2. Establish that the end-to-end process works.

Going into this, my interpretation of the information I had collected was that the app should work all the way through the pipeline, but that an important authentication gap remained: essentially, theia would allow anyone to upload subjects to any project.

I have a skeletal pipeline running locally that successfully fetches LANDSAT images, remaps them, and resizes them. It fails to upload them to Panoptes because the unauthenticated client cannot find the project, as it is not public.

In short, Theia does not allow anyone to upload subjects to any project because Panoptes (rightfully) does not allow that. Theia must cooperate with Panoptes authentication in order for this step to work.

This is good news and bad news.

The good news: The problem here isn't that a previously working thing broke. It's that something never worked, and authentication was, in fact, on the radar as a thing that needed to be done. So we haven't lost time to a step we didn't anticipate here. The only difference is that Theia cannot function without this step, as opposed to being functional but not deployable without this step.

And that's the bad news, if you want to call it that. To get the app to run anywhere (not just prod), it has to do this right.

So this PR will handle that, and will in so doing address this ticket (which, according to the ticket, "has been a struggle," but is nonetheless a necessity): #77

Challenges

Here's the first issue: admin users (from the authentication scheme implemented in Django, where information about the authenticated user is available everywhere) are used to create Pipeline Stages and Pipelines. They are not the same as Zooniverse-authenticated users (from the authentication scheme implemented in Panoptes, which theia accesses through OAuth2 via a Python library called social-auth; this provides us with a bearer token, a refresh token, and an expiration). We use the latter to upload to Panoptes, but they're not automatically available everywhere in the app.

No problem, right? We'll just take the authentication information and pass it through to the upload task. Unfortunately, it's not that simple. Each task, called a pipeline stage, is independently instantiated based on its position in a Pipeline assigned to each ImageryRequest.

So, here's what we're doing: we authenticate through social-auth, we store the resulting tokens with Django's sessions API (designed precisely for saving things like this), and we pass them to any ImageryRequest made by this client, so they're available at upload time.
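A rough sketch of that handoff (the function and field names here are illustrative, not theia's actual ones; the response dict mirrors what social-auth typically hands back):

```python
from datetime import datetime, timedelta

def build_token_payload(oauth_response, now=None):
    """Shape a social-auth token response into the fields saved at login.

    `oauth_response` is assumed to carry access_token, refresh_token,
    and expires_in (seconds), as OAuth2 token responses generally do.
    """
    now = now or datetime.now()
    return {
        "bearer_token": oauth_response["access_token"],
        "refresh_token": oauth_response["refresh_token"],
        "bearer_expires": now + timedelta(seconds=oauth_response["expires_in"]),
    }

# In a Django view this payload would go into the session, e.g.:
#   request.session["panoptes_tokens"] = build_token_payload(response)
# and later be copied onto the ImageryRequest at creation time.
```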

+ only users logged in through social auth to an authorized Zooniverse account can upload subjects to a project
@chelseatroy chelseatroy requested a review from zwolf January 29, 2020 20:31
Member

@zwolf zwolf left a comment


I'm not sure it's a good idea to store full user credentials (bearer and refresh tokens) in the db. Can they just be stored in and pulled from the session (or regenerated via the API) whenever a user creates a new ImageryRequest? Maybe @camallen has a take on this.

@@ -34,6 +34,10 @@ class ImageryRequest(models.Model):
project = models.ForeignKey(Project, related_name='imagery_requests', on_delete=models.CASCADE)
pipeline = models.ForeignKey(Pipeline, related_name='imagery_requests', on_delete=models.CASCADE)

bearer_token = models.CharField(max_length=2048, default="tist")
Member


It's kind of hard to tell--does this actually store the tokens in the db as fields on an imagery_request table? Or are these just built in memory, per-request?

Contributor

@camallen camallen left a comment


I'm not sure it's a good idea to store full user credentials (bearer and refresh tokens) in the db. Can they just be stored in and pulled from the session (or regenerated via the API) whenever a user creates a new ImageryRequest? Maybe @camallen has a take on this.

It appears to be session storage (cookie?) for the token data.

I think folks know this already but I'll note it anyway. Our normal flow for client-initiated requests to server apps is: the client app supplies the token, either through credential auth flows (PFE, etc.) or through the OAuth implicit flow (CFEs). Once the client has a token, it stores it in memory for use and relies on Panoptes session cookies for persisted auth flows.

The app is then responsible for validating the client tokens against the authorization rules (Tove / Caesar etc. do this).

If authorization passes, then the app uses its own OAuth application to authenticate and interact with the API resources. I.e. the app acts as a proxy for the initiating client's requests with its own user / auth setup.

happy to answer questions / review here as well.

p.logged_in = True
p.refresh_token = response['refresh_token']
p.bearer_expires = (datetime.now() + timedelta(seconds=response['expires_in']))
authenticated_panoptes = Panoptes(
Contributor


Noting that this request will return the user details of the Zooniverse OAuth application owner, not the actual user that is initiating the request (unless they own the OAuth application).

Unless I'm missing how the Panoptes social module works, normally the client creds flow returns the token for the owning user.

@chelseatroy
Contributor Author

Indeed, we associate the tokens with individual imagery requests.

I tried for a few weeks to find a way to rely on django's built-in session concept to store these. The rub is this: django-rest isn't really designed such that its built-in concepts are accessible to the worker threads. This becomes a problem when, say, the worker that initiated the imagery request is not the same worker that picks up the job to upload to Panoptes. This was evidently an issue for a long time: Amy's tickets suggest that she tried to find a way to get this to work, too, and ultimately left the ticket undone with the note that it had been "a struggle." At the time, I think she thought we could upload photos to any project without authenticating, making deployment of an operational version of theia dangerous but possible without authentication. As it happens, that's not the case (nor should it be), because Panoptes doesn't let just anyone throw images into a project.

Attaching the credentials to the imagery request allows us to pass them along with the request itself such that images can be uploaded with an authenticated Panoptes call.

Not ideal, but here's why I think it's okay, from a security perspective:

  1. The only thing that knows about these tokens is the imagery request, and it's a sensible delegation of responsibility: this is an individual request to fetch some images, do stuff to them, and put them on Panoptes, and those credentials will be used for the putting-on-Panoptes part.

  2. The tokens expire after two hours. So if a hacker were to pull our whole db, they can't use or refresh credentials outside that window.

  3. Additionally, the tokens don't encode information about the users themselves, so their utility as an exploitative tool is zero after expiration.

Eventually, I'd like to make a new ticket to do as @zwolf suggests: essentially, re-authenticate. For it to work with social_auth, though, users will have to be asked to manually re-authenticate with a button or something immediately before posting an imagery request. If we attempt to do it automatically, we are violating EULA privacy rules about accessing users' other accounts without their explicit knowledge and permission for a particular application.

However, to create a flow where a user gives us that permission and then makes an imagery request, we need our own UI, because django-rest's default one won't do that for us. I didn't want this ticket to scope creep all the way to "create a UI for an app we said we're not creating a UI for in this iteration."

@zwolf
Member

zwolf commented Feb 12, 2020

It does look like it's using the session, but the app is backed by a postgres db and those are fields on the ImageryRequest table. I'd have to dive further in to check if they're actually being stored, but at a glance, it's possible.

edit: jinx, that was like at the same exact time.

The tokens expire after two hours. So if a hacker were to pull our whole db, they can't use or refresh credentials outside that window.

Our refresh tokens don't expire. That hacker would be able to pull fresh bearer tokens from Panoptes with any stored refresh tokens.

Additionally, the tokens don't encode information about the users themselves, so their utility as an exploitative tool is zero after expiration.

The bearer tokens are JWTs that include user id, login, display name, a few other things. Decodable with the Panoptes public key and all public data, but not nothing.

For it to work with social_auth, though, users will have to be asked to manually re-authenticate with a button or something immediately before posting an imagery request.

As @camallen mentioned, another option is to use the credentials in the request header or Panoptes cookie to pull user roles and determine permissions therefrom, at which point the app can use its own client creds to perform the action. What happens if a user creates an ImageryRequest that they don't actually have permission on Panoptes to perform? Ensuring the request is allowed first in the controller/request layer and then backgrounding the whole thing using client creds would mean you don't need to store anything at all.

@camallen
Contributor

yep - Zach's correct about the token & refresh token details

Additionally, the tokens don't encode information about the users themselves, so their utility as an exploitative tool is zero after expiration.

Our JWT encode public information about a user via https://github.com/zooniverse/Panoptes/blob/38dd4acf94be8f8b16df552e53561bbb08ad5ccc/config/initializers/doorkeeper.rb#L76-L81

Zach points out what i think is a simpler application data flow (and one we use in caesar and other apps).

  1. User requests an authentication-protected resource on the theia app
  2. Theia requests user authentication via the zoo auth API, or redirects (implicit OAuth flow) to the zoo auth API
  3. Theia receives the user bearer token and stores it in a session cookie for the requesting user for reuse (the browser will send this on every request)
  4. User then requests authenticated routes again and is allowed by the Theia app through to the authorization layer
  5. The Theia app checks the user token against authorization rules (cached state in theia or a zoo API call)
  6. If authorization is approved, the request is backgrounded for future work, noting the resources to link to at the API (project, subject set, etc.)
  7. The background worker uses another OAuth application (with rights on the project, or admin) to get a token to talk to the API (client creds OAuth flow). This OAuth application is controlled by the team (Theia can keep its secret safe as it's server side, and we can revoke it quickly if needed)

Thus any interaction past the authentication / authorization layer in Theia is by non-user OAuth applications / tokens. Instead we use a special OAuth application that the team has set up that allows the correct access rights on the desired project resources (subject sets, etc.).

The way we've done this in the past is to add special oauth apps for each custom service. Specifically the special oauth application is owned by a special zoo user that we setup and control. We then set that special user up with rights on the desired projects to allow access to the desired project resources.
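The authorize-once-then-proxy shape of that flow can be sketched as follows. This is a hypothetical gate, not theia's code: the role names, function names, and the assumed shape of the role records (resembling what a Panoptes project_roles query returns) are all illustrative.

```python
# Roles we'd accept as sufficient to start a pipeline (assumption).
ALLOWED_ROLES = {"owner", "collaborator"}

def user_may_upload(project_roles):
    """project_roles: a list of role dicts for this user on this project,
    e.g. [{"roles": ["collaborator"], "links": {"project": "1234"}}]."""
    for record in project_roles:
        if ALLOWED_ROLES.intersection(record.get("roles", [])):
            return True
    return False

def handle_imagery_request(project_roles, enqueue):
    # Reject before any work is queued; past this point no user token
    # is needed -- the worker runs under the team-owned OAuth app.
    if not user_may_upload(project_roles):
        raise PermissionError("user lacks upload rights on this project")
    enqueue()  # background job authenticates with the special OAuth app
```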

I'd be happy to jump on a call to discuss this flow and any trade-offs that come with it. I think switching to this flow would help you solve the problem you're facing, but it might come with significant development cost as well.

That said, the tradeoff currently is that we increase our attack surface and provide a possible way to attack the platform's users.

@chelseatroy
Contributor Author

chelseatroy commented Feb 17, 2020

These are all good points: thank you @zwolf and @camallen!

Zach, good point about the tokens. Now that I think about it, of course the way JWT tokens are created is by encoding user data. Thank you for reminding me of that!

And Cam, your numbered list above gave me an idea as to how I might make this work for theia, by essentially performing social auth and the upload step in the same flow. I don't believe we need the social auth until the upload step, so what remains is to determine whether the UI for this will work out of the box or would require a new flow.

So I'll run a couple of experiments with shorter feedback loops to determine if that will work before I propose another solution there. I'll be back on this ticket just as soon as I get the file system changes between tasks sorted out :).

But, all messages so far are seen, appreciated, and knocking around in my brain. Thank you!

@chelseatroy
Contributor Author

Okay. I spent the later part of today trying out some approaches. Given that the plan is to fundamentally change the implementation, I'm going to close this PR and make a new one once I have the change up, and link back to this PR in that PR's description for context.

@chelseatroy
Contributor Author

So I tried the fundamental change to the implementation. I tried it three different ways over the weekend. Unless I'm missing something, we'd have to rearchitect theia to make it work :(.

Here's why: the client (a researcher) makes a request through the app, in the browser. Django-rest stores auth tokens in sessions on the browser.

The actual work (including uploading to Panoptes) happens on a worker somewhere; not the browser, no access to the session.

The app and the worker both have access to two things: the file system and the database. The app puts some stuff in the database, and the worker fetches it out. The app can save downloaded ESPA data to the file system, and workers can access it to begin the processing tasks. Communication between app and worker does not happen in ViewSets (the Django version of controllers); it happens in post-save hooks on the database models themselves. I tried two separate pass-through approaches to get around this. Neither one worked. I don't see a way, in theia's current architecture, not to have the session in either the file system or the database for the worker to access when it's time to upload to Panoptes, because the worker can't just re-authenticate: it needs the client (researcher)'s social auth, which of course the researcher must grant, and the researcher's contact with the system is through the app...in the browser.
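A framework-free stand-in for that post-save-hook handoff (in theia itself this would be Django's `post_save` signal; the class and hook names here are illustrative):

```python
# Minimal mimic of "communication happens in post-save hooks on the
# models, not in ViewSets": saving the model row is what triggers the
# pipeline, regardless of which process did the saving.

_post_save_hooks = []

def post_save(hook):
    """Register a callback to run whenever a model instance is saved."""
    _post_save_hooks.append(hook)
    return hook

class ImageryRequest:
    saved = []  # stands in for the shared database table

    def save(self):
        ImageryRequest.saved.append(self)
        for hook in _post_save_hooks:
            hook(self)

@post_save
def start_pipeline(request):
    # In theia this is where the first pipeline stage would be enqueued
    # for a worker process to pick up.
    request.pipeline_started = True
```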

However, here's what we can do (@trouille, we spoke about this a bit today). We can both limit the amount of time that that info is in the database, and we can also encrypt it.

Currently it's in there in plaintext, and each token expires after 2 hours. So a theoretical unauthorized party has two hours' time in which a session is valid, and they only need access to the database to get it...not the file system.

So my next plan is to do this:

  1. When imagery request is made, generate a one-time-use key. Store the key on the file system with the other files specific to that request.
  2. Encrypt the session data with the key. Store the encrypted data in the database.
  3. "Upload to Panoptes" is always, in our system, the last step of a pipeline. So when that step completes, we delete both the key from the file system and the encrypted session from the database.

This brings the attack surface down by somewhere above 75%. That's because a job generally takes 20-25 minutes: less than a quarter of the two-hour window in which a session is valid. Also, during those minutes, an unauthorized party would need access to both the file system and the database; one or the other would not do.
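The three steps above can be sketched like this. The XOR one-time pad is purely for illustration (stdlib only, and genuinely one-time since the key is as long as the data and deleted after upload); a real implementation should reach for a vetted primitive such as Fernet from the `cryptography` package. All names are hypothetical.

```python
import secrets

def encrypt_session(plaintext: bytes):
    """Step 1-2: generate a one-time key and encrypt the session data.

    The key would be written to the request's directory on the file
    system; the ciphertext would go into the database row.
    """
    key = secrets.token_bytes(len(plaintext))  # fresh key per request
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return key, ciphertext

def decrypt_session(key: bytes, ciphertext: bytes) -> bytes:
    """Used by the upload stage; afterwards (step 3) both the key file
    and the encrypted db column would be deleted."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))
```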

I'll take a crack at this in a PR starting tomorrow.

@zwolf
Member

zwolf commented Mar 17, 2020

I think the flow that Cam and I are describing could still work. It relies on requiring the user's credentials at a single point: when the initial request to Theia is made. Right then, you can check their permissions on the specific project via the Panoptes API project_roles endpoint. Only if they have whatever permission you require would you save the request and start the whole pipeline.

Once that happens, though, you no longer need their specific credentials to use the API. You can use a different OAuth application that also has permission on the project and assume that since the request was authorized initially that you can go ahead. Nothing you're doing needs to be done with the individual user's token.

The way this is done in our other apps is that the OAuth/Doorkeeper app's credentials (client_id & client_secret) are stored in a Kubernetes secret and loaded via the Dockerfile as environment variables when the container image is built. That way, you don't need to store credz in the db or on the filesystem, they're loaded up by docker and you use them however you use env vars in python to instantiate your API client.
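Sketched in Python, assuming the env var names are ours to choose and that the panoptes-client library's `Panoptes.connect` accepts client credentials (worth confirming against its docs):

```python
import os

# The OAuth app's id/secret arrive as environment variables (e.g. loaded
# from a Kubernetes secret at container build/start time), so nothing
# sensitive ever touches the db or the file system.

def panoptes_credentials():
    """Read the app's client credentials from the environment.

    Raises KeyError if either variable is missing, which is the
    behavior we'd want: fail loudly at startup, not at upload time.
    """
    client_id = os.environ["PANOPTES_CLIENT_ID"]       # hypothetical name
    client_secret = os.environ["PANOPTES_CLIENT_SECRET"]  # hypothetical name
    return client_id, client_secret

# Then, roughly:
#   from panoptes_client import Panoptes
#   client_id, client_secret = panoptes_credentials()
#   Panoptes.connect(client_id=client_id, client_secret=client_secret)
```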

Happy to talk about this tomorrow on our call, just wanted to get this to you while I was thinking about it.

@chelseatroy
Contributor Author

chelseatroy commented Mar 17, 2020

You can use a different OAuth application that also has permission on the project and assume that since the request was authorized initially that you can go ahead. Nothing you're doing needs to be done with the individual user's token.

@zwolf thank you for those details. Are we able to create an OAuth application that has permission to upload data to any project in Panoptes?

My understanding, from having considered something like this when initially setting up the OAuth client for theia that day at the Zooniverse table in November, was that we could not do this. That's why I had ruled it out. But if that can be done, then I can try it this way.

@camallen
Contributor

If it's one project then we can add an oauth app for a special user and add that user as a collab on the project.

If this pipeline will be more widely used for other projects, we get the project owners to create an oauth app as a special user / themselves and they can add the client id / secrets to the theia system so it can be used to get a token as them.

I'd prefer the former, as we control it, but long term the latter is viable, though we would have to secure the credentials in the theia db.

@chelseatroy
Contributor Author

chelseatroy commented Mar 18, 2020

I'd prefer the former as we control it but long term the later is viable, though we will have to secure the credentials in the theia db.

So, hold up. My understanding was that we want to not be putting credentials in the theia db for security reasons. That is what the current implementation does, and we determined that my plan for securing those credentials in the theia db was not enough.

I guess I thought I understood what we want to do here, but based on the above I think I still don't.

My understanding was that we should:

  1. Authenticate the client (researcher) with social auth in theia
  2. When they make a request, ensure that they (the researcher) has access to the specific project they have done this request for
  3. Once that is done, authentication with the client (researcher) is over, and theia has its own credentials with upload rights to that project, and all projects, that are serviced through theia. And that is what is used for upload. Those credentials are stored in environment variables. Each time another project wants to use theia, we have to add this theia user to their project with upload rights. But this one theia user has upload rights to all projects using theia to get images to Panoptes, and we don't have to make new oauth apps for every project that wants to use theia.

@zwolf
Member

zwolf commented Mar 18, 2020

Option 2, as @camallen outlines above, I didn't mention and I wouldn't recommend either. My suggestion is to combine the two: have an app created by a special user that is added as a collaborator to any project you want to enable in Theia (could also give it admin rights, but that's not strictly necessary). We would have to keep the credentials in the theia app, not its actual DB. See paragraph 3 in my comment above.

I'm not sure why that wasn't one of @camallen's suggestions; I'm confused by what he means also.

@camallen
Contributor

camallen commented Mar 18, 2020

Apologies for muddying the water, we don't want to store private data in the db if we can avoid it.

I thought I did address the optimal path forward, but I could have been clearer:

If it's one project then we can add an oauth app for a special user and add that user as a collab on the project.

We should pursue the clear path @chelseatroy outlined above

  1. Authenticate the client (researcher) with social auth in theia
  2. when they make a request, ensure that they (the researcher) has access to the specific project they have done this request for
  3. Once that is done, authentication with the client (researcher) is over, and theia has its own credentials with upload rights to that project, and all projects, that are serviced through theia. And that is what is used for upload. Those credentials are stored in environment variables. Each time another project wants to use theia, we have to add this theia user to their project with upload rights. But this one theia user has upload rights to all projects using theia to get images to Panoptes, and we don't have to make new oauth apps for every project that wants to use theia.

In summary:

  1. we add a special theia user to the known user accounts in the API
  2. we make an oauth app for that user
  3. we add that user to theia projects with the correct rights
  4. the theia system uses the OAuth client credentials flow (using the Python client) to get tokens to act as that user on their projects

@yuenmichelle1 yuenmichelle1 deleted the upload-authentication branch April 12, 2024 19:28