Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Key and User-Agent should preferrably be passed as headers #192

Closed
shoeffner opened this issue Jul 17, 2024 · 6 comments · Fixed by #201
Closed

API Key and User-Agent should preferrably be passed as headers #192

shoeffner opened this issue Jul 17, 2024 · 6 comments · Fixed by #201
Assignees
Labels
status:incoming Newly created issue to be forwarded

Comments

@shoeffner
Copy link
Collaborator

shoeffner commented Jul 17, 2024

Issue

The API key and User-Agent are currently passed as GET parameters, ideally they should be passed as headers. User-Agents are commonly passed as headers, and for the key this is actually preferred.
It also is slightly more secure, as the dataverse API will return the full request URL on error, possibly exposing API keys in logs:

{"status":"ERROR","code":404,"message":"API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"http://localhost:8080/api/v1/access/datafile/:persistentId/?persistentId=23&User-Agent=pydataverse&key=<the API token>","requestMethod":"GET"}

(taken from datalad/datalad-dataverse#310, where I found out about this; it also contains a little bit of context)

I am not sure if this request is a bug or feature, so I decided to open a normal issue. The pyDataverse version I use is 0.3.3, and the code doing this is in pyDataverse/api.py:

def get_request(self, url, params=None, auth=False):
"""Make a GET request.
Parameters
----------
url : str
Full URL.
params : dict
Dictionary of parameters to be passed with the request.
Defaults to `None`.
auth : bool
Should an api token be sent in the request. Defaults to `False`.
Returns
-------
class:`requests.Response`
Response object of requests library.
"""
params = {}
params["User-Agent"] = "pydataverse"
if self.api_token:
params["key"] = str(self.api_token)
if self.client is None:
return self._sync_request(
method=httpx.get,
url=url,
params=params,
)
else:
return self._async_request(
method=self.client.get,
url=url,
params=params,
)

The same is done for post_request, etc.

A possible change would be to use headers and pass them as headers:

        headers = {}
        headers["User-Agent"] = "pydataverse"
        if self.api_token:
            headers["X-Dataverse-key"] = str(self.api_token)

        if self.client is None:
            return self._sync_request(
                method=httpx.get,
                url=url,
                headers=headers,
            )

Additionally, I noticed that neither the params nor the auth attributes are used.

For the params a fix could be, especially after moving the key and user-agent to the headers:

self._sync_request(
    method=httpx.get,
    url=url,
    headers=headers,
    params=params or {},  # maybe httpx.get even handles params=None, in that case they could simply be passed through
)

For the auth parameter it's a bit more difficult, since it currently does not affect the behavior although it should.
One possible way would be to update the if statement before adding the token to the header:

if auth and self.api_token:

However, it might make sense to change the default value to True in this case, to keep the behavior stable. However, I am not 100% sure what is the best way to handle this. Of course, one option would also be to remove the auth argument completely.

Lastly, the docstring of the function mentions the requests.Response, while it's now a httpx.Response. While they do have a similar interface, they have some minor differences; e.g., requests has an .ok property and uses iter_content, while httpx does not have .ok and uses iter_bytes.

I think I could try to "fix" this and draft a PR.

@shoeffner shoeffner added the status:incoming Newly created issue to be forwarded label Jul 17, 2024
@JR-1991 JR-1991 self-assigned this Jul 17, 2024
@JR-1991 JR-1991 moved this from Todo to QA in PyDataverse Working Group Jul 17, 2024
@JR-1991
Copy link
Member

JR-1991 commented Jul 17, 2024

Thanks for pointing these issues out, @shoeffner! I completely agree – the API Key should be included in the headers, and it looks like auth and params are either redundant or unused. Here’s what I’m thinking:

  • Option 1: We put the API-Token directly into the headers and remove the auth parameter.
  • Option 2: We create a custom class using httpx.Auth to add the key to the header, like in this example. The auth parameter stays, and if it’s None, our custom class handles authentication.

The custom class seems more future-proof, especially if Dataverse introduces new authentication methods. The direct fix is quicker, but I’m leaning towards the custom class and keeping auth.

As for the params argument, we haven’t used it yet since query parameters are added to the URL string directly. However, I think we’re missing out on some great httpx features that could make our code more readable and stable. This would mean refactoring most endpoints, and since we’ve already discussed a refactor in #189, I suggest we include this in the api.py update.

@shoeffner
Copy link
Collaborator Author

Reworking auth to be non-boolean but instead httpx.Auth or None sounds like a great idea! In fact, in this case it might be as simple as adding

A callable, accepting a request and returning an authenticated request instance,
so there might not be a class required after all to get the mechanism going.

I can't say much about the refactoring in #189, as I am not familiar with the code base in general. But I agree that using a dict for params instead of manually building up the URLs is a good idea.

Let me know if I can help with anything.

@JR-1991
Copy link
Member

JR-1991 commented Jul 17, 2024

Great, that sounds like a plan! Would you be willing to open a PR for this issue? I’d be happy to help out with it.

By the way, we talked about this in today’s PyDataverse Working Group meeting with @pdurbin and @donsizemore. You can watch the recording here and check out our upcoming meetings on py.gdcc.io if you'd like to join next time. For discussions outside of these meetings, we also have a Zulip channel for everything related to Python and Dataverse.

@shoeffner
Copy link
Collaborator Author

Thanks @JR-1991, I'll take a chance implementing that.
I checked the recording (here's my gist from the relevant parts of the recording, for those in a rush and cannot access the notes or do not want to go through the recording):

  • It's in general a good idea to refactor the authentication to use httpx.Auth
  • While using the header (X-Dataverse-key) is a good first step, an even better but "create-new-issue-and-check-out-in-the-future" way is to use signed URLs

There was not much talk about the User-Agent or params, but as @JR-1991 already mentioned that the params might become part of #189, I'll leave that part out and focus on the first step for the authentication, that is using the X-Dataverse-key header.

@pdurbin
Copy link
Member

pdurbin commented Jul 18, 2024

focus on the first step for the authentication, that is using the X-Dataverse-key header

Sounds perfect. Thanks, @shoeffner! ❤️

You're welcome to test against https://demo.dataverse.org

Or you can run Dataverse in Docker, as mentioned in the README.

@shoeffner
Copy link
Collaborator Author

Thanks, I'll focus on the docker-setup for the moment, and once it's done and green, I'll go for the live instance :)

shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 18, 2024
It is still possible to use the api_token parameter.
In this case we assume an API token was grabbed from the
user profile of a Dataverse instance.

If both are specified, an api_token and an explicit auth method,
we will warn the user and use the auth method.

Closes gdcc#192.
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 18, 2024
It is still possible to use the api_token parameter.
In this case assume an API token was copied from the
user profile of a Dataverse instance.

If both are specified, an api_token and an explicit auth method,
warn the user and use the auth method.

Closes gdcc#192.
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 18, 2024
It is still possible to use the api_token parameter.
In this case assume an API token was copied from the
user profile of a Dataverse instance.

If both are specified, an api_token and an explicit auth method,
warn the user and use the auth method.

Closes gdcc#192.
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 19, 2024
It is still possible to use the api_token parameter.
In this case assume an API token was copied from the
user profile of a Dataverse instance.

If both are specified, an api_token and an explicit auth method,
warn the user and use the auth method.

Closes gdcc#192.
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Jul 23, 2024
shoeffner added a commit to shoeffner/pyDataverse that referenced this issue Aug 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:incoming Newly created issue to be forwarded
Projects
Development

Successfully merging a pull request may close this issue.

3 participants