simplify Dataverse content provider (support datasets only) jupyterhub#1388

When the Dataverse content provider was added in jupyterhub#739, it had
the flexibility to operate directly on Dataverse files like this:

repo2docker https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/6ZXAGT/3YRRYJ

However, operating only on datasets is enough (in Dataverse, files
are always stored within datasets). That is, this will still work:

repo2docker doi:10.7910/DVN/TJCLKP

And that's all we need.

This simplification builds upon the work in jupyterhub#1388, where
resolving the DOI of the dataset no longer retrieves the content of
the dataset landing page. Instead, only the redirect location is
fetched, which is all the Dataverse content provider needs to
determine which of the 100+ installations of Dataverse hosts the DOI.
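
To illustrate the mechanism, here is a minimal sketch (not the
provider's actual code) of how a single redirect lookup is enough to
identify the hosting installation; it assumes the third-party
`requests` library and reuses the dataset DOI from the example above:

import requests
from urllib.parse import urlparse

def resolve_doi_host(doi):
    """Return the hostname that doi.org redirects a DOI to,
    without downloading the landing page itself."""
    resp = requests.head(
        "https://doi.org/" + doi.removeprefix("doi:"),
        allow_redirects=False,
        timeout=30,
    )
    # doi.org answers with a redirect whose Location header points at
    # the hosting installation; the hostname alone tells us which of
    # the 100+ Dataverse installations serves this DOI.
    return urlparse(resp.headers["Location"]).netloc

print(resolve_doi_host("doi:10.7910/DVN/TJCLKP"))  # dataverse.harvard.edu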

This change should be a no-op for any installation of Dataverse with
Binder integration enabled.

Harvard Dataverse specifically (one of the 100+ installations) is
not working with Binder because a firewall is blocking
https://dataverse.harvard.edu/citation
The simplification in this commit means that the Dataverse
content provider no longer needs to follow `/citation` to determine
what is on the other side (dataset.xhtml, file.xhtml, etc.). It now
assumes that the DOI is always for a dataset (not a file), which is
the expectation we have always set for the Binder tool.
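
As an illustration of the datasets-only assumption, here is a sketch
(simplified from the diff below; plain indexing stands in for the
provider's `deep_get` helper) that reads the record id straight from
the redirect target's query string, with no follow-up request to
`/citation`:

from urllib.parse import parse_qs, urlparse

def record_id_from_landing_url(url):
    # The redirect target of a dataset DOI is a dataset.xhtml URL, so
    # the persistentId can be read directly from its query string.
    query_args = parse_qs(urlparse(url).query)
    return query_args["persistentId"][0]

url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP"
print(record_id_from_landing_url(url))  # doi:10.7910/DVN/TJCLKP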

We are tracking Binder not working with Harvard Dataverse here:
IQSS/dataverse.harvard.edu#328
pdurbin committed Dec 16, 2024
1 parent 0e6f3b8 commit b54ec1d
32 changes: 1 addition & 31 deletions repo2docker/contentproviders/dataverse.py
@@ -54,37 +54,7 @@ def detect(self, doi, ref=None, extra_args=None):
             return
 
         query_args = parse_qs(parsed_url.query)
-        # Corner case handling
-        if parsed_url.path.startswith("/file.xhtml"):
-            # There's no way of getting file information using its persistentId, the only thing we can do is assume that doi
-            # is structured as "doi:<dataset_doi>/<file_doi>" and try to handle dataset that way.
-            new_doi = doi.rsplit("/", 1)[0]
-            if new_doi == doi:
-                # tough luck :( Avoid inifite recursion and exit.
-                return
-            return self.detect(new_doi)
-        elif parsed_url.path.startswith("/api/access/datafile"):
-            # Raw url pointing to a datafile is a typical output from an External Tool integration
-            entity_id = os.path.basename(parsed_url.path)
-            search_query = "q=entityId:" + entity_id + "&type=file"
-            # Knowing the file identifier query search api to get parent dataset
-            search_url = urlunparse(
-                parsed_url._replace(path="/api/search", query=search_query)
-            )
-            self.log.debug("Querying Dataverse: " + search_url)
-            data = self.urlopen(search_url).json()["data"]
-            if data["count_in_response"] != 1:
-                self.log.debug(
-                    f"Dataverse search query failed!\n - doi: {doi}\n - url: {url}\n - resp: {json.dump(data)}\n"
-                )
-                return
-
-            self.record_id = deep_get(data, "items.0.dataset_persistent_id")
-        elif (
-            parsed_url.path.startswith("/dataset.xhtml")
-            and "persistentId" in query_args
-        ):
-            self.record_id = deep_get(query_args, "persistentId.0")
+        self.record_id = deep_get(query_args, "persistentId.0")
 
         if hasattr(self, "record_id"):
             return {"record": self.record_id, "host": host}
