Add localizer for DRS URIs, and update GDC handler to use it when available #146

Open
dheiman opened this issue Feb 14, 2024 · 7 comments
Labels
enhancement New feature or request Triage New issues which haven't been assigned to a project and need attention

Comments

@dheiman
Contributor

dheiman commented Feb 14, 2024

Is your feature request related to a problem? Please describe.
Much dbGaP data is available in cloud buckets via signed URLs, which can be generated by calling the DRSHub API from a Terra account linked to the appropriate provider. Currently, for data hosted by the GDC, we use the GDC API, which is slow and prone to crashing.

Describe the solution you'd like
Use the DRSHub API to request a signed URL and, if one is available, download from it rather than through the GDC API.

Additional context
Manual testing to confirm it works outside of the Broad network:

% curl --request POST  --url "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve"  \
--header "authorization: Bearer $(gcloud auth print-access-token)" \
--header 'content-type: application/json' \
--data '{ "url": "drs://dg.4dfc:7d1726dc-1261-4db9-adea-3adbcb2ffa28", "fields": ["size", "name", "accessUrl", "hashes"] }'
{
  "size" : 9434312,
  "name" : "7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai",
  "accessUrl" : {
    "url" : "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7",
    "headers" : null
  },
  "hashes" : {
    "md5" : "5290f15d8e95e2660fe6d15a5f4e9dd9"
  }
}

% curl -OJ -H 'Connection: keep-alive' --keepalive-time 2 "https://gdc-tcga-phs000178-controlled.storage.googleapis.com/7d1726dc-1261-4db9-adea-3adbcb2ffa28/7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai?x-goog-algorithm=GOOG4-RSA-SHA256&x-goog-credential=dheiman-666%40dcf-prod.iam.gserviceaccount.com%2F20240214%2Fauto%2Fstorage%2Fgoog4_request&x-goog-date=20240214T144326Z&x-goog-expires=3600&x-goog-signedheaders=host&x-goog-signature=6e20a8b207483ba7dfdcf62bcefd15e357157fe379942054e4e6380e30537bacfbc372c4cd0142102941137df831bc27eb699c7b8dad99fbd76b9b9473af38419f69830a32cc1ccee512057c35fc9c512c82576a15c9182489d61511b46f384e1a52a895786cc0f6d8d4e71883e0b24c6d5e90c778d65d8a860b8abb511b7016b2f00cff5ee489aef44bd1c03130e57b20002ed9542c933e01e9ab8211a605a6a6a0b16ae5a8e072b1ad9477c0e7ccb6bb92464f242014d4b387e4f6e3cc5e9d28f455d757f4e43a74dbd9292ddaa1e74558462f87fff7e9e328b87adcc90fe8d007c95abf5ed770cdad1012ea2fcf8c72a22a9c57eb52e6ab16d76111986bd7"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9213k  100 9213k    0     0  10.6M      0 --:--:-- --:--:-- --:--:-- 10.6M

% ls -la
total 18992
drwxr-xr-x  15 dheiman  staff      480 Feb 14 09:40 .
drwxr-xr-x@ 19 dheiman  staff      608 Jan 11 14:27 ..
-rw-r--r--@  1 dheiman  staff  9434312 Feb 14 09:40 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai

% md5 7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai
MD5 (7812f73a-4603-44b4-84c4-cefd5fc654a8_wgs_gdc_realn.bai) = 5290f15d8e95e2660fe6d15a5f4e9dd9
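
For reference, the manual test above could be scripted roughly as follows. This is only a sketch under assumptions: the names `build_resolve_request` and `md5_matches` are made up for illustration and are not part of the existing localizer code, and the token is assumed to be whatever `gcloud auth print-access-token` returns. Only the endpoint, the request body, and the response fields come from the curl transcript.

```python
import hashlib
import json
import urllib.request

# Endpoint taken from the curl command in the manual test.
DRSHUB_RESOLVE_URL = "https://drshub.dsde-prod.broadinstitute.org/api/v4/drs/resolve"


def build_resolve_request(drs_uri, token):
    """Build (but do not send) the POST request used in the manual test."""
    body = json.dumps({
        "url": drs_uri,
        "fields": ["size", "name", "accessUrl", "hashes"],
    }).encode("utf-8")
    return urllib.request.Request(
        DRSHUB_RESOLVE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Compare a downloaded file's MD5 with the hashes.md5 value from the response."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large BAM files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

Sending the request would be `urllib.request.urlopen(req)`, with the signed URL read from `accessUrl.url` in the decoded JSON, as in the response shown in the transcript.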
dheiman added the enhancement (New feature or request) and Triage (New issues which haven't been assigned to a project and need attention) labels Feb 14, 2024
@dheiman
Contributor Author

dheiman commented Feb 14, 2024

Note: The signed URL only lasts for an hour, so if there is a failure, a new one should be generated.
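
One way to decide when to regenerate is to read the expiry out of the signed URL itself, since the URL in the manual test carries `x-goog-date` and `x-goog-expires` query parameters. A minimal sketch, assuming those two parameters are always present; the function names and the 60-second safety margin are arbitrary choices for illustration:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC expiry time encoded in the signed URL's query string."""
    # Match parameter names case-insensitively; the URL in the manual test uses lowercase.
    params = {k.lower(): v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    signed_at = datetime.strptime(params["x-goog-date"], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(params["x-goog-expires"]))


def needs_refresh(url, now=None, margin_seconds=60):
    """True if the URL has expired or will expire within the safety margin."""
    now = now or datetime.now(timezone.utc)
    return now >= signed_url_expiry(url) - timedelta(seconds=margin_seconds)
```

A retry loop could call `needs_refresh` before each attempt and ask the resolver for a new URL when it returns true.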

@julianhess
Collaborator

julianhess commented Feb 14, 2024

Do we know what the failure modes of that API endpoint are? We'd want to catch those in the localization plugin. Alternatively, if it's an easy API to reverse engineer, maybe do that (à la what Qing did)? I don't exactly trust DSP APIs to be reliable under heavy load.

@dheiman
Contributor Author

dheiman commented Feb 14, 2024

Failure modes are response codes 404 or 500; success is 200.
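
To make that concrete, the plugin could turn those status codes into a single exception type. This is a sketch, not the actual plugin code: `DRSResolveError` and `signed_url_from_response` are hypothetical names, and treating 500 as retryable and 404 as final is my assumption rather than documented resolver behavior.

```python
class DRSResolveError(Exception):
    """Raised when the resolver does not return a usable signed URL (hypothetical name)."""

    def __init__(self, status, retryable):
        super().__init__(f"DRS resolve failed with HTTP {status}")
        self.status = status
        self.retryable = retryable


def signed_url_from_response(status, payload):
    """Return the signed URL for a 200 response, otherwise raise DRSResolveError."""
    if status == 200:
        access = (payload or {}).get("accessUrl") or {}
        url = access.get("url")
        if url:
            return url
        # Assumption: a 200 without an accessUrl means no signed URL is available.
        raise DRSResolveError(status, retryable=False)
    # Assumption: a 500 may be transient and worth retrying; a 404 is not.
    raise DRSResolveError(status, retryable=(status == 500))
```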

I would only want to use the resolver API.

Once we have the signed URL, we're only dealing with Google.

I somewhat prefer this method because it enables use of all DRS URIs. Otherwise, we would need to keep up with every new one. With the resolver, any time a user needs access to a certain one, they won't have to figure out which provider to use, log in manually, and download yet another credentials file. In these cases, Terra does that for us, and the user only needs to ensure that their external account links are up to date.

@julianhess
Collaborator

Good point. OK, let's see how it holds up under pressure! 🤞

@dheiman
Contributor Author

dheiman commented Feb 15, 2024

Code is written but not yet tested: master...drs_via_drshub
It's a little hacky: I test whether a signed URL can be created by instantiating a new DRS URI resolver within a try-except block in the GDC resolver, falling back to the old code if it fails and updating the internal variables if it succeeds.
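
Roughly, the pattern looks like the sketch below. This is not the code on the branch: the class name, the `make_drs_resolver` factory, and the `signed_url` and `via_signed_url` attributes are all placeholders chosen for illustration.

```python
class GDCHandlerSketch:
    """Illustrative stand-in for the GDC handler; not the real class."""

    def __init__(self, url, make_drs_resolver):
        self.url = url
        self.via_signed_url = False
        try:
            # Hypothetical factory: raises if no signed URL can be created.
            resolver = make_drs_resolver(url)
        except Exception:
            # Any resolver failure means "keep using the old GDC API path".
            pass
        else:
            # Success: update the internal state to download from the signed URL.
            self.url = resolver.signed_url
            self.via_signed_url = True
```

The broad `except Exception` is what makes it hacky; narrowing it to the resolver's own error types would keep unrelated bugs from being silently swallowed by the fallback.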

@julianhess
Collaborator

Thanks so much!

Can you also please add an appropriate regex here so this automagically gets invoked for DRS URLs?

url_map = {
    r"^gs://" : HandleGSURL,
    r"^s3://" : HandleAWSURL,
    r"^https://api.gdc.cancer.gov" : HandleGDCHTTPURL,
    r"^https://api.awg.gdc.cancer.gov" : HandleGDCHTTPURL,
    r"^rodisk://" : HandleRODISKURL,
    r"^(?:ftp|https|http)://" : HandleOtherURL
} if url_map is None else url_map
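
Something along these lines is what I have in mind. It is a self-contained sketch: `HandleDRSURL` is a hypothetical name for the new handler, the classes are empty placeholders, and the first-match lookup in `pick_handler` is an assumption about how the map is consumed rather than a copy of the real dispatch code.

```python
import re


# Empty placeholders standing in for the real handler classes.
class HandleGSURL: ...
class HandleDRSURL: ...  # hypothetical name for the new DRS handler
class HandleOtherURL: ...


url_map = {
    r"^gs://": HandleGSURL,
    r"^drs://": HandleDRSURL,
    r"^(?:ftp|https|http)://": HandleOtherURL,
}


def pick_handler(url, url_map=url_map):
    """Return the handler for the first pattern that matches, or None."""
    for pattern, handler in url_map.items():
        if re.match(pattern, url):
            return handler
    return None
```

With this map, `pick_handler("drs://dg.4dfc:7d1726dc-1261-4db9-adea-3adbcb2ffa28")` selects the DRS handler, while plain `https://` URLs still fall through to the generic one.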

@dheiman
Contributor Author

dheiman commented Feb 15, 2024

Done!

2 participants