Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow file uploads for non-zip files through the API #1612

Closed
rliebz opened this issue Mar 6, 2015 · 33 comments
Closed

Allow file uploads for non-zip files through the API #1612

rliebz opened this issue Mar 6, 2015 · 33 comments

Comments

@rliebz
Copy link

rliebz commented Mar 6, 2015

Currently uploading files to a Dataset on Dataverse 4.0 is only possible using the Sword API described at http://guides.dataverse.org/en/latest/api/sword.html#add-files-to-a-dataset-with-a-zip-file. This has the disadvantage of requiring that files be zipped before upload.

I imagine allowing file uploads through the native API could allow files to be uploaded without first having to zip them.

@pdurbin pdurbin added this to the In Review - 4.0 milestone Mar 6, 2015
@samchrisinger
Copy link

👍

@scolapasta scolapasta modified the milestones: In Review - 4.0.x, In Review - 4.0 Mar 7, 2015
@chrisseto
Copy link

👍 I think having an endpoint that would accept multipart forms containing files would make the most sense.

@mercecrosas
Copy link
Member

I agree - the native upload API should allow both individuals files and zipped files (as it is the case through the web UI for data upload).

@pdurbin
Copy link
Member

pdurbin commented Mar 25, 2015

@rliebz @samchrisinger @chrisseto in Dataverse you can fill in a description for a file when you upload it via the GUI. Do you have any ideas of how you would ideally fill in a description for a file uploaded via an API endpoint? Would it be two curl commands or one? I'm talking about a non-zipped file, a PNG for example. Maybe the description is "a picture of my cat". How do we get that description to the Dataverse server?

@chrisseto
Copy link

@pdurbin I would suggest using multipart forms and then parsing them based on a the name attribute or just nesting them inside of each other. That way you could upload a any amount of files with metadata in a single request.

Some relevant RFCs are here and here.

Ideally a request would look like that from the bottom of the former link.

Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="file[]"
Content-Type: multipart/mixed; boundary=BbC04y
--BbC04y
Content-Disposition: name="description"
Content-Type: text/plain
This is my cat
--BbC04y
Content-Disposition: file; filename="cat.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary
   ...contents of cat.gif...
--BbC04y--
--AaB03x
Content-Disposition: form-data; name="file[]"
Content-Type: multipart/mixed; boundary=GEI93n
--GEI93n
Content-Disposition: name="description"
Content-Type: text/plain
This is my dog
--GEI93n
Content-Disposition: file; filename="dog.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary
   ...contents of dog.gif...
--GEI93n--
--AaB03x--

@samchrisinger
Copy link

👍

1 similar comment
@rliebz
Copy link
Author

rliebz commented Mar 26, 2015

👍

@pdurbin
Copy link
Member

pdurbin commented Mar 26, 2015

@chrisseto I like this idea! However, rather than...

Content-Disposition: name="description"
Content-Type: text/plain
This is my cat

... we actually might want to take this one step further since we support more metadata at the file level than just "description". We also support categories so perhaps instead of Content-Type: text/plain we could specify JSON that includes both the description and the categories like this:

{
  "description": "This is my cat",
  "categories": [
    "fuzzy",
    "cute",
    "moody"
  ]
}

And I suppose we could support more metadata at the file level in the future this way.

See also this related ticket that has more of an emphasize on the SWORD API than the Native API: Let file metadata (i.e. description) be specified during zip upload #723

@samchrisinger
Copy link

This sounds great.

@chrisseto
Copy link

@pdurbin from an API point of view allowing json file metadata would be a great idea!

You may want to keep in mind that if you kept everything as form fields then this endpoint could become the canonical file upload endpoint.
So an API upload would be a form post and when uploading from the website interface its just submitting a form to the same endpoint.

This way you can can have slightly DRYer code and only have to worry about testing a single endpoint.

Either way is great IMO 👍

@pdurbin
Copy link
Member

pdurbin commented May 22, 2015

Since I was just looking at the SWORD spec again for @posixeleni I'll mention that while the Dataverse implementation of SWORDv2 only accepts zip files, this was an implementation decision on my part.

Here's what the SWORD spec says:

"If the server does not supply any sword:packaging elements, the client MUST assume that the server only supports a simple zip file packaging format" -- http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#packaging

It was for this reason that I had Dataverse support a simple zip format for upload via SWORD. I figured I had to if clients are supposed to assume that format if a format is not specified.

All this is to say that we probably could support various formats in the SWORD API.

My preference, however, is to develop a "native" API endpoint for uploading files so we won't be constrained by SWORD (but that spec if full of good ideas about format negotiation and such... good job @richard-jones).

@keflavich
Copy link

👍

@raprasad
Copy link
Contributor

Having this API available is a dependency for the Sloan grant and adding subsets to the Dataverse.
@kcondon @scolapasta

@pdurbin
Copy link
Member

pdurbin commented Jan 19, 2017

@richarda23 hi! Perhaps you saw IQSS/dataverse-client-java#8 where I mentioned that the new "native" add and replace APIs are available for testing on https://dev1.dataverse.org

Everyone should please check out the announcement I made at https://groups.google.com/d/msg/dataverse-community/8WTs3wYF6dc/AzPOxzRKFwAJ for details about how we'd love API users out there to try out the new APIs before we ship them.

@richarda23 don't worry, the SWORD API will remain unchanged. We're adding new APIs based on feedback on this issue from @rliebz @samchrisinger @chrisseto @keflavich and others.

As usual, the API endpoints are documented in the API Guide, so please look for "/add" and "/replace" under "Native API" on this page for now http://guides.dataverse.org/en/2290-file-replace/api/native-api.html . We'll make a pull request soon for #2290 and this issue (#1612) will be associated with that pull request.

@otter606
Copy link

I had a look at the new file add/replace methods - this will be great if we can eventually drop the Sword dependency, as the Sword library we use in the Java client is moribund and not available on Maven Central. If we can use pure native Dataverse API, we can remove this dependency and we'll be able to more easily make the Java client available on Maven.

Anyway.. several comments.. probably some of these are not unique to these 2 methods:

  1. The documentation for 'add' says it's a PUT request but it seems it has to be a 'POST'

  2. Python examples are good but some basic curl examples would be nice too as people may not be too familiar with the 'requests' library . E.g. for adding a file to an existing dataset:

    curl -X POST -F [email protected] https://dev1.dataverse.org/api/v1/datasets/DATASETID/add?key=myKey

works nicely.
3. Can't quite figure out how to do a replace. Attempting to replace a file with another file that has identical content, but a different name is rejected ('already exists' - is this intentional?) Running what I think should work:

curl -X POST -F [email protected] https://dev1.dataverse.org/api/v1/files/FILEID/replace?key=myKey

fails with output:

{"status":"ERROR","message":"No JSON data"}

which I'm not quite sure what to do about - could the error message perhaps say something about what's needed to make the query correct?

@pdurbin
Copy link
Member

pdurbin commented Jan 20, 2017

@otter606 thanks for the feedback and I see you're working away on IQSS/dataverse-client-java#8 ! Much appreciated!

Yeah, the PUT vs. POST was a problem with the API Guide so I fixed it in a784a40. Thanks for catching that. Both "add" and "replace" use POST but now you've got me wondering if "replace" should use PUT instead. I need to refresh my REST thinking and maybe check with you, @michbarsinai @raprasad @landreev @scolapasta and others.

As for the No JSON data error, I can't see any reason why JSON should be required on replace so I dropped that requirement in b9a6172 and deployed this to https://dev1.dataverse.org . Since you're a Java developer, you may be interested in checking out that commit because it contains some API tests that I beefed up while I was in there making changes.

It is by design that a file with identical content (md5) is rejected by the "replace" API endpoint.

I hear you on the lack of curl examples for "add" and "replace". I need to switch gears but I'll try to come back to this. Honestly, I think the API Guide could be much better organized but I don't want to bite off too much.

Please keep the feedback coming! That goes for everyone! 😄 Thanks, @otter606 !!

@otter606
Copy link

I tried the replace method again but got a new error saying I didn't have permission, any ideas?:

Given an original successful command to add a file:
curl -X POST -F [email protected] https://dev1.dataverse.org/api/v1/datasets/352/add?key=d62c8

which produces
{ "status": "OK", "data": { "files": [ { "description": "", "label": "build3.gradle", "version": 1, "datasetVersionId": 132, "dataFile": { "id": 368, "filename": "build3.gradle", "contentType": "application/octet-stream", "filesize": 139, "description": "", "storageIdentifier": "159c2cf44ad-5b6c223c0e64", "originalFormatLabel": "UNKNOWN", "rootDataFileId": -1, "md5": "7d23b963111cfe818132e7a589642a36", "checksum": { "type": "MD5", "value": "7d23b963111cfe818132e7a589642a36" } } } ] } }
and using the id '368' as the id of the file to replace:
curl -X POST -F [email protected] https://dev1.dataverse.org/api/v1/files/368/replace?&key=d62....

gives an error message:
{"status":"ERROR","message":"You do not have permission to edit this dataset."}

@michbarsinai
Copy link
Member

Hi Phil,

From the REST point of view - I'm not super familiar with the issue itself - these are the semantics:

  • POST $url => "dear server, store the content of this message somewhere under $url. So if I take a resource and send it to a server in POST a/b/c/, I can reasonably expect it to have a the URL a/b/c/12345, but not a/x/7777 (it can have that URL additionally, but a URL under a/b/c/ is "mandatory"). Ideally, the server should return a 201 CREATED with the actual URL, rather than 200 OK*.
  • PUT $url => "dear server, store the content of this message at $url". Subsequent GET $url should return the sent content.

Generally, in the context of files, you'd expect new files to be POSTed to some dataset/files URL (so the server can generate a unique, legal, URL, if it decides to accept the files). File replaces could be done using PUT, since a replace action changes the content of a known, legal URL.

But again - that's the "HTTP purist view", in some cases it may make sense to bend these rules.

-- M

* The Dataverse API returns 201 for some endpoints and 200 on others - we should improve on that but it's not very pressing. At least, I'm not aware of anyone asking for a proper 201 response.

@otter606
Copy link

otter606 commented Jan 23, 2017 via email

@michbarsinai
Copy link
Member

michbarsinai commented Jan 24, 2017 via email

pdurbin added a commit that referenced this issue Jan 24, 2017
pdurbin added a commit that referenced this issue Jan 24, 2017
Also "add" example to phoenix scripts.
@pdurbin
Copy link
Member

pdurbin commented Jan 24, 2017

@otter606 I added some curl examples for "add" and "replace" in cc1d614 which you can see at http://guides.dataverse.org/en/2290-file-replace/api/native-api.html . I left the Python examples in there.

@pdurbin
Copy link
Member

pdurbin commented Jan 25, 2017

@michbarsinai linked to the HTTP 1.1 RFC (2616) which I think is a fine place for guidance on POST vs. PUT in the context of the new "replace" endpoint. (I think we all agree that POST is fine for the new "add" endpoint.) Here's how each section starts:

To me, PUT feels wrong for "replace" because what's in the URI simply identifies the database id of the file to be replaced but content of the file (md5) is not actually changed. What we're really doing is uploading a file with different content (different md5) and that newly uploaded file will get a new database id (the next id available).

As @michbarsinai says, perhaps we should make sure we're returning 201 instead of 200 since something new is being created, a new database id of a file, for both "add" and "replace". Pull request #3579 is currently in QA, however, so we aren't touching the branch at this time. Also, as @michbarsinai mentioned, API users aren't exactly clamoring for 201 return codes. To me, the important thing is that we don't break people's code by changing the API. See the recent Spec-ulation Keynote by Rich Hickey for more on not breaking people's code. 😄

I'll end by showing the diagram @raprasad drew on my whiteboard after absorbing what @michbarsinai was saying about if the end URL is known or not.

Verb End URL Known Example
POST no create dataset
PUT yes edit dataset metadata

@kcondon
Copy link
Contributor

kcondon commented Jan 26, 2017

Works, issue with not respecting file upload size limit but that is mentioned in #2290 and being worked on/ updated there.

@kcondon kcondon closed this as completed Jan 26, 2017
@kcondon kcondon self-assigned this Jan 26, 2017
pdurbin added a commit that referenced this issue Jan 26, 2017
@pdurbin
Copy link
Member

pdurbin commented Jan 26, 2017

I tested and documented the forceReplace boolean in 2a56e58.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests