Uploading files larger than 2GB does not work #137

Open
pallinger opened this issue Jul 14, 2021 · 10 comments
Labels
status:incoming Newly created issue to be forwarded type:bug Something isn't working

Comments

@pallinger

pallinger commented Jul 14, 2021

Bug report

1. Describe your environment

  • OS: Debian 10 (buster) 64bit
  • pyDataverse: 0.3.1
  • Python: 3.7.3
  • Dataverse: 4.20-dev

2. Actual behaviour:

Trying to upload a file larger than 2GB causes an error. Uploading the same file using curl works fine.

3. Expected behaviour:

The file is uploaded, or at least the library reports that the upload will not work because the file is too big.

4. Steps to reproduce

The program and stack trace are as follows:

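from pyDataverse.api import NativeApi
from pyDataverse.models import Datafile

# SERVER_URL, API_KEY, ID_OF_EXISTING_DATASET and PATH_TO_FILENAME_OF_BIG_FILE
# are placeholders for your own values.
api = NativeApi(SERVER_URL, API_KEY)
ds_pid = ID_OF_EXISTING_DATASET
df_filename = PATH_TO_FILENAME_OF_BIG_FILE  # a file larger than 2 GB

df = Datafile()
df.set({"pid": ds_pid, "filename": df_filename})
api.upload_datafile(ds_pid, df_filename, df.json())  # raises the OverflowError below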

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 1685, in upload_datafile
    url, data={"jsonData": json_str}, files=files, auth=True
  File "/usr/local/lib/python3.7/dist-packages/pyDataverse/api.py", line 174, in post_request
    resp = post(url, data=data, params=params, files=files)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 116, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.7/http/client.py", line 1260, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1069, in _send_output
    self.send(chunk)
  File "/usr/lib/python3.7/http/client.py", line 991, in send
    self.sock.sendall(data)
  File "/usr/lib/python3.7/ssl.py", line 1015, in sendall
    v = self.send(byte_view[count:])
  File "/usr/lib/python3.7/ssl.py", line 984, in send
    return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes

5. Possible solution

Some possible solutions (streaming upload or chunk-encoded request) are described here:

https://stackoverflow.com/questions/53095132/how-to-upload-chunks-of-a-string-longer-than-2147483647-bytes
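
As a rough illustration of the streaming idea, here is a minimal sketch that builds the multipart/form-data body piece by piece instead of as one huge byte string. It uses only the standard library. The function name iter_multipart and its parameters are made up for this example and are not part of pyDataverse. The field name "jsonData" comes from the traceback above; the field name "file" is an assumption.

```python
import os
from typing import Dict, Iterator


def iter_multipart(
    fields: Dict[str, str],
    file_field: str,
    file_path: str,
    boundary: str,
    chunk_size: int = 1024 * 1024,
) -> Iterator[bytes]:
    """Yield a multipart/form-data body in small pieces (hypothetical helper)."""
    # Plain text form fields first, e.g. the jsonData metadata.
    for name, value in fields.items():
        yield (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode("utf-8")

    # Header of the file part.
    filename = os.path.basename(file_path)
    yield (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")

    # File content, read chunk by chunk so it never sits in memory as a whole.
    with open(file_path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            yield chunk

    # Closing boundary.
    yield f"\r\n--{boundary}--\r\n".encode("utf-8")
```

requests accepts a generator as data= and then sends the request chunk-encoded, so the generator could be passed to requests.post together with a header "Content-Type: multipart/form-data; boundary=<the same boundary>". Whether a given Dataverse installation (or a proxy in front of it) accepts chunk-encoded request bodies is not tested here.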

I am not very versed in python, but I will try to fix this in the following week, and submit a pull request. If I fail, feel free to fix this bug!

@pallinger pallinger added status:incoming Newly created issue to be forwarded type:bug Something isn't working labels Jul 14, 2021
@jmjamison

Forgive me if this isn't relevant. For uploading really large files (in my case Lidar data) I use an S3 bucket set up for direct upload. That doesn't work with pyDataverse, but for uploading really large files individually a direct-upload bucket is helpful.

@pallinger
Author

pallinger commented Jul 15, 2021

I understand that this is not relevant for you. However, if the dataverse installation in question does not use an s3 storage backend, then this becomes instantly relevant.

@skasberger
Member

The issue is that I am on parental leave right now (until May 2022), and we at AUSSDA do not use S3, so I cannot test this.

The best way to move forward would be to resolve the issue yourselves.

@poikilotherm
Member

We also just ran into this. From looking at the Dataverse side, uploads using multipart/form-data should be available.

For the sending side, looks like "requests-toolbelt" has something we could use: https://toolbelt.readthedocs.io/en/latest/uploading-data.html

Maybe it would be good to detect the filesize and either go for a normal upload when <2GB or multipart for larger?

(I don't have the capacity right now to look into this.)
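
A minimal sketch of the size check suggested above, assuming the limit is the 2147483647 bytes named in the OverflowError. The function name, the returned labels and the overhead allowance for the multipart framing are all made up for illustration; none of this is pyDataverse API.

```python
import os

# Largest body the single SSL write in the traceback could handle.
MAX_SINGLE_WRITE = 2**31 - 1  # 2147483647 bytes


def choose_upload_strategy(
    file_path: str,
    limit: int = MAX_SINGLE_WRITE,
    overhead: int = 64 * 1024,
) -> str:
    """Return "in_memory" or "streaming" depending on the file size.

    `overhead` is a guessed allowance for multipart boundaries and the
    jsonData field, which are sent in the same request body as the file.
    """
    size = os.path.getsize(file_path)
    return "streaming" if size + overhead > limit else "in_memory"
```

The same check could instead raise a clear error for oversized files, which would at least cover the "say that this will not work" part of the expected behaviour.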

@pdurbin
Member

pdurbin commented Aug 26, 2022

Can this bug be reproduced at https://demo.dataverse.org ? Currently the file upload limit there is 2.5 GB, high enough for a proper test, it would seem.

@skasberger
Member

Also related to #136

@skasberger
Member

Update: I left AUSSDA, so my funding for pyDataverse development has stopped.

I want to get some basic funding to implement the most urgent updates (PRs, bug fixes, maintenance work). If you can support this, please reach out to me (www.stefankasberger.at). The same applies if you have feature requests.

Another option would be for someone else to help with the development and/or maintenance. For this, also get in touch with me (or comment here).

@poikilotherm
Member

I know I shall not expect movement here (unless someone else picks it up or we find funding).

But to not let newly found insights slip away and for what it's worth: how about exchanging requests for aiohttp?

I know aiohttp is much larger as a dependency, but it does support multipart uploads. https://docs.aiohttp.org/en/stable/multipart.html

@qqmyers
Member

qqmyers commented Jan 22, 2023

Not sure that helps out-of-the-box since our multipart direct upload involves contacting Dataverse to get signed URLs for the S3 parts, etc. FWIW, I think @landreev implemented our mechanism in python, it just hasn't been integrated with pyDataverse.

@poikilotherm
Member

poikilotherm commented Jan 22, 2023

@qqmyers you are right - direct upload needs more. Maybe one day we also extend pyDataverse for this.

That said: this issue here is about uploading with simple HTTP upload via API. As requests is not capable of using multipart upload, you are limited to 2GB filesize (same limitation as our SWORD 2.0 library). The API endpoint itself is capable of using multipart uploads.


6 participants