Uploading over 2 gb #4

jmjamison opened this issue Jun 8, 2020 · 4 comments

@jmjamison

I'm trying to use DVUploader for large geodatabases (gdb files). It works fine with files around 2 GB, but for anything larger, for example 2.87 GB, I get:
Jun 08, 2020 1:32:54 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://dataverse.ucla.edu:443: Software caused connection abort: socket write error
Jun 08, 2020 1:32:54 PM org.apache.http.impl.execchain.RetryExec execute

The storage is an AWS S3 bucket, and I've raised :MaxFileUploadSizeInBytes to 8 GB, but that doesn't seem to help.
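
(For anyone following along: a minimal sketch of how that setting can be raised through the Dataverse admin settings API, assuming it is called from the Dataverse host itself - the documented equivalent is a simple curl PUT. 8 GB is 8 * 1024^3 = 8589934592 bytes.)

// Sketch only: raise :MaxFileUploadSizeInBytes to 8 GB via the admin settings API.
// Normally done with a curl PUT from the server, since /api/admin is usually
// blocked for outside hosts.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RaiseUploadLimit {
    public static void main(String[] args) throws Exception {
        long eightGiB = 8L * 1024 * 1024 * 1024;      // 8589934592 bytes
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes"))
                .PUT(HttpRequest.BodyPublishers.ofString(Long.toString(eightGiB)))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(put, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}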

@qqmyers
Member

qqmyers commented Jun 8, 2020

FWIW: The current DVUploader is limited to < 5 GB on AWS S3 buckets when using direct upload (because AWS doesn't allow uploads above that without splitting the upload into multiple pieces). I'm currently testing code to use multipart uploads that will remove that limit.
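
(For illustration only - this is not the actual DVUploader code - here is a minimal sketch of the multipart approach using the AWS SDK for Java's TransferManager, which splits anything over a threshold into parts and so isn't subject to the 5 GB single-PUT limit. Bucket, key, and file names are made up.)

// Illustrative sketch of multipart upload with the AWS SDK for Java v1.
// TransferManager splits files above the threshold into parts, so the
// 5 GB single-PUT limit no longer applies.
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class MultipartExample {
    public static void main(String[] args) throws Exception {
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(AmazonS3ClientBuilder.defaultClient())
                .withMultipartUploadThreshold(64L * 1024 * 1024) // use parts above 64 MB
                .build();
        Upload upload = tm.upload("my-dataverse-bucket", "temp/mydata.gdb.zip",
                new File("mydata.gdb.zip"));
        upload.waitForCompletion(); // parts are uploaded and completed behind the scenes
        tm.shutdownNow();
    }
}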

That said, 2.87 GB should work. Are you using direct upload? If not, my guess would be that some software in your setup has a timeout that is cutting off the upload - either the web server, the AJP connection to Glassfish, a load balancer, etc. Or it could be that you're running out of space in Dataverse's temp directory (it will hold two temporary copies somewhere on disk). If you are using direct upload, I'm not sure what could be timing out - possibly a proxy server if you use one.
One way you might get some info on which software is timing out: the response headers - visible in the browser console (when uploading via the Dataverse UI rather than DVUploader) or with curl's -v option - usually include a Server: entry. If something is timing out, that Server: entry will show which component responded. For example, I think we saw the AWS load balancer responding when QDR had timeouts (versus 'Server: Apache' for successful calls).
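
(If curl isn't handy, the same check can be scripted - a rough sketch, with a made-up hostname, that just prints whatever Server: header comes back from a Dataverse endpoint:)

// Rough sketch: print the Server: response header to see which layer
// (Apache, a load balancer, a proxy, ...) actually answered the request.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WhoAnswered {
    public static void main(String[] args) throws Exception {
        HttpResponse<Void> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(
                        URI.create("https://dataverse.example.edu/api/info/version")).build(),
                HttpResponse.BodyHandlers.discarding());
        System.out.println("Server: " + resp.headers().firstValue("Server").orElse("(not set)"));
    }
}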

I'm not sure that DVUploader reports that information. It does write more information to its log file than it prints to the console, so there may be a clue there. If not, you may want to try the Dataverse UI or curl as a way to debug (and we may want to add more debug info to DVUploader - if it turns out not to be a timeout issue, I can certainly go into DVUploader and see what other information we could print out when a failure happens).

@jmjamison
Author

At your suggestion I tried uploading from the Dataverse UI and got a size error, so that gives me somewhere to start looking. Thank you for the suggestions. If/when I track this down I'll post the answer here in case someone else runs into this.

@pkiraly

pkiraly commented May 19, 2021

I have a similar problem. :MaxFileUploadSizeInBytes is not set, so according to the Dataverse manual: "If the MaxFileUploadSizeInBytes is NOT set, uploads, including SWORD may be of unlimited size." We use a normal file system, not S3. When I try uploading an 8 GB file, I get the following error on the client side:

PROCESSING(F): oa_status_by_doi.csv.gz
               Does not yet exist on server.
May 19, 2021 6:01:44 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://test.data.gro.uni-goettingen.de:443: Broken pipe (Write failed)
May 19, 2021 6:01:44 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://test.data.gro.uni-goettingen.de:443
May 19, 2021 6:17:15 PM org.apache.http.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {s}->https://test.data.gro.uni-goettingen.de:443: Broken pipe (Write failed)
May 19, 2021 6:17:15 PM org.apache.http.impl.execchain.RetryExec execute
INFO: Retrying request to {s}->https://test.data.gro.uni-goettingen.de:443
...

In the server log I found these errors:

[2021-05-19T18:16:45.246+0200] [Payara 5.2021.1] [SEVERE] [] [] [tid: _ThreadID=99 
      _ThreadName=http-thread-pool::jk-connector(1)] [timeMillis: 1621441005246] [levelValue: 1000] [[
  java.io.IOException: java.lang.InterruptedException
	at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:68)
        ....
	at edu.harvard.iq.dataverse.api.ApiBlockingFilter.doFilter(ApiBlockingFilter.java:168)
        ....

then

[2021-05-19T18:16:45.249+0200] [Payara 5.2021.1] [SEVERE] [] [edu.harvard.iq.dataverse.api.errorhandlers.ThrowableHandler]
 [tid: _ThreadID=99 _ThreadName=http-thread-pool::jk-connector(1)]
 [timeMillis: 1621441005249] [levelValue: 1000] [[
  _status="ERROR";
  _code=500;
  _message="Internal server error. More details available at the server logs.";
  _incidentId="65718191-522f-4ef0-be10-df3b471d0534";
  _interalError="IOException";
  _internalCause="InterruptedException";
  _requestUrl="https://test.data.gro.uni-goettingen.de/api/v1/datasets/:persistentId/add?persistentId=...&key=...";
  _requestMethod="POST"|]]

(I removed identifiers from this snippet and added some formatting.)

@jmjamison
Author

Keep in mind that I'm a user, not a developer. That said, I was able to manage large uploads by setting up a direct-upload S3 store.
As I understand the problem: using the web interface, uploads go through temp storage on the way to the S3 store, and that temp storage runs out of space. [Here is where a developer can jump in and correct my description.] Hope this helps some.
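
(Rough sketch of the idea, in case it helps: with direct upload the client gets a presigned S3 URL and sends the bytes straight to S3, so nothing has to be staged in the application server's temp directory. The URL below is just a placeholder for whatever presigned URL the direct-upload workflow hands back.)

// Rough sketch of the direct-upload idea: the client PUTs the file bytes
// straight to a presigned S3 URL, so the Dataverse server never stages a
// temporary copy of the file on its own disk. The presigned URL here is a
// placeholder, not a real value.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class DirectPut {
    public static void main(String[] args) throws Exception {
        URI presignedUrl = URI.create("https://my-bucket.s3.amazonaws.com/key?X-Amz-Signature=placeholder");
        HttpRequest put = HttpRequest.newBuilder(presignedUrl)
                .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("mydata.gdb.zip")))
                .build();
        HttpResponse<Void> resp = HttpClient.newHttpClient()
                .send(put, HttpResponse.BodyHandlers.discarding());
        System.out.println("S3 responded: " + resp.statusCode());
    }
}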
