Fix singlepart direct upload #8

Merged
merged 15 commits into main from fix-singlepart-direct-upload
Mar 4, 2024

Conversation

JR-1991
Member

@JR-1991 JR-1991 commented Feb 12, 2024

Overview

Issue #7 reported that direct upload of a single file (not multipart) to S3 storage raises a Not Implemented error on the AWS side. The problem is related to streaming files when POSTing to the S3 storage. To fix it, the file_sender function has been removed and replaced with a plain open call to upload a file. This PR also improves the printed output and adds an option to force native upload.

Changes

  • Use open instead of file_sender for file uploads.
  • Use a progress bar for file preparation to indicate hashing progress (useful if there are many files).
  • Option to force native upload, even when direct upload is activated.
  • If more than 50 files are uploaded, progress bars are removed upon finishing.
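
A minimal sketch of the first change above, assuming the upload target is a presigned URL that accepts a plain PUT. It uses only the standard library rather than the HTTP client this project actually uses, and build_put_request and upload_singlepart are hypothetical names, not functions from this PR. The idea is that an opened file has a known size, so the request can carry an explicit Content-Length instead of being streamed with chunked transfer encoding. The PR does not state the exact cause of the Not Implemented error, so treat the chunked-encoding explanation as an assumption.

```python
import os
import urllib.request


def build_put_request(url, fh):
    """Build a PUT request whose body is an open binary file handle.

    Hypothetical helper: setting Content-Length explicitly keeps urllib
    from falling back to chunked transfer encoding for file-like bodies.
    """
    size = os.fstat(fh.fileno()).st_size
    req = urllib.request.Request(url, data=fh, method="PUT")
    req.add_header("Content-Length", str(size))
    return req


def upload_singlepart(url, path):
    """Upload one file in a single request and return the HTTP status.

    Assumes `url` is a presigned upload URL obtained elsewhere.
    """
    with open(path, "rb") as fh:
        req = build_put_request(url, fh)
        with urllib.request.urlopen(req) as resp:
            return resp.status
```

Splitting request construction from sending keeps the size handling easy to check without touching the network.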

Closes

closes #7

@JR-1991 JR-1991 added the bug Something isn't working label Feb 12, 2024
@JR-1991 JR-1991 self-assigned this Feb 12, 2024
@JR-1991 JR-1991 mentioned this pull request Feb 12, 2024
@DonRichards

That worked!!! Thanks for this.
Screenshot from 2024-02-12 16-34-09

@DonRichards

I haven't tried those updates yet but I have noticed a significant slowdown with the "Registering files".

Screenshot of the Registering files

@JR-1991
Member Author

JR-1991 commented Feb 21, 2024

@DonRichards, sorry for the delay in response. Yes, this is a bottleneck, unfortunately. I have tried to extend the maximum concurrency of registration tasks, but it failed. Dataverse likely struggles to process many requests simultaneously and simply errors out if there are too many.

I have added a soft fix for this by allowing requests to be retried upon failure. Although this does not guarantee a speed-up, it may improve performance slightly. Would you mind trying it out to see if it helps in your case?

If this is still too slow, an option would be to divide your files into multiple tar archives and upload each. This way, there are fewer requests to process.
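
The retry approach described above could look roughly like the following sketch. It is not the code from this PR: with_retries, register_all, and the register_one callback are hypothetical names, and the concurrency limit, attempt count, and backoff delay are illustrative values that would need tuning per Dataverse instance.

```python
import asyncio


async def with_retries(make_call, *, attempts=3, base_delay=0.5):
    """Await make_call(), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


async def register_all(files, register_one, *, max_concurrency=4,
                       attempts=3, base_delay=0.5):
    """Register every file with bounded concurrency and per-file retries.

    `register_one` is an assumed async callable that registers one file.
    Results are returned in the same order as `files`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(file):
        # The semaphore caps how many registrations run at once.
        async with semaphore:
            return await with_retries(
                lambda: register_one(file),
                attempts=attempts,
                base_delay=base_delay,
            )

    return await asyncio.gather(*(guarded(f) for f in files))
```

Keeping the concurrency cap low trades throughput for fewer dropped connections, which matches the observation that the server errors out under many simultaneous requests.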

@DonRichards

Any suggestions on how to trace why the registration of files has stopped working suddenly? Is there a way to see what's causing the registration to fail?
"An error occurred with uploading: Connector is closed."

@JR-1991
Member Author

JR-1991 commented Feb 27, 2024

@DonRichards this is most likely due to Dataverse shutting down the connection because of too many requests. I am still trying to find a sweet spot, but it varies greatly between instances. The actual error can only be traced in the logs of your Dataverse instance.

@JR-1991
Member Author

JR-1991 commented Feb 29, 2024

@DonRichards good news! I have talked to the Dataverse Dev Team, and there is a way to register bulk data at Dataverse without requiring a request per file. Hence, the registration is now way faster and more stable.

I have just pushed the changes to this PR, after testing them locally with 10k small files without any issues. Do you mind testing the updated PR?
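
As a rough illustration of registering many files with one request instead of one request per file, the sketch below builds a single JSON payload per batch. The comment above does not name the endpoint, so this assumes Dataverse's addFiles API, which takes a JSON array in a jsonData form field; the field names follow that API as I understand it and should be checked against the Dataverse documentation. build_bulk_payload, batched, and the input dictionary keys are hypothetical, and the batch size is an arbitrary example.

```python
import json


def build_bulk_payload(files):
    """Serialize many uploaded-file descriptions into one JSON array.

    Each input dict uses hypothetical keys (storage_identifier, name,
    mime_type, checksum_type, checksum_value); the output keys are the
    ones the addFiles endpoint is assumed to expect.
    """
    entries = []
    for f in files:
        entries.append({
            "storageIdentifier": f["storage_identifier"],
            "fileName": f["name"],
            "mimeType": f.get("mime_type", "application/octet-stream"),
            "checksum": {
                "@type": f["checksum_type"],
                "@value": f["checksum_value"],
            },
        })
    return json.dumps(entries)


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each payload would then be sent once per batch, for example as the jsonData field of a POST to the dataset's addFiles endpoint, so the number of requests grows with the number of batches rather than the number of files.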

@DonRichards

Tested it with batches of 200 files at a time and it works as expected.

@JR-1991
Member Author

JR-1991 commented Mar 1, 2024

@DonRichards thanks for testing! Does this resolve your issue #7?

@DonRichards

I do believe so. Thanks! I really appreciate the work.

@JR-1991
Member Author

JR-1991 commented Mar 4, 2024

@DonRichards perfect! I will merge this PR then to close issue #7.

@JR-1991 JR-1991 merged commit 3a4df04 into main Mar 4, 2024
4 checks passed
@JR-1991 JR-1991 deleted the fix-singlepart-direct-upload branch May 12, 2024 00:10
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Message 'Not Implemented'
2 participants