DSpace S3 and local storage customization #467

Closed
27 of 28 tasks
milanmajchrak opened this issue Dec 6, 2023 · 0 comments
milanmajchrak commented Dec 6, 2023

Original issue: ufal#1065

Use cases:

  • S3 with CESNET must be configured in the clarin-dspace.cfg
  • S3 CESNET configuration must be documented
  • Upload a regular file
  • Upload a file of tens of GBs (e.g. 6 GB; 5 GB is the CESNET limit) - the bitstream is added to the DB after the upload
  • Download as a normal user
  • Download as an admin
  • Download as an anonymous user
  • Pause and resume a download
  • Delete a bitstream -> run cleanup -> the file should be removed from S3
  • Create a new version of the Item and check the storeNumber; if the bitstream was added to S3, try to download a bitstream from the new version
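For reference, an S3 assetstore block in clarin-dspace.cfg might look roughly like the following. The keys follow the assetstore.s3.* convention used by DSpace 7, but the exact property names (in particular any endpoint override for CESNET) are assumptions and must be checked against the installed version:

```ini
# Hypothetical S3 assetstore configuration for CESNET (key names are illustrative)
assetstore.s3.enabled = true
assetstore.s3.bucketName = clarin-dspace-assetstore
assetstore.s3.awsAccessKey = <access-key>
assetstore.s3.awsSecretKey = <secret-key>
# Non-AWS endpoint for the CESNET object storage (verify the real property name)
assetstore.s3.endpoint = https://<cesnet-s3-endpoint>
```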

Questions:

  • 0. Are any S3 issues fixed in DSpace 7.6.1?
  • 1. Is the bitstream first stored locally and synchronized to S3 at some later point?
  • 2. If a user uploads a 40 GB file, does the system need the storage capacity to hold it temporarily?
  • 3. Can frequently accessed bitstreams be cached on the repository system or is a download not going through the repository system at all?
  • 4. Is the bitstream’s checksum computed by the repository system or fetched from S3 metadata?
  • 5. Is there an option to use both a local assetstore and an S3 assetstore?
  • 6. What happens when the software can’t connect to S3, or when the connection fails during an upload/download? Does the user notice? Is it possible to resume?
  • 7. When a curation task runs, is the bitstream first downloaded locally?
  • 8. When is the checksum created?

Answers:

  1. Yes, the file is copied into a local (Tomcat) temp file and then uploaded to S3 as a multipart upload
  2. Yes, the system needs 40 GB of capacity, because the file is copied into the tomcat/temp folder.
  3. Bitstreams are not cached; every download is streamed through the repository.
  4. The object is fetched from S3 together with its checksum, and the checksum is then read from that object. The checksum itself is computed before upload.
  5. No
  6. I started an upload and then unplugged the Ethernet cable. The user sees an "Upload failed" message, and the upload must be restarted from the beginning.
  7. Which curation task?
  8. S3: the checksum is computed before upload; the stored checksum value is fetched from S3.
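The streaming behaviour described in answers 3, 4, and 8 can be sketched with a plain-Java digest stream. This is an illustration only, not DSpace's actual S3BitStoreService code: the MD5 digest accumulates while the bytes are read, so the checksum is known before the object lands in S3.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class ChecksumSketch {

    // Compute the MD5 digest of a stream while it is being consumed,
    // so the checksum is available before/while the bytes go to S3.
    static String md5While(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buf = new byte[8192];
            while (dis.read(buf) != -1) {
                // here the chunk would be handed to the S3 multipart uploader
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    public static void main(String[] args) throws Exception {
        // prints 5d41402abc4b2a76b9719d911017c592, the MD5 of "hello"
        System.out.println(md5While(new ByteArrayInputStream("hello".getBytes())));
    }
}
```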

TODO:

  • Configure S3 with CESNET
  • Try upload/download/delete some file from CESNET
  • Extend S3BitStoreService and update the store and remove methods to also add/delete data in the local assetstore
  • Create tests
  • Admin UI - bitstream
  • Create synchronization checker between local and S3 assetstore
  • Healthcheck
  • ChecksumChecker
  • Cherry-pick the fix (Further S3 large file optimization, DSpace/DSpace#8500) from vanilla 7.6.
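The "extend S3BitStoreService to also write to the local assetstore" item above can be sketched as a composite store that mirrors every store/remove to both backends. The interface and class names below are hypothetical stand-ins, not the real DSpace API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal interface standing in for DSpace's BitStoreService.
interface SimpleStore {
    void put(String id, byte[] data) throws Exception;
    void remove(String id) throws Exception;
}

// Trivial in-memory backend, used here only to demonstrate the wiring.
class MapStore implements SimpleStore {
    final Map<String, byte[]> data = new HashMap<>();
    public void put(String id, byte[] b) { data.put(id, b); }
    public void remove(String id) { data.remove(id); }
}

// Sketch of the "S3 + local" idea: every store/remove is applied
// to S3 and mirrored to the local assetstore.
class SyncedStore implements SimpleStore {
    private final SimpleStore s3;
    private final SimpleStore local;

    SyncedStore(SimpleStore s3, SimpleStore local) {
        this.s3 = s3;
        this.local = local;
    }

    public void put(String id, byte[] data) throws Exception {
        s3.put(id, data);     // primary copy in S3
        local.put(id, data);  // mirrored copy in the local assetstore
    }

    public void remove(String id) throws Exception {
        s3.remove(id);        // cleanup must delete from both stores
        local.remove(id);
    }
}

public class StoreSketch {
    public static void main(String[] args) throws Exception {
        MapStore s3 = new MapStore();
        MapStore local = new MapStore();
        SimpleStore store = new SyncedStore(s3, local);
        store.put("bitstream-1", new byte[]{1, 2, 3});
        System.out.println(s3.data.keySet() + " " + local.data.keySet());
    }
}
```

A real implementation would also need the synchronization checker from the TODO list to reconcile the two stores when one write succeeds and the other fails.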

NOTE:

  • Checksum CRON job - it appears to work properly: if the file is changed, the most_recent_checksum result changes to CHECKSUM_NO_MATCH

S3 CESNET limits:

  • Maximum upload file size

Wiki: https://github.com/dataquest-dev/DSpace/wiki/S3-%E2%80%90-CESNET

milanmajchrak self-assigned this Dec 7, 2023