
Pluggable transfers for invenio-records-resources #76

Draft · wants to merge 8 commits into base: master
Conversation


@mesemus mesemus commented Apr 2, 2024

❤️ Thank you for your contribution!

Description

This pull request discusses making the transfer types extensible, so that new transfer types can be plugged in.

Checklist

Ticks in all boxes and 🟢 on all GitHub actions status checks are required to merge:

Third-party code

If you've added third-party code (copy/pasted or new dependencies), please reach out to an architect.

Reminder

By using GitHub, you have already agreed to GitHub's Terms of Service, including that:

  1. You license your contribution under the same terms as the current repository’s license.
  2. You agree that you have the right to license your contribution under the current repository’s license.

@mesemus mesemus marked this pull request as draft April 2, 2024 12:13
@slint slint (Member) left a comment


Thanks a lot for the implementation and the nice write-up of the questions! I've answered most, but left some that might need a 3rd-party opinion. I browsed through the PRs as well, but I'd prefer to thoroughly review after seeing all the questions/comments here closed.

Comment on lines +78 to +86
- [ ] Clarify the "storage_class" vs "transfer_type"

It seems that inside invenio_files the storage_class was meant to specify
the details of the storage backend (as the tests suggest, at least) - for example
'A' as 'Archive'.

In invenio_records_resources, the storage_class on the initial schema means the type of
transfer that will be used (Local, Fetch, Remote). This is independent of the storage
backend and is a bit confusing.
Member

👍 We should really make it clear that transfer_type is the canonical term, both for user input in the Marshmallow schema and throughout the API and data model. I think that also means that we should store transfer_type in the DraftFileRecord.json or even a new DraftFileRecord.transfer_type column... My feeling is that storage_class was "overloaded"/multi-purposed to avoid having to store the transfer type in a new field (which is not an ideal solution given the semantics).

This rename is a backward-incompatible change, but given the low usage of the feature (and the fact that we'll probably go with a major version bump) I think it's reasonable to get it right now.


storage_class is something that the user shouldn't be able to control or know about (though IIRC there were discussions in the workshop about having a choice for it in the upload UI).
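
To make the rename concrete, here is a minimal sketch of what the user-facing init schema could look like once transfer_type is the canonical term. Class name, field names, and the single-letter constants are illustrative assumptions, not the actual invenio-records-resources schema:

```python
# Illustrative sketch only: "transfer_type" becomes the user-facing field,
# while the storage class stays a system-managed backend detail.
from marshmallow import Schema, fields, validate


class FileInitSchema(Schema):  # hypothetical name
    key = fields.Str(required=True)
    transfer_type = fields.Str(
        load_default="L",  # assume "L" = local upload by default
        validate=validate.OneOf(["L", "F", "R", "M"]),  # Local/Fetch/Remote/Multipart
    )
    # No "storage_class" here on purpose: it is a backend detail the user
    # should not control or even know about.
```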

Member

The original intention of the storage class was to be able to record whether e.g. a file was online/offline, so you could use it to decide whether the file should be fetched, check its integrity, etc. It was never really used, however.

Comment on lines +88 to +100
- [ ] FileSchema.uri vs. links section

For the initialization of files, 'uri' can be used to specify the location of the file
for REMOTE/FETCH transfers. Later on, clients can use this URI for these transfers to get
direct access to the referenced data.

When "dumping" the file record, a links section with a `content` URL is created.
This is a bit confusing, as clients do not know which URL should be used.

Proposal: keep the uri only inside the initial file record and do not serialize it
afterwards. Clients should use the links section for all URIs. The content
URL can return an HTTP 302 redirect to the actual file location; this way counters
will work as well.
Member

Agree on not exposing uri, and keeping links.content pointing to /api/records/{id}/files/{filename}/content to be able to keep track of usage statistics (cc @tmorrell I've updated the File Linking API discussion to take usage statistics into account).

Some things to keep in mind:

  • For FETCH, the uri is stored to know the location from where to fetch the file, that will later be stored as a regular file and have a FileInstance.uri of a path "on disk".
  • For REMOTE, the uri is actually passed all the way through and stored in the FileInstance.uri (see also the PoC PR for RemoteTransfer).
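
For illustration, the redirect behaviour for REMOTE files could look roughly like this. This is a Flask-level sketch only; the function, the transfer-type constant, and the attributes on the file record are assumptions, not the actual resource implementation:

```python
# Sketch only: keep serving /api/records/{id}/files/{filename}/content so
# usage statistics are counted, and 302-redirect when FileInstance.uri points
# at a remote location.
from flask import redirect, send_file


def get_file_content(file_record):
    """Serve a record file, redirecting to remote storage when applicable."""
    if file_record.transfer_type == "R":  # REMOTE (assumed constant)
        # For REMOTE transfers the FileInstance.uri is the external location.
        return redirect(file_record.file.uri, code=302)
    # Local/fetched files are streamed from disk as before.
    return send_file(file_record.file.uri, as_attachment=True)
```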

Author

+1, will hide it


Proposal: either change the meaning of "Locality" and keep the permissions as they are
or change the permissions to take into account all the files. Example from RDM that I do
not understand (can_draft_set_content_files - why there is the can_review permission and
Member

can_review is a bit convoluted since it also represents community review request curators who have edit rights to the unpublished draft (e.g., for fixing metadata, or possibly correcting file formats from XLS to CSV, etc.)

Comment on lines +102 to +110
- [ ] FETCH transfer type and private URIs

The FETCH transfer type is used to fetch the file from a remote location. The uri
of the external location is visible to anyone who has read access to the record,
not just the uploader. As the uri might contain credentials (e.g. an S3 signed URL),
this might be a security issue.

Proposal: do not serialize the uri of the FETCH transfer type. The fetch_file task
would have to be changed so that it does not go through the service but fetches the uri directly.
Member

As mentioned above, agree on hiding uri. There's still a question of whether we should also completely remove the uri after the fetch, since one might want to know the original provenance of the fetched file.

But I think this is out of scope for this RFC (unless we see it heavily affecting the implementation).
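
For the in-scope part (not serializing the uri), the schema-level change could be as small as marking the field load-only. A minimal sketch, assuming uri sits on the file schema; the class and field layout are illustrative:

```python
from marshmallow import Schema, fields


class FileSchema(Schema):  # hypothetical, for illustration only
    key = fields.Str(required=True)
    # Accepted on initialization for FETCH/REMOTE transfers, but never
    # dumped back, so signed URLs or embedded credentials stay private.
    uri = fields.Str(load_only=True)
```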


The same problem is in invenio-rdm-records/services/generators.py

Proposal: either change the meaning of "Locality" and keep the permissions as they are
Member

Agree, the definition of "Locality" here is flawed/confusing and was probably labeled as such to be able to enforce the FETCH workflow via permissions, i.e. that while fetching the file, only the system (via the background Celery task) can set the contents of the draft file.

I think the more logical and explicit way to deal with permissions, would be to have some sort of "switch"-style permissions generator for each transfer type, e.g. something like:

can_draft_set_content_files = [
    IfTransferType({
        TransferType.LOCAL: can_review,
        TransferType.MULTIPART: [Disabled()],
        TransferType.FETCH: [SystemProcess()],
        TransferType.REMOTE: [Disabled()],
    })
]

I see some problems though:

  • In OOP, having a switch on an enum is a bit of a code smell and hints towards actually moving the code implementation onto each class...
    • On the other hand, permissions should be configurable by each instance, so if we moved the permissions logic to the BaseTransfer implementation for each transfer type, end-users wouldn't be able to customize them easily.
  • When adding a new transfer type or modifying permissions for an existing type, one has to do a lot of work to update all the files-related permissions, without a good solution for reusing/extending existing values.

An elegant solution might be to have some sort of helper class that auto-generates and bundles permissions for each transfer type and action, and can be easily extended/overridden.

So to move forward, a possible path would be:

  1. Go with the explicit IfTransferType generator proposed above, which solves the problem
  2. See how it feels for someone implementing a new transfer type in the future
  3. Re-evaluate if we need something fancier
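
For illustration, a rough sketch of what such an IfTransferType generator could look like, assuming the invenio-records-permissions Generator interface. How the transfer type is resolved from the kwargs is exactly the open design question, so that part is a placeholder, not a proposal for the final API:

```python
# Hypothetical sketch, not the PR's implementation.
from invenio_records_permissions.generators import Generator


class IfTransferType(Generator):
    """Delegate to other permission generators based on the transfer type."""

    def __init__(self, mapping, else_=None):
        self._mapping = mapping  # e.g. {TransferType.LOCAL: [...], ...}
        self._else = else_ or []

    def _delegates(self, **kwargs):
        # Assumption: the transfer type is available (or derivable) from the
        # kwargs passed down by the permission policy / service.
        transfer_type = kwargs.get("transfer_type")
        return self._mapping.get(transfer_type, self._else)

    def needs(self, **kwargs):
        return [n for g in self._delegates(**kwargs) for n in g.needs(**kwargs)]

    def excludes(self, **kwargs):
        return [n for g in self._delegates(**kwargs) for n in g.excludes(**kwargs)]
```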


We cannot directly implement this interface in storage backends (for example invenio-s3), as
it would introduce a dependency in the wrong direction.

Proposal: either move the MultipartStorageExt to invenio-files-rest/storage/pyfs or do not check
Member

The semantics/naming of MultipartStorageExt look very much bound to the init/content/commit terminology, so they would look somewhat out of place in invenio-files-rest. Thus I'm leaning more towards keeping a "Protocol"-style coupling.

Another solution might be to move the relevant invenio-s3 parts that implement the MultipartStorageExt interface into this module under "contrib" or similar, and do some try/except around imports for feature detection.
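
As a rough illustration of the "Protocol"-style coupling plus try/except feature detection; the method names and the guarded import below are placeholders, not the interface defined in the PR:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MultipartStorageExt(Protocol):
    """Placeholder protocol; the real method names live in the PR."""

    def multipart_initialize_upload(self, parts, size, part_size):
        ...

    def multipart_commit_upload(self, **kwargs):
        ...


def supports_multipart(storage) -> bool:
    # Duck-typed check: no import-time dependency on invenio-s3 is needed.
    return isinstance(storage, MultipartStorageExt)


# Optional backend with feature detection around the import:
try:
    from invenio_s3 import S3FSFileStorage  # optional; exact import path may differ
except ImportError:
    S3FSFileStorage = None
```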

Author

Ok, let's keep the protocol style, I've made it more readable in the code.

- [ ] MultipartStorageExt
Member

question: Invenio-Files-REST comes with some existing models for storing multipart-related info, e.g. MultipartObject and Part. It also implements the REST API for working with this, but it's very much bound to the Angular.js client-side library we picked at the time. For Zenodo (and all Invenio v3-based services at CERN) we never actually used this implementation, mainly because of performance issues for multipart on our storage cluster at CERN.

I see that you've used ObjectVersionTags for organizing the parts; was there a specific reason for not reusing the MultipartObject/Part classes? In any case, what matters is if it works, but I could also see it as an opportunity to consider purging all the multipart models from files-rest :)

Author

The primary reason for not using MultipartObject was that its uploadId is a UUID, which is not the case for S3 - there it is a long string. From the S3 perspective, only the "MultipartObject" is relevant, as parts are managed directly by calls to S3 and Invenio does not know about them at all.

The implementation of MultipartObject/Part also merges "model"- and "service"-level concerns - the service level being the "generic" implementation for multipart-unaware backends, which I have, for example, at https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR184 - there is this "if the backend implements the contract, call the storage, else use the generic implementation" pattern.

So if the ObjectVersionTag solution is ok for you, I'd be happy to keep it like that.

Btw, having looked at the expired upload task at https://github.com/inveniosoftware/invenio-files-rest/blob/master/invenio_files_rest/tasks.py#L272 - do you think it is necessary to clean up uploads that have not been touched for some time? Right now I only handle delete: if a file is deleted, its multipart upload (if pending) is aborted as well.
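
For reference, the kind of tag-based bookkeeping discussed above could look roughly like this. The tag key and helper names are placeholders, not what the PR actually uses:

```python
# Sketch only: keep the S3 multipart uploadId (an opaque string, not a UUID)
# as a tag on the ObjectVersion instead of a MultipartObject row.
from typing import Optional

from invenio_files_rest.models import ObjectVersion, ObjectVersionTag

UPLOAD_ID_TAG = "multipart:upload_id"  # placeholder key name


def remember_upload_id(object_version: ObjectVersion, upload_id: str) -> None:
    ObjectVersionTag.create(object_version, UPLOAD_ID_TAG, upload_id)


def recall_upload_id(object_version: ObjectVersion) -> Optional[str]:
    tag = ObjectVersionTag.query.filter_by(
        version_id=object_version.version_id, key=UPLOAD_ID_TAG
    ).one_or_none()
    return tag.value if tag else None
```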

Author

I've just extracted the global code into the previously unused interface (going the "protocol" way); IMHO it looks better now. https://github.com/mesemus/invenio-records-resources/blob/multipart-transfers/invenio_records_resources/services/files/transfer/providers/multipart.py
