Pluggable transfers for invenio-records-resources #76
base: master
Conversation
Thanks a lot for the implementation and the nice write-up of the questions! I've answered most, but left some that might need a 3rd-party opinion. I browsed through the PRs as well, but I'd prefer to thoroughly review after seeing all the questions/comments here closed.
> - [ ] Clarify the "storage_class" vs "transfer_type"
>
> It seems that inside invenio_files the `storage_class` was meant to specify the details of the storage backend (as the tests suggest, at least), for example 'A' as 'Archive'.
>
> In invenio_records_resources, the `storage_class` on the initial schema means the type of transfer that will be used (Local, Fetch, Remote). This is independent of the storage backend and is a bit confusing.
👍 We should really make it clear that `transfer_type` is the canonical term, both in terms of user input in the Marshmallow schema, but also throughout the API and data model. I think that also means we should store `transfer_type` in the `DraftFileRecord.json`, or even a new `DraftFileRecord.transfer_type` column... My feeling is that `storage_class` was "overloaded"/multi-purposed to avoid having to store the transfer type in a new field (which is not an ideal solution given the semantics).

This rename is a backward-incompatible change, but given the low usage of the feature (and the fact that we'll probably go with a major version bump) I think it's reasonable to get it right now.

`storage_class` is something that the user shouldn't be able to control or know about (though IIRC there were discussions in the workshop about having a choice for it in the upload UI).
The original intention of the storage class was to be able to record whether e.g. a file was online/offline, so you could use it to decide whether the file should be fetched, to check integrity, etc. It was never really used, however.
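To make the proposed split concrete, here is a minimal pure-Python sketch (all names are illustrative, not the actual invenio-records-resources model) of a dedicated `transfer_type` field living next to a backend-only `storage_class`:

```python
from dataclasses import dataclass
from enum import Enum


class TransferType(str, Enum):
    """Hypothetical canonical transfer-type enum (values illustrative)."""
    LOCAL = "L"
    FETCH = "F"
    REMOTE = "R"


@dataclass
class DraftFileRecord:
    """Toy stand-in for the draft file record discussed above."""
    key: str
    # Explicit transfer semantics instead of overloading storage_class:
    transfer_type: TransferType = TransferType.LOCAL
    # Backend detail only (e.g. "A" for Archive), invisible to end users:
    storage_class: str = "S"


f = DraftFileRecord(key="data.csv", transfer_type=TransferType.FETCH)
```

With a `str`-backed enum, the value still serializes as a short string, so existing dumps that stored a single-letter code remain readable.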
> - [ ] FileSchema.uri vs. links section
>
> For the initialization of files, 'uri' can be used to specify the location of the file for REMOTE/FETCH transfers. Later on, clients can use this URI for these transfers to get direct access to the referenced data.
>
> During "dumping" the file record, a links section with a `content` URI is created. This is a bit confusing, as clients do not know which URL should be used.
>
> Proposal: keep the uri only inside the initial file record. Do not serialize it afterwards. Clients should use the links section for all the URIs. The content URL can return an HTTP 302 redirect to the actual file location. This way counters will work as well.
Agree on not exposing `uri`, and keeping `links.content` pointing to `/api/records/{id}/files/(unknown)/content` to be able to keep track of usage statistics (cc @tmorrell I've updated the File Linking API discussion to take usage statistics into account).

Some things to keep in mind:
- For `FETCH`, the `uri` is stored to know the location from which to fetch the file; the file will later be stored as a regular file and have a `FileInstance.uri` of a path "on disk".
- For `REMOTE`, the `uri` is actually passed all the way through and stored in the `FileInstance.uri` (see also the PoC PR for RemoteTransfer).
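The redirect idea can be sketched without any framework. This is a hypothetical stand-in (the real endpoint would be a Flask/Invenio resource, and `bump_counter` is an invented hook): the usage counter fires on every request, and a remote file is answered with a 302 to its real location:

```python
def content_endpoint(file_record, bump_counter):
    """Sketch of the proposed /content behaviour (names illustrative)."""
    # Usage statistics are recorded before any redirect happens,
    # so counters work for both local and remote files.
    bump_counter(file_record["key"])
    if file_record.get("uri"):
        # REMOTE/FETCH file living elsewhere: redirect to the real location.
        return 302, {"Location": file_record["uri"]}
    # Local file: serve the bytes (here just signalled via a header).
    return 200, {"X-Sendfile": f"/local/{file_record['key']}"}


hits = []
status, headers = content_endpoint(
    {"key": "data.csv", "uri": "https://example.org/data.csv"},
    hits.append,
)
```

Clients only ever see `links.content`; whether the response is a direct download or a redirect stays an implementation detail.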
+1, will hide it
> Proposal: either change the meaning of "Locality" and keep the permissions as they are, or change the permissions to take into account all the files. Example from RDM that I do not understand (can_draft_set_content_files - why there is the can_review permission and
`can_review` is a bit convoluted, since it also represents community review request curators who have edit rights to the unpublished draft (e.g., for fixing metadata, or possibly correcting file formats from XLS to CSV, etc.)
> - [ ] FETCH transfer type and private URIs
>
> The FETCH transfer type is used to fetch the file from a remote location. The uri of the external location is visible to anyone who has read access to the record, not just the uploader. As the uri might contain credentials (e.g. an S3 signed URL), this might be a security issue.
>
> Proposal: do not serialize the uri of the FETCH transfer type. The fetch_file task would have to be changed to not use the service but fetch the uri directly.
As mentioned above, agree on hiding `uri`. There's still a question about whether we should also completely remove the `uri` after the fetch, since one might want to know the original provenance of the fetched file. But I think this is out of scope for this RFC (unless we see it heavily affecting the implementation).
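A rough sketch of how the changed `fetch_file` task could look: the task reads the (possibly credentialed) source URI straight from the stored record instead of going through the public service layer, so the URI never has to appear in any API response. Everything here is a hypothetical stand-in; `open_url` and `save_stream` are injected fakes, not real Invenio APIs:

```python
import contextlib
import io


def fetch_file(record, open_url, save_stream):
    """Fetch the source directly, bypassing the (public) service layer."""
    # Internal field only; with the proposal above it is never serialized.
    source_uri = record["transfer"]["uri"]
    with open_url(source_uri) as stream:
        save_stream(record["key"], stream)
    # Dropping the uri after the fetch is still an open provenance question;
    # this sketch removes it.
    record["transfer"].pop("uri", None)


saved = {}


@contextlib.contextmanager
def fake_open_url(uri):
    # Pretend to open the remote location and yield its bytes.
    yield io.BytesIO(b"payload")


record = {"key": "data.csv", "transfer": {"type": "F", "uri": "https://signed.example/x"}}
fetch_file(record, fake_open_url, lambda key, s: saved.update({key: s.read()}))
```

Because the task holds the URI itself, no permission needs to grant read access to it, which sidesteps the signed-URL leak entirely.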
> The same problem is in invenio-rdm-records/services/generators.py
>
> Proposal: either change the meaning of "Locality" and keep the permissions as they are
Agree, the definition of "Locality" here is flawed/confusing and was probably labeled as such to be able to enforce the `FETCH` workflow via permissions, i.e. that while fetching the file, only the system (via the background Celery task) can set the contents of the draft file.
I think the more logical and explicit way to deal with permissions would be to have some sort of "switch"-style permissions generator for each transfer type, e.g. something like:
```python
can_draft_set_content_files = [
    IfTransferType({
        TransferType.LOCAL: can_review,
        TransferType.MULTIPART: [Disabled()],
        TransferType.FETCH: [SystemProcess()],
        TransferType.REMOTE: [Disabled()],
    })
]
```
I see some problems though:
- In OOP, having a `switch` on an enum is a bit of a code smell and hints towards actually moving the code implementation onto each class...
- On the other hand, permissions should be configurable by each instance, so if we moved the permissions logic to the `BaseTransfer` implementation for each transfer type, end-users wouldn't be able to customize them easily.
- When adding a new transfer type or modifying permissions for an existing type, one has to do a lot of work to update all the files-related permissions, without a good solution for reusing/extending existing values.
An elegant solution might be to have some sort of helper class that auto-generates and bundles permissions for each transfer type and action, and can be easily extended/overridden.
So to move forward, a possible path would be:
- Go with the explicit `IfTransferType` generator proposed above, which solves the problem
- See how it feels for someone implementing a new transfer type in the future
- Re-evaluate if we need something fancier
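For illustration, here is a minimal, framework-free sketch of what such an `IfTransferType` generator could look like. The toy `_StaticGenerator`, `Disabled`, and `SystemProcess` classes stand in for real invenio-access generators, and `needs()` returning plain strings is an assumption for readability:

```python
class _StaticGenerator:
    """Toy stand-in for an invenio-access permission generator."""

    def __init__(self, needs):
        self._needs = set(needs)

    def needs(self, **kwargs):
        return self._needs


class Disabled(_StaticGenerator):
    def __init__(self):
        super().__init__(set())  # grants nothing -> action is disabled


class SystemProcess(_StaticGenerator):
    def __init__(self):
        super().__init__({"system_process"})


class IfTransferType:
    """Dispatch to per-transfer-type sub-generators ("switch"-style)."""

    def __init__(self, mapping, default=()):
        self.mapping = mapping
        self.default = list(default)

    def needs(self, transfer_type=None, **kwargs):
        out = set()
        for generator in self.mapping.get(transfer_type, self.default):
            out |= generator.needs(transfer_type=transfer_type, **kwargs)
        return out


can_set_content = IfTransferType({
    "L": [_StaticGenerator({"review"})],   # LOCAL: curators with review rights
    "F": [SystemProcess()],                # FETCH: only the Celery task
    "R": [Disabled()],                     # REMOTE: never settable
})
```

The instance-level `mapping` is what keeps this configurable per deployment, which is the main argument above against pushing the logic down into `BaseTransfer` subclasses.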
> We can not directly implement this interface in storage backends (for example invenio-s3), as it would introduce a dependency in the wrong direction.
>
> Proposal: either move the MultipartStorageExt to invenio-files-rest/storage/pyfs or do not check
The semantics/naming of `MultipartStorageExt` look very much bound to the `init/content/commit` terminology, so they would look somewhat out of place in `invenio-files-rest`. Thus I'm leaning more towards keeping a "Protocol"-style coupling.
Another solution might be to move the relevant `invenio-s3` parts that implement the `MultipartStorageExt` interface into this module under "contrib" or similar, and do some `try/except` around imports to do feature detection.
Ok, let's keep the protocol style, I've made it more readable in the code.
> - [ ] MultipartStorageExt
question: Invenio-Files-REST comes with some existing models for storing multipart-related info, like e.g. `MultipartObject` and `Part`. It also implements the REST API for working with these, but it's very much bound to the Angular.js client-side library we picked at the time. For Zenodo (and all Invenio v3-based services at CERN) we also never used this implementation, mainly because of performance issues for multipart on our storage cluster at CERN.
I see that you've used `ObjectVersionTag`s for organizing the parts; was there a specific reason for not reusing the `MultipartObject`/`Part` classes? In any case, what matters is whether it works, but I could also see it as an opportunity to consider purging all the multipart models from files-rest :)
The primary reason for not using the MultipartObject was that its uploadId is of UUID type, which is not the case for S3, where it is a long string. From the S3 perspective, only the "MultipartObject" is relevant, as parts are managed directly by calls to S3 and Invenio does not know about them at all.
The implementation of MultipartObject/Part also merges "model"- and "service"-level concerns. The service level is the "generic" implementation for multipart-unaware backends that I've got, for example, at https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR184 - there is this "if contract, call storage, else generic impl" pattern.
So if the ObjectVersionTag solution is ok for you, I'd be happy to keep it like that.
Btw, having looked at the expired-upload task at https://github.com/inveniosoftware/invenio-files-rest/blob/master/invenio_files_rest/tasks.py#L272 - do you think it is necessary to clean up uploads that have not been touched for some time? Right now I handle only delete: if a file is deleted, its multipart upload (if pending) is aborted as well.
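A toy model of the `ObjectVersionTag` approach described here: multipart bookkeeping lives in string key/value tags attached to an object version, so S3's long opaque uploadId fits naturally, with no UUID-typed column involved. The tag keys and the dict-backed store are illustrative only:

```python
# In-memory stand-in for ObjectVersionTag rows: {object_version_id: {key: value}}.
_tags = {}


def set_tag(object_version_id, key, value):
    """Tags are plain strings, so any backend's uploadId format fits."""
    _tags.setdefault(object_version_id, {})[key] = str(value)


def get_tag(object_version_id, key):
    return _tags.get(object_version_id, {}).get(key)


ov = "ov-1"
set_tag(ov, "uploadId", "A1b2C3longOpaqueS3Token")  # not a UUID, and that's fine
set_tag(ov, "parts", 16)
```

Since S3 manages the individual parts itself, only this per-upload metadata needs to live on the Invenio side.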
I've just extracted the global code to the previously unused interface (going the "protocol" way); imho it looks better. https://github.com/mesemus/invenio-records-resources/blob/multipart-transfers/invenio_records_resources/services/files/transfer/providers/multipart.py
Description
This pull request discusses changing the transfer types to be extensible with new transfers.