Pluggable transfers for invenio-records-resources #76
- Start Date: 2024-03-22
- RFC PR: [#<PR>](https://github.com/inveniosoftware/rfcs/pull/<PR>)
- Authors: Mirek Simek <[email protected]>

# Allow pluggable transfers for record files

## Summary

Currently the transfer types are fixed to Local, Fetch and Remote and cannot be extended. Making this system extensible would let users write and plug in their own transfer types.
## Motivation

We would like to implement at least one additional transfer type, Multipart. It will allow users to upload big files split into multiple parts, and will be included as core functionality.

Another example might be an S3Copy transfer type, which would copy a file between two S3 buckets directly on the S3 cluster, without passing it through the Invenio server.

## Detailed design

All the current functionality lives in `transfer.py` and related modules. `TransferType` is a non-extendable enum, and transfers are subclasses of `BaseTransfer` that are not pluggable (they are selected via if/then/else inside the `Transfer` class).
Proposal:

- Change `TransferType` from an enum to a normal class. Create constants for the built-in types, e.g. turn `TransferType.LOCAL` into `TRANSFER_TYPE_LOCAL`.
- Create a `TransferRegistry` class modelled after the `ServiceRegistry`, and add a `current_transfer_registry` proxy (a sketch follows this list).
- Modify the `Transfer` class to use the registry.
- Modify the sources to use the updated class and constants.
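To make the registry bullet concrete, here is a minimal sketch modelled after the existing `ServiceRegistry`; all names (the `transfer_type` class attribute, the `transfer_registry` extension attribute) are illustrative assumptions, not a final API.

```python
# A minimal sketch (not the final API): a registry modelled after the existing
# ServiceRegistry, plus a proxy analogous to current_service_registry.
from flask import current_app
from werkzeug.local import LocalProxy


class TransferRegistry:
    """Maps transfer type names (e.g. "L", "F", "R", "M") to transfer classes."""

    def __init__(self):
        self._transfers = {}

    def register(self, transfer_cls):
        """Register a new transfer class under its declared transfer type."""
        transfer_type = transfer_cls.transfer_type
        if transfer_type in self._transfers:
            raise RuntimeError(
                f"Transfer type '{transfer_type}' is already registered."
            )
        self._transfers[transfer_type] = transfer_cls

    def get_transfer(self, transfer_type, **kwargs):
        """Instantiate the transfer class registered for the given type."""
        return self._transfers[transfer_type](**kwargs)

    def get_transfer_types(self):
        """Iterate over registered transfer type names."""
        return self._transfers.keys()


# Proxy to the registry held by the invenio-records-resources extension
# (assuming it is attached to the extension the same way the service registry is).
current_transfer_registry = LocalProxy(
    lambda: current_app.extensions["invenio-records-resources"].transfer_registry
)
```

The built-in transfers (Local, Fetch, Remote and later Multipart) would be registered at application init, and the `Transfer` class would resolve them through `current_transfer_registry` instead of the current if/then/else.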
- Fix the `dump_status` method on the schema. The current implementation says:

```python
def dump_status(self, obj):
    """Dump file status."""
    # due to time constraints the status check is done here
    # however, ideally this class should not need knowledge of
    # the TransferType class, it should be encapsulated at File
    # wrapper class or lower.
```
Approaches to fix the `dump_status` handling (to be decided which one to take):

1. Create a `status=ModelField()` on the `FileRecord` class, with values `pending`, `completed`, `aborted`, `failed`. The value of the field would be stored in the database as a new column on the file record.
2. Create a `status=DictField("status")` that would take the status from the file's metadata (`.model.json`).
3. Create a `status=StatusSystemField()` that would look up the file's transfer and delegate the status query to it.

The first two options store an extra value in the database but are cleaner dependency-wise; option 3 would create a reference from `invenio_records_resources.records` to `invenio_records_resources.services.files`.
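For illustration, a rough sketch of option 3; every name here (`StatusSystemField`, the `status` property on the transfer, the registry proxy, how the transfer type is read off the file record) is an assumption made for the example, not an agreed API.

```python
# Rough sketch of option 3 - all names are assumptions, not an agreed API.
from invenio_records.systemfields import SystemField


class StatusSystemField(SystemField):
    """File status computed by delegating to the file's transfer provider."""

    def __get__(self, record, owner=None):
        if record is None:
            return self  # class-level access
        # Importing from the services layer here is exactly the
        # records -> services.files dependency that makes this option debatable.
        from invenio_records_resources.proxies import current_transfer_registry

        transfer = current_transfer_registry.get_transfer(
            record.file.storage_class,  # assumed to carry the transfer type
            file_record=record,
        )
        return transfer.status
```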
## Example

Another example might be a specialized transfer for handling sensitive data in the Crypt4GH file format before the files are stored in the repository (handling encryption keys, etc.).

## How we teach this

This is rather an internal feature (at the same level as services, service configs, etc.). We might add a section to [developer topics](https://inveniordm.docs.cern.ch/develop/topics/) or just keep the API docs.

## Drawbacks

Slightly more complicated code.
## Alternatives

The only alternative is having separate API endpoints for handling the use cases above. This RFC provides a cleaner solution.

## Unresolved questions

- [ ] Decide how the `status` field should be implemented
**Alex's suggestion:** For the decision on the status field, I have a slight preference for #3 (i.e. computing/querying dynamically instead of storing), since I could imagine the `BaseTransfer` sub-classes implementing the logic... Though I see that this introduces a dependency in the wrong direction... Would it make sense to maybe remove status from the record class, and just compute/dump it on the service marshmallow schema? Maybe a 3rd person without all the past implementation baggage could comment on a cleaner solution 🙂
- [ ] Alex: Then another, less critical concern:

Maybe there should be a way to configure different transfer types per file service... I could imagine e.g. not wanting to have remote files for community logos. But TBH, given how hidden the feature is, we could roll with having everything available by default across all services, and then later on add the config if we see weird cases emerging.
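Purely hypothetical sketch of what such a per-service switch could look like; the `allowed_transfer_types` config attribute does not exist today and is only meant to illustrate the idea.

```python
# Hypothetical per-service restriction of transfer types (not an existing option).
from invenio_records_resources.services import FileServiceConfig

TRANSFER_TYPE_LOCAL = "L"  # built-in constant as proposed above


class CommunityLogoFileServiceConfig(FileServiceConfig):
    """Example: community logos should only ever be uploaded locally."""

    allowed_transfer_types = [TRANSFER_TYPE_LOCAL]
```

The file service's init-files step would then reject requests whose transfer type is not in the list.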
- [ ] Clarify "storage_class" vs "transfer_type"

It seems that inside invenio_files the storage_class was meant to specify the details of the storage backend (as the tests suggest, at least) - for example 'A' as 'Archive'.

In invenio_records_resources, the storage_class on the initial schema means the type of transfer that will be used (Local, Fetch, Remote). This is independent of the storage backend and is a bit confusing.
- [ ] FileSchema.uri vs. links section

For the initialization of files, 'uri' can be used to specify the location of the file for REMOTE/FETCH transfers. Later on, clients can use this URI for these transfers to get direct access to the referenced data.

During "dumping" of the file record, a links section with a `content` URL is created. This is a bit confusing, as clients do not know which URL should be used.

Proposal: keep the uri only inside the initial file record and do not serialize it afterwards. Clients should use the links section for all URIs. The content URL can return an HTTP 302 redirect to the actual file location; this way counters will work as well (see the sketch below).
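A minimal sketch of the redirect idea, assuming an illustrative `transfer_type` and `uri` attribute on the file record and a `send_file()` helper for local files (none of these names are the current API):

```python
# Sketch only: serve local files directly, redirect everything else.
from flask import redirect

TRANSFER_TYPE_LOCAL = "L"


def serve_file_content(file_record):
    """Content endpoint behaviour under the proposal."""
    if file_record.transfer_type == TRANSFER_TYPE_LOCAL:
        return file_record.send_file()  # stream bytes from local storage
    # Remote/fetch files: the request still hits Invenio first (so counters
    # can be updated), then the client is redirected to the real location.
    return redirect(file_record.uri, code=302)
```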
**Review comment:** Agree on not exposing it. Some things to keep in mind: …

**Review reply:** +1, will hide it.
- [ ] FETCH transfer type and private URIs

The FETCH transfer type is used to fetch the file from a remote location. The URI of the external location is visible to anyone who has read access to the record, not just the uploader. As the URI might contain credentials (e.g. an S3 signed URL), this might be a security issue.

Proposal: do not serialize the URI of the FETCH transfer type. The fetch_file task would have to be changed to not use the service but fetch the URI directly (a sketch follows).
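One possible way to implement the hiding at dump time, sketched with a marshmallow `post_dump` hook on a cut-down illustrative schema (the real `FileSchema` is richer; field and constant names are assumptions):

```python
# Sketch (assumption, not the current implementation): drop the `uri` for
# FETCH files when dumping, so signed URLs are never exposed.
from marshmallow import Schema, fields, post_dump

TRANSFER_TYPE_FETCH = "F"


class FileSchemaSketch(Schema):
    """Illustrative subset of a file schema."""

    key = fields.String()
    uri = fields.String()
    storage_class = fields.String()

    @post_dump
    def hide_fetch_uri(self, data, **kwargs):
        # The source URI of FETCH files may contain credentials (e.g. S3
        # signed URLs), so it is removed from the dumped representation.
        if data.get("storage_class") == TRANSFER_TYPE_FETCH:
            data.pop("uri", None)
        return data
```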
**Review comment:** As mentioned above, agree on hiding it. But I think this is out of scope for this RFC (unless we see it heavily affecting the implementation).
- [ ] IfFileIsLocal generators

The permissions in https://github.com/inveniosoftware/invenio-records-resources/blob/master/invenio_records_resources/services/files/generators.py#L27 take into account only files with storage class (== transfer type) "L". This means that the permissions will fail for "M" (Multipart transfer) files. Is this intended?

The same problem exists in invenio-rdm-records/services/generators.py.
Proposal: either change the meaning of "Locality" and keep the permissions as they are, or change the permissions to take all files into account. Example from RDM that I do not understand (can_draft_set_content_files - why is there the can_review permission and not the record owner?):

```python
can_draft_create_files = can_review
can_draft_set_content_files = [
    # review is the same as create_files
    IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
]
can_draft_get_content_files = [
    # preview is same as read_files
    IfFileIsLocal(then_=can_draft_read_files, else_=[SystemProcess()])
]
can_draft_commit_files = [
    # review is the same as create_files
    IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
]
```

**Review comment:** Agree, the definition of "Locality" here is flawed/confusing and was probably labeled as such to be able to enforce via permissions the … I think the more logical and explicit way to deal with permissions would be to have some sort of "switch"-style permissions generator for each transfer type, e.g. something like:

```python
can_draft_set_content_files = [
    IfTransferType({
        TransferType.LOCAL: can_review,
        TransferType.MULTIPART: [Disabled()],
        TransferType.FETCH: [SystemProcess()],
        TransferType.REMOTE: [Disabled()],
    })
]
```

I see some problems though: … An elegant solution might be to have some sort of helper class that auto-generates and bundles permissions for each transfer type and action, and can be easily extended/overridden. So to move forward, a possible path would be: …
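A hypothetical sketch of how the `IfTransferType` generator suggested above could be implemented on top of invenio-records-permissions; how the file record and its transfer type are resolved from the permission `kwargs` is an assumption for illustration.

```python
# Hypothetical sketch of a "switch"-style generator per transfer type.
from invenio_records_permissions.generators import Generator


class IfTransferType(Generator):
    """Dispatch to different permission generators per transfer type."""

    def __init__(self, mapping, else_=None):
        # Accept either a single generator or a list per transfer type.
        self._mapping = {
            k: v if isinstance(v, list) else [v] for k, v in mapping.items()
        }
        self._else = else_ or []

    def _generators(self, **kwargs):
        # Assumption: the file record is passed in kwargs and exposes its
        # transfer type; the real lookup would depend on the final data model.
        file_record = kwargs.get("file_record")
        transfer_type = getattr(file_record, "transfer_type", None)
        return self._mapping.get(transfer_type, self._else)

    def needs(self, **kwargs):
        return [n for g in self._generators(**kwargs) for n in g.needs(**kwargs)]

    def excludes(self, **kwargs):
        return [n for g in self._generators(**kwargs) for n in g.excludes(**kwargs)]
```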
- [ ] MultipartStorageExt

**Review comment:** Question: Invenio-Files-REST comes with some existing models for storing multipart-related info, like e.g. `MultipartObject`. I see that you've used `ObjectVersionTag` instead.

**Review reply:** The primary reason for not using the MultipartObject was that the uploadId is a uuid type, which is not the case for S3 - there it is a long string. From the S3 perspective, only the "MultipartObject" is relevant, as parts are managed directly by calls to S3 and Invenio does not know about them at all. The implementation of MultipartObject/Part also merges "model" and "service" level stuff - the service level is the "generic" implementation for multipart-unaware backends that I've got, for example, at https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR184 - there is this "if contract, call storage, else generic impl" pattern. So if the ObjectVersionTag solution is ok for you, I'd be happy to keep it like that. Btw, having looked at the expired-upload task at https://github.com/inveniosoftware/invenio-files-rest/blob/master/invenio_files_rest/tasks.py#L272 - do you think it is necessary to clean up uploads that have not been touched for some time? Right now I handle only delete - if a file is deleted, its multipart upload (if pending) is aborted as well.

**Review reply:** I've just extracted the generic code to a previously unused interface (going the "protocol" way), imho it looks better: https://github.com/mesemus/invenio-records-resources/blob/multipart-transfers/invenio_records_resources/services/files/transfer/providers/multipart.py
In order to make direct multipart upload to remote storage work, we need a PyFS storage to implement multipart-aware methods. These are currently summarized in the MultipartStorageExt interface that lives in invenio-records-resources - [MultipartStorageExt class in pull request](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR14).

We cannot directly implement this interface in storage backends (for example invenio-s3), as it would introduce a dependency in the wrong direction.
Proposal: either move the MultipartStorageExt to invenio-files-rest/storage/pyfs, or do not check for the interface but just check for the implemented methods (currently implemented this way in the POC implementation - see for example the [check for the multipart_set_content method](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR177)). A sketch of the method-check approach is shown below.

**Review comment:** The semantics/naming of … Another solution might be to move the relevant …

**Review reply:** Ok, let's keep the protocol style, I've made it more readable in the code.
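A sketch of the "check for implemented methods" variant, combining a typing `Protocol` with a plain `hasattr` check; apart from `multipart_set_content` (referenced above), the names are assumptions taken loosely from the POC.

```python
# Sketch of duck-typed dispatch to multipart-aware storage backends.
from typing import Protocol, runtime_checkable


@runtime_checkable
class MultipartStorageExt(Protocol):
    """Optional multipart extension of a storage backend.

    Other multipart_* methods from the POC are omitted in this sketch.
    """

    def multipart_set_content(self, part, stream, content_length):
        ...


def set_part_content(storage, part, stream, content_length):
    """Delegate to the backend if it is multipart-aware, else use a generic path."""
    if hasattr(storage, "multipart_set_content"):
        # Multipart-aware backend (e.g. S3): let the storage handle the part.
        return storage.multipart_set_content(part, stream, content_length)
    # Generic fallback for multipart-unaware backends (not shown in this sketch).
    raise NotImplementedError("generic part upload is implemented in the POC")
```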
## Resources/Timeline

Both CESNET and Munster need multipart upload, which depends on this feature, on a short timeline (May 2024). Both CESNET and Munster are willing to put their resources into the implementation and testing.
**Review comment:** 👍 We should really make it clear that `transfer_type` is the canonical term, both in terms of user input in the Marshmallow schema and throughout the API and data model. I think that also means we should store `transfer_type` in the `DraftFileRecord.json` or even a new `DraftFileRecord.transfer_type` column... My feeling is that `storage_class` was "overloaded"/multi-purposed to avoid having to store the transfer type in a new field (which is not an ideal solution given the semantics). This rename is a backward-incompatible change, but given the low usage of the feature (and the fact that we'll probably go with a major version bump) I think it's reasonable to get it right now. `storage_class` is something that the user shouldn't be able to control or know about (though IIRC there were discussions in the workshop about having a choice for it in the upload UI).
**Review reply:** The original intention of storage class was to be able to record whether e.g. a file was online/offline, so you could use it to make decisions about whether it should be fetched, or check for integrity, etc. ... it was never really used, however.