Pluggable transfers for invenio-records-resources #76

rfcs/framework-0072-pluggable-transfers.md
- Start Date: 2024-03-22
- RFC PR: [#<PR>](https://github.com/inveniosoftware/rfcs/pull/<PR>)
- Authors: Mirek Simek <[email protected]>

# Allow pluggable transfers for record files

## Summary

Currently the transfer types are fixed to Local, Fetch and Remote and cannot be extended. Users might benefit from making this system extensible, so that they can write or use their own transfer types.

## Motivation

We would like to implement at least one additional transfer type - Multipart. This transfer type will allow users to upload big files split into multiple parts, and will be included as core functionality.

Another example might be an S3Copy transfer type, which would copy a file between two S3 buckets directly on the S3 cluster, without passing through the Invenio server.

## Detailed design

All the current functionality lives in `transfer.py` and related modules. `TransferType` is a non-extendable enum; transfers are extensions of `BaseTransfer` and are not pluggable (they are selected via if/then/else inside the `Transfer` class).

Proposal:

- Change `TransferType` from an enum to a normal class. Create constants for the built-in types - turn `TransferType.LOCAL` into `TRANSFER_TYPE_LOCAL`.
- Create a `TransferRegistry` class modelled after `ServiceRegistry`. Add a `current_transfer_registry` proxy (see the sketch at the end of this section).
- Modify the `Transfer` class to use the registry.
- Modify the sources to use the updated class and constants.

- Fix the `dump_status` method on the schema. The current implementation says:
```python
def dump_status(self, obj):
"""Dump file status."""
# due to time constraints the status check is done here
# however, ideally this class should not need knowledge of
# the TransferType class, it should be encapsulated at File
# wrapper class or lower.
```

Approaches to fix this (which one to take is still to be decided):

1. Create a `status=ModelField()` on the `FileRecord` class, with values `pending`, `completed`, `aborted`, `failed`.
   The value of the field would be stored inside the database as a new column on the file record.
2. Create a `status=DictField("status")` that would take the status from the file's metadata (`.model.json`).
3. Create a `status=StatusSystemField()` that would get the file's transfer and delegate the status query to it.

The first two options add a value stored in the database but are cleaner; option 3 would create a reference from `invenio_records_resources.records` to `invenio_records_resources.services.files`.
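
For illustration, here is a minimal sketch of the proposed constants and registry, assuming an API modelled after `ServiceRegistry`; the names follow the proposal above, but the exact shape (and the string values of the constants) are assumptions, not a finished design:

```python
# Sketch only: proposed constants and registry. The registration API and the
# string values of the constants are assumptions, not a finished design.

TRANSFER_TYPE_LOCAL = "L"   # was TransferType.LOCAL; value illustrative
TRANSFER_TYPE_FETCH = "F"   # was TransferType.FETCH; value illustrative
TRANSFER_TYPE_REMOTE = "R"  # was TransferType.REMOTE; value illustrative


class TransferRegistry:
    """Registry mapping transfer type names to BaseTransfer subclasses."""

    def __init__(self):
        self._transfers = {}

    def register(self, transfer_cls):
        """Register a BaseTransfer subclass under its declared transfer type."""
        transfer_type = transfer_cls.transfer_type
        if transfer_type in self._transfers:
            raise RuntimeError(f"Transfer '{transfer_type}' is already registered.")
        self._transfers[transfer_type] = transfer_cls

    def get_transfer_cls(self, transfer_type):
        """Return the transfer class registered for the given type."""
        return self._transfers[transfer_type]
```

The `Transfer` class would then resolve implementations through the `current_transfer_registry` proxy instead of the current if/then/else, and extensions could register new types at application initialization.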

## Example

Another example might be a specialized transfer type for handling sensitive data in the Crypt4GH file format before it is stored in the repository (handling encryption keys etc.).

## How we teach this

This is rather an internal feature (at the same level as services, service configs etc.). We might add a section to [developer topics](https://inveniordm.docs.cern.ch/develop/topics/) or just keep the API docs.

## Drawbacks

Slightly more complicated code.

## Alternatives

The only alternative is having separate API endpoints for handling the use cases above. This RFC provides a cleaner solution.

## Unresolved questions

- [ ] Decide how the `status` field should be implemented

**Alex's suggestion**

For the decision on the status field, I have a slight preference for option 3 (i.e. computing/querying dynamically instead of storing), since I could imagine the `BaseTransfer` subclasses implementing the logic... Though I see that this introduces a dependency in the wrong direction... Would it make sense to remove `status` from the record class, and just compute/dump it in the service Marshmallow schema? Maybe a third person without all the past implementation baggage could comment on a cleaner solution 🙂
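
A minimal sketch of that suggestion, assuming the `current_transfer_registry` proxy proposed above; the `get_transfer()` call, the `transfer_type` attribute and the `status` property are illustrative assumptions:

```python
# Sketch only: compute the status at the service schema level, so the record
# class needs no knowledge of transfer types. Names below are illustrative.

from marshmallow import Schema, fields


class FileSchema(Schema):
    """Service-level file schema that derives `status` instead of storing it."""

    status = fields.Method("dump_status")

    def dump_status(self, obj):
        """Delegate the status query to the file's transfer implementation."""
        transfer = current_transfer_registry.get_transfer(  # proposed proxy
            obj.transfer_type, file_record=obj
        )
        return transfer.status  # e.g. "pending" or "completed"
```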

- [ ] Alex: Then another less critical concern:

Maybe there should be a way to configure different transfer types per file service... I could imagine e.g. not wanting to have remote files for community logos. But TBH, given how hidden the feature is, we could roll with having everything available by default across all services, and then later on add the config if we see weird cases emerging.

- [ ] Clarify "storage_class" vs. "transfer_type"

It seems that inside invenio_files the storage_class was meant to specify the details of the storage backend (as the tests suggest, at least) - for example 'A' as 'Archive'.

In invenio_records_resources, the storage_class on the initial schema means the type of transfer that will be used (Local, Fetch, Remote). This is independent of the storage backend and is a bit confusing.
**Comment (Member, on lines +78 to +86):**

👍 We should really make it clear that transfer_type is the canonical term, both in terms of user input in the Marshmallow schema, but also throughout the API and data model. I think that also means that we should store transfer_type in the DraftFileRecord.json or even a new DraftFileRecord.transfer_type column... My feeling is that storage_class was "overloaded"/multi-purposed to avoid having to store the transfer type in a new field (which is not an ideal solution given the semantics).

This rename is a backward-incompatible change, but given the low usage of the feature (and the fact that we'll probably go with a major version bump) I think it's reasonable to get it right now.


storage_class is something that the user shouldn't be able to control or know about (though IIRC there were discussions in the workshop about having a choice for it in the upload UI).

**Comment (Member):**

The original intention of storage class was to be able to record whether e.g. a file was online/offline, so you could use it to decide if it should be fetched, check it for integrity, etc. It was never really used, however.


- [ ] FileSchema.uri vs. links section

For the initialization of files, `uri` can be used to specify the location of the file for REMOTE/FETCH transfers. Later on, clients can use this URI for these transfers to get direct access to the referenced data.

During "dumping" of the file record, a links section with a `content` URI is created. This is a bit confusing, as clients do not know which URL should be used.

Proposal: keep the uri only inside the initial file record and do not serialize it afterwards. Clients should use the links section for all URIs. The content URL can return an HTTP 302 redirect to the actual file location; this way counters will work as well.
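
A minimal sketch of the proposed behaviour, assuming the `TRANSFER_TYPE_REMOTE` constant from above; the statistics hook and the local streaming helper are hypothetical placeholders:

```python
# Sketch only: the content endpoint answers REMOTE files with a 302 redirect
# instead of ever serializing `uri`, so counters can still be incremented.

from flask import redirect


def read_content(file_record):
    """Serve file content; remote files get a redirect to their real location."""
    emit_file_download_event(file_record)  # hypothetical usage-statistics hook
    if file_record.transfer_type == TRANSFER_TYPE_REMOTE:  # constant from above
        return redirect(file_record.uri, code=302)
    return send_local_file(file_record)  # hypothetical local streaming helper
```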
**Comment (Member, on lines +88 to +100):**

Agree on not exposing `uri`, and keeping `links.content` pointing to `/api/records/{id}/files/{filename}/content` to be able to keep track of usage statistics (cc @tmorrell I've updated the File Linking API discussion to take usage statistics into account).

Some things to keep in mind:

- For FETCH, the `uri` is stored to know the location from where to fetch the file; the file will later be stored as a regular file and have a `FileInstance.uri` of a path "on disk".
- For REMOTE, the `uri` is actually passed all the way through and stored in `FileInstance.uri` (see also the PoC PR for RemoteTransfer).

**Comment (Author):**

+1, will hide it


- [ ] FETCH transfer type and private URIs

The FETCH transfer type is used to fetch the file from a remote location. The uri of the external location is visible to anyone who has read access to the record, not just the uploader. As the uri might contain credentials (e.g. an S3 signed URL), this might be a security issue.

Proposal: do not serialize the uri of the FETCH transfer type. The fetch_file task would have to be changed not to use the service but to fetch the uri directly.
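
A minimal sketch of the changed task (not the current implementation); the lookup and save helpers are hypothetical:

```python
# Sketch only: the fetch task reads the uri straight from the file record
# instead of going through the service layer, so the uri never has to be
# serialized to clients.

import requests
from celery import shared_task


@shared_task(ignore_result=True)
def fetch_file(bucket_id, file_key):
    """Download the remote content for a FETCH-type draft file."""
    file_record = lookup_draft_file(bucket_id, file_key)  # hypothetical lookup
    with requests.get(file_record.uri, stream=True) as response:
        response.raise_for_status()
        save_file_content(file_record, response.raw)  # hypothetical save helper
```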
**Comment (Member, on lines +102 to +110):**

As mentioned above, agree on hiding the uri. There's still the question of whether we should also completely remove the uri after the fetch, since one might want to know the original provenance of the fetched file.

But I think this is out of scope for this RFC (unless we see it heavily affecting the implementation).


- [ ] IfFileIsLocal generators

The permissions in https://github.com/inveniosoftware/invenio-records-resources/blob/master/invenio_records_resources/services/files/generators.py#L27 take into account only files with storage class (== transfer type) "L". This means that the permissions will fail for "M" (Multipart transfer) files. Is this intended?

The same problem exists in invenio-rdm-records/services/generators.py.

Proposal: either change the meaning of "Locality" and keep the permissions as they are, or change the permissions to take all files into account. Example from RDM that I do not understand (`can_draft_set_content_files`: why is the `can_review` permission used here and not the record owner?):

```python
can_draft_create_files = can_review
can_draft_set_content_files = [
    # review is the same as create_files
    IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
]
can_draft_get_content_files = [
    # preview is the same as read_files
    IfFileIsLocal(then_=can_draft_read_files, else_=[SystemProcess()])
]
can_draft_commit_files = [
    # review is the same as create_files
    IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
]
```

**Comment (Member):**

`can_review` is a bit convoluted, since it also represents community review request curators who have edit rights to the unpublished draft (e.g. for fixing metadata, or possibly correcting file formats from XLS to CSV, etc.).

**Comment (Member):**

Agree, the definition of "Locality" here is flawed/confusing; it was probably labeled as such to be able to enforce the FETCH workflow via permissions, i.e. that while fetching the file, only the system (via the background Celery task) can set the contents of the draft file.

I think the more logical and explicit way to deal with permissions would be to have some sort of "switch"-style permissions generator for each transfer type, e.g. something like:

```python
can_draft_set_content_files = [
    IfTransferType({
        TransferType.LOCAL: can_review,
        TransferType.MULTIPART: [Disabled()],
        TransferType.FETCH: [SystemProcess()],
        TransferType.REMOTE: [Disabled()],
    })
]
```

I see some problems though:

- In OOP, having a switch on an enum is a bit of a code smell and hints towards actually moving the implementation onto each class...
  - On the other hand, permissions should be configurable by each instance, so if we moved the permissions logic to the `BaseTransfer` implementation for each transfer type, end-users wouldn't be able to customize them easily.
- When adding a new transfer type or modifying permissions for an existing type, one has to do a lot of work to update all the files-related permissions, without a good solution for reusing/extending existing values.

An elegant solution might be to have some sort of helper class that auto-generates and bundles permissions for each transfer type and action, and can be easily extended/overridden.

So to move forward, a possible path would be:

1. Go with the explicit `IfTransferType` generator proposed above, which solves the problem.
2. See how it feels for someone implementing a new transfer type in the future.
3. Re-evaluate if we need something fancier.
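
A minimal sketch of such a generator, assuming the `Generator` base class from invenio-records-permissions; how the transfer type is read off the record is an assumption of this sketch:

```python
# Sketch only: a "switch"-style permission generator that delegates to the
# generators configured for a file's transfer type.

from invenio_records_permissions.generators import Generator


class IfTransferType(Generator):
    """Delegate needs to the generators configured for a file's transfer type."""

    def __init__(self, transfer_type_generators):
        self._generators = transfer_type_generators  # mapping: type -> [Generator]

    def needs(self, record=None, **kwargs):
        """Collect needs from the generators matching the record's transfer type."""
        transfer_type = getattr(record, "transfer_type", None)  # illustrative
        needs = set()
        for generator in self._generators.get(transfer_type, []):
            needs.update(generator.needs(record=record, **kwargs))
        return list(needs)
```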

- [ ] MultipartStorageExt

In order to make direct multipart upload to remote storage work, we need a PyFS storage to implement multipart-aware methods. These are currently summarized in the `MultipartStorageExt` interface that lives in invenio-records-resources: [MultipartStorageExt class in pull request](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR14)

We cannot implement this interface directly in storage backends (for example invenio-s3), as it would introduce a dependency in the wrong direction.

Proposal: either move `MultipartStorageExt` to invenio-files-rest/storage/pyfs, or do not check for the interface but just check for the implemented methods (currently implemented this way in the PoC; see for example the [check for multipart_set_content method](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR177)). A sketch of the protocol-style option follows the comments below.

**Comment (Member):**

question: Invenio-Files-REST comes with some existing models for storing multipart-related info, e.g. `MultipartObject` and `Part`. It also implements the REST API for working with these, but it's very much bound to the Angular.js client-side library we picked at the time. For Zenodo (and all Invenio v3-based services at CERN) we also never used this implementation, but that was because of performance issues for multipart on our storage cluster at CERN.

I see that you've used `ObjectVersionTag`s for organizing the parts; was there a specific reason for not reusing the `MultipartObject`/`Part` classes? In any case, what matters is if it works, but I could also see it as an opportunity to consider purging all the multipart models from files-rest :)

**Comment (Author):**

The primary reason for not using `MultipartObject` was that the `uploadId` is a uuid type, which is not the case for S3 - there it is a long string. From the S3 perspective, only the "MultipartObject" is relevant, as parts are managed directly by calls to S3 and Invenio does not know about them at all.

The implementation of `MultipartObject`/`Part` also merges "model"- and "service"-level concerns - the service level is the "generic" implementation for multipart-unaware backends that I've got, for example, at https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR184 - there is this "if contract, call storage, else generic impl" pattern.

So if the `ObjectVersionTag` solution is ok for you, I'd be happy to keep it like that.

Btw, having looked at the expired upload task at https://github.com/inveniosoftware/invenio-files-rest/blob/master/invenio_files_rest/tasks.py#L272 - do you think it is necessary to clean up uploads that have not been touched for some time? Right now I handle only delete: if a file is deleted, its multipart upload (if pending) is aborted as well.

**Comment (Author):**

I've just extracted the global code to a previously unused interface (going the "protocol" way); imho it looks better. https://github.com/mesemus/invenio-records-resources/blob/multipart-transfers/invenio_records_resources/services/files/transfer/providers/multipart.py

**Comment (Member):**

The semantics/naming of `MultipartStorageExt` look very much bound to the init/content/commit terminology, so they would look somewhat out of place in invenio-files-rest. Thus I'm leaning more towards keeping a "Protocol"-style coupling.

Another solution might be to move the relevant invenio-s3 parts that implement the `MultipartStorageExt` interface into this module under "contrib" or similar, and do some try/except around imports for feature detection.

**Comment (Author):**

Ok, let's keep the protocol style; I've made it more readable in the code.
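
A minimal sketch of the protocol-style coupling with runtime feature detection; the method names besides `multipart_set_content` are illustrative (see the PoC PR for the actual signatures), and the generic fallback is hypothetical:

```python
# Sketch only: Protocol-style coupling. Backends opt in simply by
# implementing the methods; no import-time dependency is needed.

from typing import Protocol, runtime_checkable


@runtime_checkable
class MultipartStorageExt(Protocol):
    """Methods a PyFS storage may implement to support direct multipart upload."""

    def multipart_initialize_upload(self, parts, size, part_size): ...

    def multipart_set_content(self, part, stream, content_length): ...

    def multipart_commit_upload(self): ...


def set_content(storage, part, stream, content_length):
    """Prefer the backend's native multipart support, else a generic fallback."""
    # isinstance() on a runtime_checkable Protocol only checks that the methods
    # exist, which matches the "check for the implemented methods" option above
    if isinstance(storage, MultipartStorageExt):
        return storage.multipart_set_content(part, stream, content_length)
    return generic_set_content(storage, part, stream, content_length)  # hypothetical fallback
```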



## Resources/Timeline

Both CESNET and Munster need multipart upload, which depends on this feature, on a short timeline (May 2024). Both are willing to put their resources into the implementation and testing.