From e8b0cab2062ca5542cdceef2aa49fd49e4617e20 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Fri, 22 Mar 2024 10:51:01 +0100
Subject: [PATCH 1/8] Create framework-0072-pluggable-transfers

---
 rfcs/framework-0072-pluggable-transfers | 59 +++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 rfcs/framework-0072-pluggable-transfers

diff --git a/rfcs/framework-0072-pluggable-transfers b/rfcs/framework-0072-pluggable-transfers
new file mode 100644
index 0000000..b07a17e
--- /dev/null
+++ b/rfcs/framework-0072-pluggable-transfers
@@ -0,0 +1,59 @@
+- Start Date: 2024-03-22
+- RFC PR: [#](https://github.com/inveniosoftware/rfcs/pull/)
+- Authors: Mirek Simek
+
+# Allow pluggable transfers for record files
+
+## Summary
+
+Currently the transfer types are fixed to Local, Fetch and Remote and can not be extended.
+Users might benefit from having this system extensible so that they can write/use
+their own transfer types.
+We would like to implement other types - Multipart and S3Copy. Pluggable mechanism
+will
+
+## Motivation
+
+> Why are we doing this? What user stories does it support? What is the expected outcome?
+
+- As a ..., I want to ..., so that I ... .
+
+## Detailed design
+
+> This is the bulk of the RFC.
+
+> Explain the design in enough detail for somebody familiar with the framework to understand, and for somebody familiar with the implementation to implement. This should get into specifics and corner-cases, and include examples of how the feature is used. Any new terminology should be defined here.
+
+## Example
+
+> Show a concrete example of what the RFC implies. This will make the consequences of adopting the RFC much clearer.
+
+> As with other sections, use it if it makes sense for your RFC.
+
+## How we teach this
+
+> What names and terminology work best for these concepts and why? How is this idea best presented? As a continuation of existing Invenio patterns, or as a wholly new one?
+
+> Would the acceptance of this proposal mean the Invenio documentation must be re-organized or altered? Does it change how Invenio is taught to new users at any level?
+
+> How should this feature be introduced and taught to existing Invenio users?
+
+## Drawbacks
+
+> Why should we *not* do this? Please consider the impact on teaching Invenio, on the integration of this feature with other existing and planned features, on the impact of the API churn on existing apps, etc.
+
+> There are tradeoffs to choosing any path, please attempt to identify them here.
+
+## Alternatives
+
+> What other designs have been considered? What is the impact of not doing this?
+
+> This section could also include prior art, that is, how other frameworks in the same domain have solved this problem.
+
+## Unresolved questions
+
+> Optional, but suggested for first drafts. What parts of the design are still TBD? Use it as a todo list for the RFC.
+
+## Resources/Timeline
+
+> Which resources do you have available to implement this RFC and what is the overall timeline?
From c09145b6de31e62ae01361533de612430586161c Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Fri, 22 Mar 2024 10:51:20 +0100
Subject: [PATCH 2/8] Rename framework-0072-pluggable-transfers to framework-0072-pluggable-transfers.md

---
 ...-pluggable-transfers => framework-0072-pluggable-transfers.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename rfcs/{framework-0072-pluggable-transfers => framework-0072-pluggable-transfers.md} (100%)

diff --git a/rfcs/framework-0072-pluggable-transfers b/rfcs/framework-0072-pluggable-transfers.md
similarity index 100%
rename from rfcs/framework-0072-pluggable-transfers
rename to rfcs/framework-0072-pluggable-transfers.md

From 515bb927f4207dcc5a95c57db292384192e9f8c9 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Sat, 23 Mar 2024 09:00:45 +0100
Subject: [PATCH 3/8] Update framework-0072-pluggable-transfers.md

---
 rfcs/framework-0072-pluggable-transfers.md | 44 ++++++++++++++++++----
 1 file changed, 37 insertions(+), 7 deletions(-)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index b07a17e..1d7f526 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -9,20 +9,50 @@
 Currently the transfer types are fixed to Local, Fetch and Remote and can not be extended.
 Users might benefit from having this system extensible so that they can write/use
 their own transfer types.
-We would like to implement other types - Multipart and S3Copy. Pluggable mechanism
-will
 
 ## Motivation
 
-> Why are we doing this? What user stories does it support? What is the expected outcome?
+We would like to implement at least one additional transfer type - Multipart. This
+transfer type will allow users to upload big files split into multiple parts and
+will be contained as a core functionality.
 
-- As a ..., I want to ..., so that I ... .
+Another example might be a S3Copy transfer type, which would copy the file between
+two S3 buckets directly on S3 cluster, without passing through invenio server.
 
 ## Detailed design
 
-> This is the bulk of the RFC.
-
-> Explain the design in enough detail for somebody familiar with the framework to understand, and for somebody familiar with the implementation to implement. This should get into specifics and corner-cases, and include examples of how the feature is used. Any new terminology should be defined here.
+All the current functionality is in `transfer.py` and related modules. The `TransferType`
+is a non-extendable enum, transfers extensions of `BaseTransfer` and not pluggable
+(if/then/else inside the `Transfer` class).
+
+Proposal:
+
+* Change the `TransferType` from enum to a normal class. Create constants for the built-in
+  types - turn `TransferType.LOCAL` to `TRANSFER_TYPE_LOCAL`
+* Create a `TransferRegistry` class modelled after ServiceRegistry. Add `current_transfer_registry`
+  proxy.
+* Modify the `Transfer` class to use the registry
+* Modify the sources to use the updated class and constants
+
+* Fix the `dump_status` method on schema. The current implementation says:
+```
+    def dump_status(self, obj):
+        """Dump file status."""
+        # due to time constraints the status check is done here
+        # however, ideally this class should not need knowledge of
+        # the TransferType class, it should be encapsulated at File
+        # wrapper class or lower.
+```
+
+Approaches to fix this (to be decided which one to take):
+
+1. Create a `status=ModelField()` on the FileRecord class, with values `pending`, `completed`, `aborted`, `failed` .
+   The value of the field would be stored inside the database in the file record
+2. Create a `status=StatusSystemField()` that would fetch/store the status from the file's metadata (.model.json)
+3. Create a `status=StatusSystemField()` that would get the file's transfer and delegate the status to it.
+
+The first two options add a value stored to the database but are cleaner, #3 would create a reference
+from invenio_records_resources.records to invenio_records_resources.services.files.
 
 ## Example
 

From ad71f3baa593db83d919fbbec61a5addf5a6ac5a Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Sat, 23 Mar 2024 09:02:05 +0100
Subject: [PATCH 4/8] Update framework-0072-pluggable-transfers.md

---
 rfcs/framework-0072-pluggable-transfers.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index 1d7f526..e8787ec 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -47,9 +47,9 @@ Proposal:
 Approaches to fix this (to be decided which one to take):
 
 1. Create a `status=ModelField()` on the FileRecord class, with values `pending`, `completed`, `aborted`, `failed` .
-   The value of the field would be stored inside the database in the file record
-2. Create a `status=StatusSystemField()` that would fetch/store the status from the file's metadata (.model.json)
-3. Create a `status=StatusSystemField()` that would get the file's transfer and delegate the status to it.
+   The value of the field would be stored inside the database in the file record as a new database column.
+2. Create a `status=StatusSystemField()` that would take the status from file's metadata (.model.json)
+3. Create a `status=StatusSystemField()` that would get the file transfer and delegate the status query to it.
 
 The first two options add a value stored to the database but are cleaner, #3 would create a reference
 from invenio_records_resources.records to invenio_records_resources.services.files.

From 2c94f24e8ec0e46d35fc145283774cacaeb1f9b5 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Sat, 23 Mar 2024 10:53:28 +0100
Subject: [PATCH 5/8] Added more info to the RFC

---
 rfcs/framework-0072-pluggable-transfers.md | 54 +++++++++---------------
 1 file changed, 17 insertions(+), 37 deletions(-)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index e8787ec..7b953dc 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -6,35 +6,26 @@
 
 ## Summary
 
-Currently the transfer types are fixed to Local, Fetch and Remote and can not be extended.
-Users might benefit from having this system extensible so that they can write/use
-their own transfer types.
+Currently the transfer types are fixed to Local, Fetch and Remote and can not be extended. Users might benefit from having this system extensible so that they can write/use their own transfer types.
 
 ## Motivation
 
-We would like to implement at least one additional transfer type - Multipart. This
-transfer type will allow users to upload big files split into multiple parts and
-will be contained as a core functionality.
+We would like to implement at least one additional transfer type - Multipart. This transfer type will allow users to upload big files split into multiple parts and will be included as core functionality.
 
-Another example might be a S3Copy transfer type, which would copy the file between
-two S3 buckets directly on S3 cluster, without passing through invenio server.
+Another example might be an S3Copy transfer type, which would copy the file between two S3 buckets directly on the S3 cluster, without passing through the Invenio server.
 
 ## Detailed design
 
-All the current functionality is in `transfer.py` and related modules. The `TransferType`
-is a non-extendable enum, transfers extensions of `BaseTransfer` and not pluggable
-(if/then/else inside the `Transfer` class).
+All the current functionality is in `transfer.py` and related modules. The `TransferType` is a non-extendable enum, transfers are extensions of `BaseTransfer` and are not pluggable (they are selected by if/then/else inside the `Transfer` class).
 
 Proposal:
 
-* Change the `TransferType` from enum to a normal class. Create constants for the built-in
-  types - turn `TransferType.LOCAL` to `TRANSFER_TYPE_LOCAL`
-* Create a `TransferRegistry` class modelled after ServiceRegistry. Add `current_transfer_registry`
-  proxy.
-* Modify the `Transfer` class to use the registry
-* Modify the sources to use the updated class and constants
+- Change the `TransferType` from enum to a normal class. Create constants for the built-in types - turn `TransferType.LOCAL` into `TRANSFER_TYPE_LOCAL`
+- Create a `TransferRegistry` class modelled after ServiceRegistry. Add a `current_transfer_registry` proxy.
+- Modify the `Transfer` class to use the registry
+- Modify the sources to use the updated class and constants
 
-* Fix the `dump_status` method on schema. The current implementation says:
+- Fix the `dump_status` method on schema. The current implementation says:
 ```
     def dump_status(self, obj):
         """Dump file status."""
         # due to time constraints the status check is done here
         # however, ideally this class should not need knowledge of
         # the TransferType class, it should be encapsulated at File
         # wrapper class or lower.
 ```
 
 Approaches to fix this (to be decided which one to take):
 
 1. Create a `status=ModelField()` on the FileRecord class, with values `pending`, `completed`, `aborted`, `failed` .
    The value of the field would be stored inside the database in the file record as a new database column.
-2. Create a `status=StatusSystemField()` that would take the status from file's metadata (.model.json)
+2. Create a `status=DictField("status")` that would take the status from the file's metadata (.model.json)
 3. Create a `status=StatusSystemField()` that would get the file transfer and delegate the status query to it.
 
-The first two options add a value stored to the database but are cleaner, #3 would create a reference
-from invenio_records_resources.records to invenio_records_resources.services.files.
+The first two options add a value stored in the database but are cleaner; #3 would create a reference from `invenio_records_resources.records` to `invenio_records_resources.services.files`.
 
 ## Example
 
-> Show a concrete example of what the RFC implies. This will make the consequences of adopting the RFC much clearer.
-
-> As with other sections, use it if it makes sense for your RFC.
+Another example might be a specialized transfer type for handling sensitive data in the Crypt4GH file format before it is stored in the repository (handling encryption keys etc.).
 
 ## How we teach this
 
-> What names and terminology work best for these concepts and why? How is this idea best presented? As a continuation of existing Invenio patterns, or as a wholly new one?
-
-> Would the acceptance of this proposal mean the Invenio documentation must be re-organized or altered? Does it change how Invenio is taught to new users at any level?
-
-> How should this feature be introduced and taught to existing Invenio users?
+This is rather an internal feature (at the same level as services, service configs etc.). We might add a section to [developer topics](https://inveniordm.docs.cern.ch/develop/topics/) or just keep the API docs.
 
 ## Drawbacks
 
-> Why should we *not* do this? Please consider the impact on teaching Invenio, on the integration of this feature with other existing and planned features, on the impact of the API churn on existing apps, etc.
-
-> There are tradeoffs to choosing any path, please attempt to identify them here.
+Slightly more complicated code.
 
 ## Alternatives
 
-> What other designs have been considered? What is the impact of not doing this?
-
-> This section could also include prior art, that is, how other frameworks in the same domain have solved this problem.
+The only alternative is having separate API endpoints for handling the use cases above. This RFC provides a cleaner solution.
 
 ## Unresolved questions
 
-> Optional, but suggested for first drafts. What parts of the design are still TBD? Use it as a todo list for the RFC.
+- [ ] Decide how the `status` field should be implemented
 
 ## Resources/Timeline
 
-> Which resources do you have available to implement this RFC and what is the overall timeline?
+Both CESNET and Munster need multipart upload that depends on this feature in a short time (May 2024). Both CESNET and Munster are willing to put their resources into the implementation and testing.

From 363cfab99403fa044d8a9b0bf8c38dce8a7c9c52 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Tue, 2 Apr 2024 14:17:48 +0200
Subject: [PATCH 6/8] Alex's input

---
 rfcs/framework-0072-pluggable-transfers.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index 7b953dc..a04de36 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -64,6 +64,13 @@ The only alternative is having separate API endpoints for handling the use cases
 
 - [ ] Decide how the `status` field should be implemented
 
+**Alex:**
+For the decision on the status field, I have a slight preference for #3 (i.e. computing/querying dynamically instead of storing), since I could imagine the BaseTransfer sub-classes implementing the logic... Though I see that this introduces a dependency in the wrong direction... Would it make sense to maybe remove status from the record class, and just compute/dump it on the service marshmallow schema? Maybe a 3rd person without all the past implementations baggage could comment on a cleaner solution 🙂
+
+Then another less critical concern:
+
+Maybe there should be a way to configure different transfer types per file service... I could imagine e.g. not wanting to have remote files for community logos. But TBH, given how hidden the feature is, we could roll with having everything available by default across all services, and then later on adding the config if we see weird cases emerging
+
 ## Resources/Timeline
 
 Both CESNET and Munster need multipart upload that depends on this feature in a short time (May 2024). Both CESNET and Munster are willing to put their resources into the implementation and testing.
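To make the registry and constants proposed in the preceding patches more concrete, the following is a minimal sketch of how the pluggable mechanism could look. It is an illustration only: apart from the names taken from the RFC text (`TRANSFER_TYPE_LOCAL`, `TransferRegistry`, `current_transfer_registry`, `BaseTransfer`), the attribute and method names, signatures and the single-letter values other than "L" and "M" are assumptions, not the actual invenio-records-resources API.

```python
# Sketch of the pluggable transfer registry proposed in patches 3-6.
# Only TRANSFER_TYPE_LOCAL, TransferRegistry, current_transfer_registry and
# BaseTransfer are named in the RFC; method names, signatures and the
# "F"/"R" letters below are illustrative assumptions.

TRANSFER_TYPE_LOCAL = "L"        # "L" and "M" are mentioned in the RFC text
TRANSFER_TYPE_MULTIPART = "M"
TRANSFER_TYPE_FETCH = "F"        # assumed letters for the remaining built-in types
TRANSFER_TYPE_REMOTE = "R"


class TransferRegistry:
    """Keeps BaseTransfer subclasses keyed by transfer type (modelled after the ServiceRegistry)."""

    def __init__(self):
        self._transfers = {}

    def register(self, transfer_cls):
        """Register a transfer class under the type it declares (assumed `transfer_type` attribute)."""
        transfer_type = transfer_cls.transfer_type
        if transfer_type in self._transfers:
            raise RuntimeError(f"Transfer type '{transfer_type}' is already registered.")
        self._transfers[transfer_type] = transfer_cls

    def get_transfer(self, transfer_type, **kwargs):
        """Instantiate the transfer registered for the given type.

        This lookup replaces the current if/then/else inside the Transfer class.
        """
        return self._transfers[transfer_type](**kwargs)


# A third-party package could then plug in its own type without touching core code,
# e.g. (hypothetical class): current_transfer_registry.register(S3CopyTransfer)
```

With such a registry, the `Transfer` class only needs to look up `current_transfer_registry` instead of hard-coding Local/Fetch/Remote, which is what would make Multipart or S3Copy implementable outside of the core if/then/else branch.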
From b73df70136cf579f173b85ce2d58f1d7fb3c4fc1 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Mon, 8 Apr 2024 08:39:22 +0200
Subject: [PATCH 7/8] Updated rfc with questions

---
 rfcs/framework-0072-pluggable-transfers.md | 76 ++++++++++++++++++++--
 1 file changed, 72 insertions(+), 4 deletions(-)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index a04de36..5929de2 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -64,13 +64,81 @@ The only alternative is having separate API endpoints for handling the use cases
 
 - [ ] Decide how the `status` field should be implemented
 
-**Alex:**
+**Alex's suggestion**
 For the decision on the status field, I have a slight preference for #3 (i.e. computing/querying dynamically instead of storing), since I could imagine the BaseTransfer sub-classes implementing the logic... Though I see that this introduces a dependency in the wrong direction... Would it make sense to maybe remove status from the record class, and just compute/dump it on the service marshmallow schema? Maybe a 3rd person without all the past implementations baggage could comment on a cleaner solution 🙂
 
-Then another less critical concern:
+- [ ] Alex: Then another less critical concern:
 
 Maybe there should be a way to configure different transfer types per file service... I could imagine e.g. not wanting to have remote files for community logos. But TBH, given how hidden the feature is, we could roll with having everything available by default across all services, and then later on adding the config if we see weird cases emerging
+
+- [ ] Clarify the "storage_class" vs "transfer_type"
+
+It seems that inside invenio_files the storage_class was meant to specify
+the details of the storage backend (as the tests suggest, at least) - for example
+'A' as 'Archive'.
+
+In invenio_records_resources, the storage_class on the initial schema means the type of
+transfer that will be used (Local, Fetch, Remote). This is independent of the storage
+backend and is a bit confusing.
+
+- [ ] FileSchema.uri vs. links section
+
+For the initialization of files, 'uri' can be used to specify the location of the file
+for REMOTE/FETCH transfers. Later on, clients can use this URI for these transfers to get
+direct access to the referenced data.
+
+During "dumping" the file record, a links section with a `content` uri is created.
+This is a bit confusing as clients do not know which URL should be used.
+
+Proposal: keep the uri only inside the initial file record. Do not serialize it
+afterwards. Clients should use the links section for all the URIs. The content
+url can return an HTTP 302 redirect to the actual file location. This way counters
+will work as well.
+
+- [ ] FETCH transfer type and private URIs
+
+The FETCH transfer type is used to fetch the file from a remote location. The uri
+of the external location is visible to anyone who has read access to the record,
+not just the uploader. As the uri might contain credentials (e.g. S3 signed URL),
+this might be a security issue.
+
+Proposal: do not serialize the uri of the FETCH transfer type.
+The fetch_file task would have to be changed not to use the service but to fetch the uri directly.
+
+- [ ] IfFileIsLocal generators
+
+The permissions in https://github.com/inveniosoftware/invenio-records-resources/blob/master/invenio_records_resources/services/files/generators.py#L27
+take into account only the files with storage class (== transfer type) "L". This means that
+the permissions will fail for "M" (Multipart transfer) files. Is this intended?
+
+The same problem is in invenio-rdm-records/services/generators.py
+
+Proposal: either change the meaning of "Locality" and keep the permissions as they are
+or change the permissions to take into account all the files. Example from RDM that I do
+not understand (can_draft_set_content_files - why is there the can_review permission and
+not the record owner?):
+
+```python
+    can_draft_create_files = can_review
+    can_draft_set_content_files = [
+        # review is the same as create_files
+        IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
+    ]
+    can_draft_get_content_files = [
+        # preview is same as read_files
+        IfFileIsLocal(then_=can_draft_read_files, else_=[SystemProcess()])
+    ]
+    can_draft_commit_files = [
+        # review is the same as create_files
+        IfFileIsLocal(then_=can_review, else_=[SystemProcess()])
+    ]
+```
 
 ## Resources/Timeline
 
-Both CESNET and Munster need multipart upload that depends on this feature in a short time (May 2024). Both CESNET and Munster are willing to put their resources into the implementation and testing.
+Both CESNET and Munster need multipart upload that depends on this feature in a short time (May 2024).
+Both CESNET and Munster are willing to put their resources into the implementation and testing.

From b67d831030137271dcc1214101ab48afe9de7cd6 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Mon, 8 Apr 2024 10:24:46 +0200
Subject: [PATCH 8/8] More questions

---
 rfcs/framework-0072-pluggable-transfers.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/rfcs/framework-0072-pluggable-transfers.md b/rfcs/framework-0072-pluggable-transfers.md
index 5929de2..bbeb143 100644
--- a/rfcs/framework-0072-pluggable-transfers.md
+++ b/rfcs/framework-0072-pluggable-transfers.md
@@ -138,6 +138,24 @@ not the record owner?):
     ]
 ```
 
+- [ ] MultipartStorageExt
+
+In order to make direct multipart upload to remote storage work, we need a PyFS storage
+to implement multipart-aware methods. These are currently summarized in the MultipartStorageExt
+interface that lives in invenio-records-resources -
+[MultipartStorageExt class in pull request](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR14)
+
+
+We cannot directly implement this interface in storage backends (for example invenio-s3) as
+it would introduce a dependency in the wrong direction.
+
+Proposal: either move the MultipartStorageExt to invenio-files-rest/storage/pyfs or do not check
+for the interface but just check for the implemented methods (currently implemented this way in the POC
+implementation - see for example
+[check for multipart_set_content method](https://github.com/mesemus/invenio-records-resources/pull/3/files#diff-ab7d7ef448eaae0f109622e167d9168d7f0ad56dbba383d6d681b9cc17b62bcfR177) )
+
+
+
 ## Resources/Timeline
 
 Both CESNET and Munster need multipart upload that depends on this feature in a short time (May 2024).
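To illustrate the second option in the MultipartStorageExt proposal of patch 8 (checking for the implemented methods instead of requiring the interface), here is a rough sketch. Only `multipart_set_content` is named in the RFC; the argument list and the helper names below are hypothetical placeholders, not the API of invenio-files-rest or the proof-of-concept pull request.

```python
# Rough sketch of the duck-typing check suggested in patch 8: instead of an
# isinstance() test against a MultipartStorageExt interface (which would point
# the dependency in the wrong direction), the files service only verifies that
# the storage backend implements the multipart method it is about to call.
# multipart_set_content is the method named in the RFC; its signature and the
# error handling are assumptions.


def supports_multipart_content(storage) -> bool:
    """Return True if the PyFS storage backend exposes a callable multipart_set_content."""
    return callable(getattr(storage, "multipart_set_content", None))


def set_part_content(storage, part_number, stream, content_length):
    """Delegate one part upload to the backend, failing clearly when it cannot do multipart."""
    if not supports_multipart_content(storage):
        raise RuntimeError(
            f"{type(storage).__name__} does not implement multipart uploads "
            "(multipart_set_content is missing)."
        )
    return storage.multipart_set_content(part_number, stream, content_length)
```

The same pattern would apply to any other multipart-aware method the POC defines: the check stays in invenio-records-resources, while storage backends such as invenio-s3 only need to provide the methods, without importing an interface from a package that depends on them.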