Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to show file upload related backend errors to the user during submission. #206

Open
hbayindir opened this issue Nov 13, 2024 · 8 comments

Comments

@hbayindir
Copy link

hbayindir commented Nov 13, 2024

Is your feature request related to a problem? Please describe.

We have implemented a file validation procedure for uploaded files in our fork of invenio-files-rest package to enhance user security and improve sustainability of the uploaded data in our local instance, Aperta. In short, we make sure that the files are in the formats we like, there are no nasty surprises inside, and files are intact and openable.

However, when a user encounters an error, or rejected by one of these filters, we can't notify them about what happened.

This feature is a carryover from our old Aperta installation, which was a fork of Zenodo, and we were able to show these notifications on that version.

Describe the solution you'd like

We want to be able to notify the user about problems in their file(s), and what failed. So they can fix the problem and send the correct version/alternative of the file instead.

Describe alternatives you've considered

Since there's no way to communicate these errors and float them towards the UI and the user, we were unable to find any other alternatives.

Additional context

The things we do are as follows:

  1. The process starts with file extension verification, which is performed through magic byte inspection, utilizing the open-source filetype library to infer file types.
  2. We also perform health checks on tar (tar, tar.gz, tar.bz2, tar.xz) and zip compressed files to detect any issues during extraction. Rar and 7z formats are not supported; rar is a proprietary format, and reliable health checks for both formats would require third-party libraries, which we aim to avoid.
  3. After health check, we verify the total uncompressed size not exceed 10 GiB. This limit is in place to protect against unusually large zip bombs, further ensuring end-user security. Password protected compressed files rejected as well due security and policy reasons.

If files pass all these checks, they're accepted as valid download, otherwise we reject them with an error. For a detailed explanation of what we do, please see the flowchart added below.

The file validation checks are executed once the file has fully uploaded, allowing us to perform integrity checks on compressed formats. If any validation fails, the file is immediately deleted from storage, and a FileCheckError —a custom exception inheriting from StorageError— is raised to indicate the failure.

This is where the problem surfaces. Unfortunately there is no support in InvenioRDM for showing this kind custom upload errors in UI. To be able to inform the user about the problem in a detailed manner, we want these errors to be visible in the UI, and want to introduce this feature into the file uploader(s) InvenioRDM has. There're talks about a new file uploader, and if that one will be the default, we'd love to add the support for it. Otherwise we want to introduce the feature to the old one and hopefully backport it to the newer one in due course.

Any help and guidance is greatly appreciated.

Flowchart:

flowchart TD
        flow_start@{ shape: circle, label: "Start" }
        flow_end@{ shape: circle, label: "End" }
        flow_end_end@{ shape: circle, label: "End" }
        upload@{ shape: lean-r, label: "Upload file" }
        is_not_allowed@{ shape: diamond, label: "if file_ext in [\"7z\", \"rar\"]" }
        file_ext@{ shape: rectangle, label: "file_ext = filename.split('.')[-1]" }
        file_check_error@{ shape: rectangle, label: "raise FileCheckError" }
        file_ext_guess@{ shape: rectangle, label: "guessed_file_ext = file_ext_guess(fp)" }
        is_file_ext_matched@{ shape: diamond, label: "if guessed_file_ext == file_ext" }
        is_zip@{ shape: diamond, label: "file_ext == \"zip\"" }
        is_tar@{ shape: diamond, label: "if file_ext in [\"tar\", \"tar.gz\", \"tar.bz2\", \"tar.xz\"]" }
        zip_integrity_test@{ shape: rectangle, label: "test_zip_file(fp)" }
        tar_integrity_test@{ shape: rectangle, label: "test_tar_file(fp)" }
        get_zip_uncompressed_size@{ shape: rectangle, label: "uncompressed_size = get_zip_uncompressed_size(fp)" }
        get_tar_uncompressed_size@{ shape: rectangle, label: "uncompressed_size = get_tar_uncompressed_size(fp)" }
        is_large_zip@{ shape: diamond, label: "uncompressed_size > 10 GB" }
        is_large_tar@{ shape: diamond, label: "uncompressed_size > 10 GB" }

        flow_start --> upload
        upload --> file_ext
        subgraph check_file["check_file(fp, filename)"]
        file_ext --> is_not_allowed
        is_not_allowed --True--> file_check_error
        is_not_allowed --False--> file_ext_guess
        file_ext_guess --> is_file_ext_matched
        is_file_ext_matched --True-->is_zip
        is_file_ext_matched --False-->file_check_error
        is_zip --True--> zip_integrity_test
        is_zip --False-->is_tar
        is_tar--True--> tar_integrity_test
        is_tar--False--> flow_end
        subgraph check_zip_file["check_zip_file(fp)"]
        zip_integrity_test --> get_zip_uncompressed_size
        get_zip_uncompressed_size --> is_large_zip
        end
        is_large_zip --True--> file_check_error
        is_large_zip --False--> flow_end
        subgraph check_tar_file["check_tar_file(fp)"]
        tar_integrity_test --> get_tar_uncompressed_size
        get_tar_uncompressed_size --> is_large_tar
        end
        is_large_tar --True--> file_check_error
        is_large_tar --False--> flow_end
        end
        check_file --> flow_end_end
Loading

P.S.: You can ping me and @geekdinazor about this issue. As I aforementioned, we want to implement and contribute this, if accepted. We may need some help from you.

@ntarocco
Copy link

Unfortunately there is no support in InvenioRDM for showing this kind custom upload errors in UI

What is the exact issue here?
Are you running this validation asynchronously? Are you expecting to respond with an error to the upload request?

@hbayindir
Copy link
Author

Sorry for the late reply. I didn't receive the notification of your reply somehow.

In short, yes.

We run these checks after the upload is completed in the frontend, and the checks are done at the backend. If these checks fail somehow (file type, size, etc.), we can't tell this to the user, since there's no way to communicate to the user interface with an error from backend to frontend.

In our previous installation (which was a Zenodo clone), it was possible to do that, and we showed the errors to the user with a message, so the user can just re-upload a replacement file for the problematic files. In current form, these are silent errors, which we can't communicate the user in any way.

So, we want to make this possible, implement the capability while satisfying InvenioRDM design and code quality standards and contribute back the capability to core.

@ntarocco
Copy link

It is still not 100% clear to me how you would like to solve this. Let's take an example: I create a new upload, add a file, and a few seconds later, publish. The backend files checks happen afterwards, asynchronously.
How would like to notify the user? Maybe via e-mail? Or any other way?

BTW, I am moving this to the Product's discussions, it is more relevant there as a global feature.

@ntarocco ntarocco transferred this issue from inveniosoftware/invenio-app-rdm Nov 25, 2024
@hbayindir
Copy link
Author

In our current (Zenodo based) implementation, the checks run as soon as a file is uploaded, and we don't allow a user to hit Publish until all checks are passed.

If check is failed, a red error message box appears on top of the page, saying that "File X is not suitable for upload", and the file is deleted from the list and the backend immediately.

Ideally, we'd like to replicate the exact functionality. If you want to look at how its done, I can ask for the code from the guy who implemented it.

@geekdinazor
Copy link

geekdinazor commented Nov 26, 2024

File check procedure is not working asynchronously, After file uploaded completely, as soon as we fetch the filename and file content from storage (local or s3) and check it via our check_file function on save function (https://github.com/inveniosoftware/invenio-files-rest/blob/v2.2.1/invenio_files_rest/storage/pyfs.py#L106) in invenio-files-rest. If we raise FileCheckError, in this step, file upload api (e.g.: https://localhost/api/records/85/draft/files/broken.tar.gz/content) returns something like that:

{"status": 400, "message": "File type not matched."}

@hbayindir
Copy link
Author

File check procedure is not working asynchronously...

By async, I meant that we don't block the user's ability to continue working on the form, but yes, it's a blocking action at the end of day, since user can't publish anything until all files are cleared for saving.

@slint
Copy link
Member

slint commented Nov 28, 2024

Sharing here overview of the discussion we had in the 26/11 InvenioRDM telecon with @hbayindir and @Samk13 (cc @ntarocco):

  • If the checks are lightweight/fast enough to run after the file finishes uploading (in the three-legged file upload process in InvenioRDM that would be the commit request), a viable option would be to hook into this flow using a service component.
  • Major blockers/questions at the moment are mainly around how the UI file uploader component handles failures and displayed errors messages
    • E.g. if multiple file uploads are initiated and some of them fail, error messages for each of them should be visible in the UI
    • Given that there are some ongoing efforts on a new Uppy-based file uploads UI by @mirekys, it might be worth checking in, since there's more advanced state management being tackled there.
  • Ideally, the code for the file checks should be general enough to be reused.
    • @max-moser has done some similar work on "offline" file checks to evaluate archival/preservation, which could be reused here.
  • There is a more generalized version for automated curation checks in design phase, described on a high-level here, but it might be overkill for this use-case right now.

@hbayindir
Copy link
Author

Thanks a lot for the summary. Keeping it here will be very useful going forward. I'll also steal some links for my internal version which I need for internal dissemination. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants