File upload: Add task for file format detection on file commit #553

Open
max-moser wants to merge 1 commit into master from mm/file-format-detection

max-moser
Contributor

Overview

Right now, InvenioRDM offers information about the MIME types of files (that are associated with records), but the logic that derives this information is quite simple.
This PR intends to improve the file type detection capabilities of InvenioRDM by utilizing the signature-based file format identification tool siegfried (which is permissively licensed under Apache-2.0).
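
To illustrate what "signature-based" means in contrast to extension-based guessing, here is a minimal sketch that checks a few well-known magic numbers. It is not how siegfried works internally (siegfried matches against the much richer PRONOM signature set); the `SIGNATURES` table and the `sniff_mime_type` helper are hypothetical names used only for illustration.

```python
# Illustrative only: a tiny magic-number lookup, not siegfried's algorithm.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff_mime_type(data):
    """Return a MIME type based on the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


# The file name plays no role here; only the content is inspected.
header = b"%PDF-1.7\n..."
print(sniff_mime_type(header))
```

Because only the leading bytes are inspected, renaming a file does not change the result.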

Some more context

Every record file in InvenioRDM has some high-level information stored in its ObjectVersion and some "physical" file-related information in the associated FileInstance (both defined in Invenio-Files-REST).
The former has a field for the MIME type (object_version._mimetype), but there doesn't seem to be a code path which populates this field to any non-null value.
Instead, the object_version.mimetype property usually falls back to guessing the MIME type through the standard library function mimetypes.guess_type().
This function bases its guess purely on the file extension, which is fine as a fallback value but isn't ideal as the primary source.
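
The limitation is easy to reproduce with the standard library alone. The file name below is a made-up example; the file does not need to exist, because `guess_type()` never opens it.

```python
import mimetypes

# guess_type() only looks at the name, so a PDF that was renamed
# to ".jpg" would still be reported as a JPEG image.
mime, encoding = mimetypes.guess_type("thesis.jpg")
print(mime)  # image/jpeg
```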

Outline of this PR

This PR hooks into the file upload process via the files service.
Whenever a file is "committed", a background task is scheduled which calls the external sf binary on the uploaded file and interprets its output.
The identification runs in the background because it is a potentially long-running operation (e.g. for large files).

If a MIME type is reported by siegfried, it will be used to populate the object_version._mimetype field.
Additionally, the PRONOM identifier is stored as an ObjectVersionTag.
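
As a rough sketch of the identification step (not the code in this PR), the following shows one way to call sf and interpret its output. It assumes that sf is invoked with its `-json` flag and that the report has the shape `{"files": [{"matches": [{"ns", "id", "mime"}]}]}`, with `"UNKNOWN"` as the identifier for unmatched files. The helper names `parse_sf_output` and `identify_file`, as well as the timeout value, are hypothetical.

```python
import json
import subprocess


def parse_sf_output(raw_json):
    """Extract (mimetype, pronom_id) from siegfried's JSON report.

    Assumes the layout {"files": [{"matches": [{"ns", "id", "mime"}]}]}.
    """
    mimetype, pronom_id = None, None
    report = json.loads(raw_json)
    files = report.get("files") or []
    matches = (files[0].get("matches") or []) if files else []
    for match in matches:
        if match.get("ns") != "pronom":
            continue
        # Assumption: "UNKNOWN" is reported when no signature matched.
        if match.get("id") and match["id"] != "UNKNOWN":
            pronom_id = match["id"]
        # An empty "mime" value means no MIME type is known for the format.
        mimetype = match.get("mime") or None
        break
    return mimetype, pronom_id


def identify_file(path, sf_bin="sf", timeout=300):
    """Run sf on a local file and return (mimetype, pronom_id)."""
    try:
        result = subprocess.run(
            [sf_bin, "-json", path],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout,
        )
        return parse_sf_output(result.stdout)
    except (OSError, subprocess.SubprocessError, ValueError):
        # Missing binary, non-zero exit code, timeout or unparsable output:
        # report nothing so that the existing fallback stays in effect.
        return None, None
```

Returning `(None, None)` on any failure keeps the extension-based fallback in place whenever siegfried is unavailable or cannot identify the file.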

Alternatives considered

Feeding the file stream into siegfried during upload

Instead of identifying the file format by calling sf on the file commit, it could potentially be done during the file upload by feeding a duplicate of the upload stream into siegfried on the fly.
This could eliminate the waiting time until the MIME type is set for larger files, but would require a deeper integration of the external tool into core functionalities (probably in Invenio-Files-REST).
Given that this functionality is more of a nice-to-have, I'm not sure if that trade-off would be worth it.
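
For reference, duplicating a stream while it is being consumed could look roughly like the sketch below. `tee_stream` is a hypothetical helper, and the in-memory buffers stand in for the real storage backend and for the input of an identification process (for instance the standard input of a subprocess); none of this reflects existing Invenio-Files-REST code.

```python
import io


def tee_stream(source, sinks, chunk_size=64 * 1024):
    """Copy `source` chunk by chunk into every sink; return bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        for sink in sinks:
            sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the upload, the storage and the identifier input.
upload = io.BytesIO(b"%PDF-1.7 example payload")
storage, identifier_input = io.BytesIO(), io.BytesIO()
copied = tee_stream(upload, [storage, identifier_input], chunk_size=8)
```

The sketch also hints at the cost of this alternative: the upload path would have to know about the identification tool and cope with a slow or failing sink.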

Let external applications handle this information

There are external tools such as Archivematica and FITS which (among others) specialize in detecting file formats.
Overviews of the file format landscape in an InvenioRDM instance could be built externally with such tools.
However, InvenioRDM has some built-in capabilities for this already, and they are actively being used (even if only for display purposes in the REST API).
Thus, I think enhancing these existing capabilities rather than building external pipelines is worthwhile.

To do

  • check how non-local files could be handled
  • some manual testing
  • write test cases
  • make the integration configurable (e.g. disabling the step, setting a custom path for the sf binary, etc.)
  • polish the code
  • address the comments and remarks

Test results

In a fresh v12 my-site, I created a new draft and uploaded a PDF file (renamed to have a JPG extension).
Siegfried correctly identifies the file format as PDF:

[image: screenshot of the result]

In our own v11 deployment (without this PR), the same file still has a reported MIME type of "image/jpeg", which is incorrect.


mimetype, pronom_id = None, None
try:
    sf_bin = "sf"
Contributor Author

If sf is installed with go install, the binary ends up in $GOPATH/bin, so that directory needs to be on the $PATH for this lookup to work.
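
One possible way to make the lookup more forgiving, sketched under assumptions: prefer an explicitly configured path, then the PATH, then the locations where `go install` places binaries (`$GOBIN`, or `bin` under the first `$GOPATH` entry, which defaults to `~/go`). The function name `find_sf_binary` and its parameters are hypothetical and not part of this PR.

```python
import os
import shutil
from pathlib import Path


def find_sf_binary(configured=None, environ=None):
    """Return a path to the sf executable, or None if it cannot be found."""
    environ = os.environ if environ is None else environ

    # 1. A configured path or command name wins; otherwise look for "sf".
    found = shutil.which(configured or "sf", path=environ.get("PATH", ""))
    if found:
        return found

    # 2. Fall back to the directories used by "go install".
    candidates = []
    if environ.get("GOBIN"):
        candidates.append(Path(environ["GOBIN"]) / "sf")
    gopath = environ.get("GOPATH") or str(Path.home() / "go")
    # GOPATH may list several directories; go install uses the first one.
    first_entry = gopath.split(os.pathsep)[0]
    candidates.append(Path(first_entry) / "bin" / "sf")

    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

Calling `find_sf_binary()` without arguments uses the current process environment; a `None` result could be used to skip the identification step entirely.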

@max-moser max-moser force-pushed the mm/file-format-detection branch from ce9ce98 to f3badce Compare January 11, 2024 15:25
if mimetype is not None:
    ov.mimetype = mimetype
if pronom_id is not None:
    ObjectVersionTag.create_or_update(ov, "PUID", pronom_id)
Contributor Author

Is this an appropriate place to store the PRONOM identifier?
Or should it be stored somewhere else, e.g. in the file "metadata" which is used to report the dimensions of image files and such?
