File upload: Add task for file format detection on file commit #553
Overview
Right now, InvenioRDM offers information about the MIME types of files (that are associated with records), but the logic for coming up with this information is quite simple.
This PR intends to improve the file type detection capabilities of InvenioRDM by utilizing the signature-based file format identification tool siegfried (which is permissively licensed under Apache-2.0).
Some more context
Every record file in InvenioRDM has some high-level information stored in its ObjectVersion and some "physical" file-related information in the associated FileInstance (both defined in Invenio-Files-REST).
The former has a field for the MIME type (object_version._mimetype), but there doesn't seem to be a code path that populates this field with a non-null value.
Instead, the object_version.mimetype property usually falls back to guessing the MIME type through the standard library function mimetypes.guess_type().
This function bases its guess purely on the file extension, which is fine as a fallback value but isn't ideal as the primary source.
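To illustrate the limitation, here is a minimal standalone snippet (not code from this PR) showing that mimetypes.guess_type() only looks at the name it is given. The file name is made up for the example; no file needs to exist on disk.

```python
import mimetypes

# Imagine a PDF document that was renamed to end in ".jpg".
# guess_type() never opens the file, it only inspects the name.
guessed_type, encoding = mimetypes.guess_type("thesis.jpg")

print(guessed_type)  # image/jpeg
```

The content of the file plays no role in the result, which is why a signature-based tool is a better primary source.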
Outline of this PR
This PR hooks into the file upload process via the files service.
Whenever a file is "committed", a background task is scheduled which calls the external sf binary on the uploaded file and interprets its output.
This is done in the background because the file format identification step is a potentially long-running operation (e.g. for large files).
If a MIME type is reported by siegfried, it is used to populate the object_version._mimetype field.
Additionally, the PRONOM identifier is stored as an ObjectVersionTag.
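The "call sf and interpret its output" step could look roughly like the sketch below. It is not the code from this PR: the function names, the sample output, and the handling of unknown formats are assumptions made for illustration. It assumes that sf is on the PATH and that its JSON output (requested via the -json flag) contains a "files" list whose entries carry "matches" with "ns", "id" and "mime" keys. Writing the results to object_version._mimetype and to an ObjectVersionTag is left out.

```python
import json
import subprocess
from typing import Optional, Tuple

# Hand-written, abbreviated sample of siegfried's JSON output (assumption:
# real output contains more keys, which this sketch simply ignores).
SAMPLE_OUTPUT = json.dumps({
    "files": [{
        "filename": "thesis.jpg",
        "matches": [{
            "ns": "pronom",
            "id": "fmt/276",
            "mime": "application/pdf",
        }],
    }]
})


def parse_siegfried_json(output: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (mime_type, pronom_id) for the first file in the report."""
    report = json.loads(output)
    files = report.get("files", [])
    if not files:
        return None, None

    for match in files[0].get("matches", []):
        if match.get("ns") != "pronom":
            continue
        pronom_id = match.get("id")
        # Assumption: unidentified files are reported with the id "UNKNOWN".
        if not pronom_id or pronom_id == "UNKNOWN":
            return None, None
        # An empty "mime" value is treated as "no MIME type reported".
        return (match.get("mime") or None), pronom_id

    return None, None


def detect_format(path: str, sf_binary: str = "sf") -> Tuple[Optional[str], Optional[str]]:
    """Run the external sf binary on a file and interpret its output."""
    result = subprocess.run(
        [sf_binary, "-json", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sf exits with an error
    )
    return parse_siegfried_json(result.stdout)


mime_type, pronom_id = parse_siegfried_json(SAMPLE_OUTPUT)
```

Keeping the parsing separate from the subprocess call makes the interpretation logic testable without the sf binary being installed.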
Alternatives considered
Feeding the file stream into siegfried during upload
Instead of identifying the file format by calling sf on file commit, it could potentially be done during the file upload by feeding a duplicate of the upload stream into siegfried on the fly.
This could eliminate the waiting time until the MIME type is set for larger files, but would require a deeper integration of the external tool into core functionality (probably in Invenio-Files-REST).
Given that this functionality is more of a nice-to-have, I'm not sure if that trade-off would be worth it.
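For reference, the "duplicate of the upload stream" idea boils down to a tee over the incoming chunks. The following is only a sketch of that pattern with a hypothetical helper name (tee_chunks); it does not reflect how Invenio-Files-REST actually writes uploads to storage.

```python
import io
from typing import BinaryIO, Iterator


def tee_chunks(source: BinaryIO, sink: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield chunks read from source while copying each chunk into sink."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)  # copy for the format identification tool
        yield chunk        # original data continues on to storage


# In-memory stand-ins for the upload stream and the identification tool's input.
upload = io.BytesIO(b"%PDF-1.7 example payload")
identifier_input = io.BytesIO()

stored = b"".join(tee_chunks(upload, identifier_input, chunk_size=8))
```

In a real integration, the sink would presumably be the standard input of a running sf process rather than a BytesIO object, which is where the deeper coupling to the storage code comes from.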
Let external applications handle this information
There are external tools such as Archivematica and FITS which specialize in (among other things) detecting file formats.
Overviews of the file format landscape in an InvenioRDM instance could be built externally with such tools.
However, InvenioRDM has some built-in capabilities for this already, and they are actively being used (even if only for display purposes in the REST API).
Thus, I think enhancing these existing capabilities rather than building external pipelines is worthwhile.
To do
sf binary, etc.)
Test results
In a fresh v12 my-site, I created a new draft and uploaded a PDF file (renamed to have a JPG extension).
Siegfried correctly identifies the file format as PDF.
In our own v11 deployment (without this PR), the same file still has a reported MIME type of "image/jpeg", which is incorrect.