Add originalFileName field to json #2734
Comments
Odum is running into this as well. I see our Dataverse instance storing the original filename as version 1; the ingested/renamed as version 2:
but tied to the datasetversion_id while the native dataset metadata endpoint returns datafile_id. My script for Thu-Mai returns all files in a dataset in original format but with modified file extensions. The automation of an original-format bundle download per dataset would save her a lot of time.
@donsizemore it's interesting that the original filename is stored in the database at all. I was under the impression that the original filename is never stored. It should be.
@pdurbin I suspect that Don misidentified "ERA21980b.DAT" as an original data file such as a Stata .dta file; it is not a statistical data file. As you said, the original filename is not stored in the database, by design.
So, is this still something we want to add to the JSON metadata we output for Datafiles? The problem makes perfect sense as described by the original requester. But then it sounds like they worked around it by inferring the full original filename from the information already in the JSON. I.e., if filename = "myfile.tab" and originalFormatLabel = "Stata Binary", you can (unambiguously) assume that the original file in the bundle will have the name "myfile.dta".
This is how the filename is generated in the application: once the file is ingested, the stored file name has the ".tab" extension. For the stored original, that extension is modified on the fly based on the original type saved in the database.
If we were to add this extra field, "originalFileName", to the JSON output, it would take one extra line in JsonPrinter.java:
.add("originalFileName", FileUtil.replaceExtension(fileName, FileUtil.generateOriginalExtension(df.getOriginalFileFormat())));
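The inference described in this comment can be sketched as a small standalone class. This is only an illustration, not the Dataverse implementation: the class name, the method bodies, and the format-to-extension table are assumptions made for the example, and the real FileUtil methods may behave differently.

```java
import java.util.Map;

// Hypothetical sketch: derive the original file name from the ingested
// ".tab" name plus the original format recorded for the file.
final class OriginalNameSketch {

    // Illustrative mapping only (assumed for this example); the real
    // application keeps its own table of formats and extensions.
    private static final Map<String, String> EXTENSION_BY_FORMAT = Map.of(
            "application/x-stata", "dta",
            "application/x-spss-sav", "sav",
            "application/x-spss-por", "por");

    // Swap whatever follows the last dot for the new extension; a name
    // with no dot simply gets the extension appended.
    static String replaceExtension(String fileName, String newExtension) {
        int dot = fileName.lastIndexOf('.');
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return base + "." + newExtension;
    }

    // Returns the stored name unchanged when the format is unknown,
    // so the caller never receives a guessed extension.
    static String inferOriginalName(String storedName, String originalFileFormat) {
        String extension = EXTENSION_BY_FORMAT.get(originalFileFormat);
        return extension == null ? storedName : replaceExtension(storedName, extension);
    }
}
```

With this sketch, a stored name of "myfile.tab" and an original format of "application/x-stata" yields "myfile.dta", matching the example in the comment. The inference is only as reliable as the mapping table, which is the concern raised later in the thread.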
@landreev cool. Sounds like an easy fix. Thanks.
A comment was just added at #4044 (comment) about this: "we do not store the extension but recreate it by examining the mime type"
Related: Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006
Upon further discussion it was decided that, for preservation purposes, we should retain the original file name and extension as uploaded, rather than rely on the utility that converts content type to extension.
We are not going to populate the new originalFileName field for existing files. For those, we will use the file extension converter described by Leonid above in the JSON printer. That way we preserve the fact that we did not actually save the original file name on upload.
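One reading of this decision is a two-step lookup: use the name saved at upload time when there is one, and fall back to the derived name for older files. The sketch below assumes that reading. The class, method, and parameter names are hypothetical and do not come from the Dataverse code base.

```java
// Hypothetical sketch of the policy described above (an assumed reading).
final class OriginalNameFallbackSketch {

    // storedOriginalName: the name saved at upload time, or null/empty for
    //   files ingested before the new field existed.
    // derivedName: the name rebuilt from the ingested name and the original
    //   format, used only as a fallback.
    static String resolveOriginalName(String storedOriginalName, String derivedName) {
        boolean hasStoredName = storedOriginalName != null && !storedOriginalName.isEmpty();
        return hasStoredName ? storedOriginalName : derivedName;
    }
}
```

Keeping the stored field empty for existing files means a consumer can still tell an uploaded name apart from a reconstructed one, if the application exposes that distinction.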
As per discussion at https://groups.google.com/forum/#!topic/dataverse-community/zsC4yltISS0: when a bundle is retrieved through the Data Access API, the JSON metadata contain two fields relating to the original file from which the .tab format is derived: "originalFileFormat" and "originalFormatLabel". Unless the bundle is unzipped and the package contents extracted, the filename of the original file has to be inferred from these fields. This causes a few issues when we ingest content from Dataverse into Archivematica, because Archivematica needs to know the name of the original file before the unpackaging micro-service runs.