Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006

djbrooke · 2019-07-10T19:43:33Z

We discussed #6002 and #2720 in sprint planning today and plan to work on both during a future sprint. I'm closing out both of those in favor of this one.

For reproducibility, we want to maintain the original file format for tabular files because .tab may be referenced in scripts
For preservation, .tsv is preferred as it is more recognized in the community

We should determine a way to meet both needs.

pdurbin · 2019-12-05T03:57:14Z

> For preservation, .tsv is preferred as it is more recognized in the community

Not just for preservation: .tsv also helps with simple things like opening the tab-separated file in Excel. Please see https://twitter.com/Ray_J__/status/1202296388618457089 and the screenshot below:

mheppler · 2020-03-05T20:57:28Z

Review as part of "Add originalFileName field to json" (#2734) when that is picked up in development.

pdurbin · 2022-10-05T01:22:12Z

> For reproducibility, we want to maintain the original file format for tabular files because .tab may be referenced in scripts

I'm not sure I understand what the change would be. We always maintain the original file.

> For preservation, .tsv is preferred as it is more recognized in the community

I re-opened #2720 because I feel strongly that we should use .tsv instead of .tab.

Given the above, is there any reason to keep this issue open?

Vote to close.

jggautier · 2022-10-05T17:39:40Z

Regarding the first comment's point about reproducibility: the Dataverse software always maintains the original file, but the file and information about it are not always easily accessible. I think this has improved since this issue was opened, but I can think of at least one case where it could be handled better:

The last time I ran the Binder integration on a dataset I uploaded, Binder ignored my dataset's .csv files and tried instead to use the .tab files that were created by the Dataverse software's ingest process. But my dataset's Python script was written to do things with the .csv files. It assumed the files would be .csv files.

To work around this, I had to replace the .csv files in my dataset with .tab files and adjust my Python script to do things with .tab files instead. I would imagine that a researcher who wants to make their computational workflow reproducible by uploading it to a Dataverse repository and using something like Binder would not anticipate needing to use .tab files instead of .csv files.
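As an illustration of the workaround described above, here is a minimal sketch of a loader that accepts whichever extension is present. It assumes the tabular files are plain delimited text; the `read_table` helper and the `DELIMITERS` mapping are hypothetical names for this example, not part of Dataverse or Binder.

```python
import csv
from pathlib import Path

# Hypothetical mapping: each extension and the delimiter it is assumed to imply.
# The original .csv is preferred; .tab and .tsv are treated as tab-separated.
DELIMITERS = {".csv": ",", ".tab": "\t", ".tsv": "\t"}


def read_table(directory, stem):
    """Return the rows of the first of stem.csv, stem.tab, stem.tsv found."""
    for ext, delimiter in DELIMITERS.items():
        candidate = Path(directory) / f"{stem}{ext}"
        if candidate.exists():
            with candidate.open(newline="", encoding="utf-8") as handle:
                return list(csv.reader(handle, delimiter=delimiter))
    raise FileNotFoundError(f"no tabular file named {stem!r} in {directory}")
```

A script written this way keeps working whether the environment delivers the deposited .csv or the ingested .tab, though it does not remove the surprise for researchers who never expected the extension to change.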

pdurbin · 2022-10-05T21:27:31Z

@jggautier you're definitely right that there's something to fix for Binder. I just launched my dataset there and what I see is the .tab version, like you're saying.

Binder uses repo2docker under the covers and here's where Dataverse support was added: jupyterhub/repo2docker#739

We could submit a PR to repo2docker to change the behavior so that original files rather than preservation (.tab) files are downloaded from Dataverse. I'd be worried about backward compatibility though.
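To make the idea concrete, here is a rough sketch of how a client could ask for the original file, assuming the Dataverse Data Access API's `format=original` query parameter on the datafile endpoint. The `datafile_url` helper and the server URL in the usage note are placeholders for illustration; this is not a description of what repo2docker actually does.

```python
from urllib.parse import urlencode


def datafile_url(base_url, file_id, original=True):
    """Build a Data Access API download URL for one file.

    With original=True the URL asks for the file as deposited (for example
    .csv); otherwise it is left at the server's default, which for ingested
    tabular files is the .tab version.
    """
    url = f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"
    if original:
        # Assumption: the server honors format=original for tabular files.
        return f"{url}?{urlencode({'format': 'original'})}"
    return url
```

For example, `datafile_url("https://demo.dataverse.org", 42)` yields a URL ending in `/api/access/datafile/42?format=original`. Keeping `original` as a switch is one way to address the backward-compatibility worry, since existing callers could opt out.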

Anyway, we need a specific, actionable plan. I'm happy to talk about this whenever.

pdurbin · 2023-10-08T10:09:42Z

> We could submit a PR to repo2docker to change the behavior so that original files rather than preservation (.tab) files are downloaded from Dataverse.

This is exactly what I did:

Fix Binder and Whole Tale (repo2docker) to download original files rather than archival .tab files #9374
[MRG] download original file formats from Dataverse #1242 jupyterhub/repo2docker#1253

pdurbin · 2023-11-11T15:21:40Z

Closing in favor of this issue:

Switch preservation format extension from .tab to .tsv #2720

This was referenced Jul 10, 2019

Dataverse mangles "original format" filename extensions #6002

Closed

Switch preservation format extension from .tab to .tsv #2720

Closed

djbrooke added the Medium label Jul 10, 2019

djbrooke self-assigned this Jul 23, 2019

pdurbin mentioned this issue Jul 24, 2019

Add originalFileName field to json #2734

Closed

adam3smith mentioned this issue Jun 20, 2021

Display deposited (rather than ingested) copy of tabular files #7956

Open

pdurbin added the Vote to Close: pdurbin label Oct 5, 2022

mreekie removed the sz.Medium label Jan 11, 2023

pdurbin unassigned djbrooke Oct 7, 2023

pdurbin closed this as not planned Nov 11, 2023
