Concept of a `File` necessary? #14

mih · 2023-12-21T15:36:22Z

This is a common concept, and seems to suggest itself naturally. #58 also includes it.

However, it comes with problems too, in particular in the datalad context.

- What would be a suitable identifier for a File?
- Is this a versioned construct (implying that an identifier also needs to include/reference a version)?
- Is there a difference between a File and its content? If not, what about two files with different names and identical content?

mih · 2023-12-21T15:40:07Z

https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/datalad-datasets.yaml tries to avoid the concept of File (not entirely, but almost).

The closest equivalent is a DirectoryItem. Its role is simply and solely to assign a (unique) name in a namespace, and that namespace is a single directory. Any DirectoryItem has content, and that content is either a directory or file content (blob).

With such a concept, the majority of metadata is attributed to the file content, and DirectoryItem is merely a contextual helper that registers content in a container.
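To make the idea concrete, here is a minimal Python sketch of the relationship described above. It is not the LinkML schema itself: the class layout, the `checksum` and `byte_size` fields, and the `add` helper are assumptions chosen for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class FileContent:
    """A blob. Most metadata (size, checksum) attaches here, not to a name."""
    checksum: str   # hypothetical field name
    byte_size: int  # hypothetical field name


@dataclass
class DirectoryItem:
    """Assigns a name, unique within one directory, to some content."""
    name: str
    content: Union[FileContent, "Directory"]


@dataclass
class Directory:
    """A namespace: a container of uniquely named items."""
    items: list[DirectoryItem] = field(default_factory=list)

    def add(self, name: str, content: Union[FileContent, "Directory"]) -> DirectoryItem:
        # Names must be unique within this one directory only.
        if any(item.name == name for item in self.items):
            raise ValueError(f"name {name!r} already used in this directory")
        item = DirectoryItem(name, content)
        self.items.append(item)
        return item


# Two names registered for the same content object.
blob = FileContent(checksum="d41d8cd98f00b204e9800998ecf8427e", byte_size=0)
root = Directory()
a = root.add("a.txt", blob)
b = root.add("copy-of-a.txt", blob)
```

In this sketch, two files with different names and identical content are simply two `DirectoryItem` records pointing at one `FileContent`, so no separate `File` identifier is required.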

jsheunis · 2023-12-21T19:54:09Z

Thanks for the pointer. I looked at DirectoryItem and FileContent and I think these classes cover all bases, at least from the perspective of a ResearchDataset from #58 which only defines md5sum, url, path and size. I can refactor my code to make use of these existing classes.

One thought I had was whether these concepts of DirectoryItem and FileContent etc fit specifically into the context of datalad-datasets (where they are currently located), or whether they are generic enough to be defined together with the generic Dataset and DatasetVersion concepts?

mih · 2023-12-21T19:59:39Z

The location of any of these is temporary. All classes are drafts. If we find a second use case for them now, it makes sense to move them elsewhere.

jsheunis · 2023-12-22T13:17:40Z

There's something that I don't quite grasp how to map onto existing concepts and procedures yet. I also don't know yet exactly what my question is, so I'm putting down a progression of thoughts.

Let's look at the concept of a file (and a dataset being a collection of files) from the perspective of a user generating metadata from local files or entering metadata into some GUI (web-based or not). The ideal is to make them do the least amount of work necessary to generate the maximum amount of useful metadata. Let's say they want the metadata to include the complete file list. What they could conceivably do would be to:

- run a script that generates a complete file list (using basic command line tools or something like status2tabby), or
- point the GUI to a local directory where the GUI would run a similar script, or
- hand-edit a sheet with a list of files

The important part is that the resulting file list is in a format that will validate against the dataset schema. But also, such a format should not be too complicated for users to generate. A flat file list would be the simplest, with files as rows and a column each for properties such as path (relative to root), file_size, checksum, access_url, etc.
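As a rough illustration of such a flat list, the following Python sketch walks a local directory and emits one row per file. The function name `flat_file_list` and the exact column names are assumptions; `access_url` is left out because it cannot be derived from local files alone, and MD5 is used only because `md5sum` is mentioned earlier in this thread.

```python
import hashlib
from pathlib import Path


def flat_file_list(root: str) -> list[dict]:
    """Return one row per file under root: relative path, size, MD5 checksum."""
    root_path = Path(root)
    rows = []
    for p in sorted(root_path.rglob("*")):
        if not p.is_file():
            continue  # directories get no row of their own in a flat list
        rows.append({
            "path": p.relative_to(root_path).as_posix(),
            "file_size": p.stat().st_size,
            # Reads the whole file into memory; fine for a sketch only.
            "checksum": hashlib.md5(p.read_bytes()).hexdigest(),
        })
    return rows
```

Each returned dict corresponds to one row of the sheet, so the result could be written out with `csv.DictWriter` or compared against a hand-edited list.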

So the question becomes, is the format that the users provide their file lists in (with help from a machine or not) the exact same as the format that the schema defines? Or is there some translation layer in between? Or can the schema be defined in such a way, using classes that inherit from superclasses, that the translation of a complicated structure to a flat list is implicitly dealt with inside the schema?

Using our existing work, can Directory and DirectoryItem and FileContent somehow be brought into a single class File used by the research-dataset-schema? Or do we need a translation step?

Something else to keep in mind is the high likelihood that automated processes will run on top of the schema to generate e.g. online forms. And a form that asks you to enter a flat list of files is much more desirable than a form that asks you to enter a directory, then several directory items, etc, etc.

mih · 2023-12-22T15:58:53Z

From my POV the needs and solutions you describe are "front-end". To put it bluntly, the input convenience is bought by ignoring the true nature of the underlying concepts.

If a tool facilitates the entry of metadata on an unversioned data "archive", it can take shortcuts and it can use a simplified schema (geared towards simplicity and usage such as form generation). But this would be different from a structure and terminology used for a generic data model (which also must be able to capture more complex cases, such as nesting, versioning, redundant availability), yet still yield a sensible, homogeneous representation.

In short: yes, translation/mapping needed.

This should not be an uncommon need, hence it needs to be, and will be, supported well.
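A translation step of this kind could look roughly like the Python sketch below, which nests flat rows (one per file) into directories. This is only an assumption about what such a mapping might do: `Blob` and `flat_to_tree` are hypothetical names, the checksums are placeholders, and a real mapping would target the schema's own classes and validate conflicting paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for file content in the nested model."""
    file_size: int
    checksum: str


def flat_to_tree(rows):
    """Nest flat rows into dicts: a directory maps item name to content."""
    root = {}
    for row in rows:
        *dirs, name = row["path"].split("/")
        node = root
        for d in dirs:
            # Create intermediate directories on demand.
            node = node.setdefault(d, {})
        node[name] = Blob(row["file_size"], row["checksum"])
    return root


# Front-end input: a flat list, as a user or form might provide it.
rows = [
    {"path": "README.md", "file_size": 12, "checksum": "aaa"},
    {"path": "sub/data.csv", "file_size": 34, "checksum": "bbb"},
]
tree = flat_to_tree(rows)
```

The flat list stays the convenient input format, while the nested result keeps names (dict keys) separate from content (`Blob`), in line with the generic model.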

mih · 2023-12-24T14:54:03Z

psychoinformatics-de/datalad-schema#15 brings another case like this: a model of a Git commit. From the Git data model perspective things are simple. A commit is

- a tree
- a user record (plus timestamp) for the commit
- a second user record (plus timestamp) for the authorship
- a list of any parent commits

A fairly sensible model could be a flat set of properties for each of these aspects. However, those would have quite complex (or narrow) semantics.

psychoinformatics-de/datalad-schema#15 uses a PROV-inspired approach. Rather than direct properties, it records the provenance of a commit as two activities (the authoring of the new state vs the committing). This yields a more complex data structure, but each element has simpler (more generically understood) semantics.
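The contrast between the two shapes can be sketched in Python as follows. This is not the model from datalad-schema#15; all class and field names (`FlatCommit`, `Activity`, `ProvCommit`, `to_prov`, `role`, `ended_at`) are assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FlatCommit:
    """Flat variant: every property has narrow, commit-specific semantics."""
    tree: str
    author_name: str
    author_date: str
    committer_name: str
    committer_date: str
    parents: list[str]


@dataclass
class Activity:
    """Generic record: some agent did something that ended at some time."""
    agent: str
    ended_at: str
    role: str  # e.g. "authoring" or "committing"


@dataclass
class ProvCommit:
    """PROV-inspired variant: more structure, but generic building blocks."""
    tree: str
    parents: list[str]
    activities: list[Activity]


def to_prov(c: FlatCommit) -> ProvCommit:
    """Map the flat record onto two activities."""
    return ProvCommit(
        tree=c.tree,
        parents=list(c.parents),
        activities=[
            Activity(c.author_name, c.author_date, "authoring"),
            Activity(c.committer_name, c.committer_date, "committing"),
        ],
    )
```

The four author/committer properties of the flat variant collapse into two instances of one generic `Activity` type, which is the trade-off described above.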

mih · 2024-01-03T11:22:11Z

#31 brings some changes in this regard. It follows the model of DCAT that distinguishes abstract/conceptual resources that are realized with concrete distributions.

For datalad we can keep that distinction to express how one and the same file can be available from multiple remotes. The DCAT notion is more flexible, it allows for a resource's nature to change considerably (file formats, etc) between distributions.

For DataLad we do not need this flexibility, but it does not hurt to have the base model offer this expressiveness.
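A minimal Python sketch of the resource/distribution split, under stated assumptions: the class and field names are illustrative and are not taken from #31 or from the DCAT vocabulary verbatim, and the URLs and checksum are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    """One concrete way to obtain a resource (hypothetical fields)."""
    access_url: str
    media_type: str = "application/octet-stream"


@dataclass
class Resource:
    """The abstract thing; its identity does not depend on where it is hosted."""
    checksum: str
    distributions: list = field(default_factory=list)

    def same_form_everywhere(self) -> bool:
        # The DataLad case: all distributions carry the same form.
        # DCAT would also allow differing formats per distribution.
        return len({d.media_type for d in self.distributions}) <= 1


# One file, available from two remotes.
f = Resource(checksum="placeholder-checksum")
f.distributions.append(Distribution("https://example.org/remote-a/data.csv"))
f.distributions.append(Distribution("https://example.org/remote-b/data.csv"))
```

Here redundant availability is expressed as several `Distribution` entries on one `Resource`, without introducing a second identity for the file.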

It is not necessary, as far as I can see now. Closes #14

mih changed the title ~~Concept of a File necessary~~ Concept of a File necessary? Dec 22, 2023

mih mentioned this issue Dec 28, 2023

List of Files vs Tree of directories #20

Closed

jsheunis mentioned this issue Feb 22, 2024

Elements of a non-datalad dataset schema #46

Closed

This was referenced Feb 25, 2024

Update qualified_part pattern for datasets #49

Closed

Various concept simplifications #54

Merged

mih added a commit that referenced this issue Feb 26, 2024

Remove the concept of a File

2afc2ce

It is not necessary, as far as I can see now. Closes #14

mih closed this as completed in #54 Feb 26, 2024
