Concept of a File necessary? #14

Closed
mih opened this issue Dec 21, 2023 · 7 comments · Fixed by #54

Comments

@mih
Contributor

mih commented Dec 21, 2023

This is a common concept, and seems to suggest itself naturally. #58 also includes it.

However, it comes with problems too, in particular in the datalad context.

  • what would be a suitable identifier for a File?
  • is this a versioned construct (implying that an identifier also needs to include/reference a version)?
  • is there a difference between a File and its content? If not, what about two files with different names and identical content?
@mih
Contributor Author

mih commented Dec 21, 2023

https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/linkml/datalad-datasets.yaml tries to avoid the concept of File (not entirely, but almost).

The closest equivalent is a DirectoryItem. Its role is simply and solely to assign a (unique) name in a namespace, and that namespace is a single directory. Any DirectoryItem has content, and that content is either a directory or file content (blob).

With such a concept, the majority of metadata is attributed to the file content, and DirectoryItem is merely a contextual helper that registers content in a container.
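
As an illustration only, the relationship described above could be sketched with plain Python dataclasses. This is not the LinkML schema from the linked file; the property names `checksum` and `byte_size`, the `add` helper, and the example values are assumptions made for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class FileContent:
    """A blob. Most metadata would be attributed here."""
    checksum: str   # hypothetical property name
    byte_size: int  # hypothetical property name


@dataclass
class DirectoryItem:
    """Assigns a name to content within one directory, nothing more."""
    name: str
    content: Union[Directory, FileContent]


@dataclass
class Directory:
    """A single namespace: a container of uniquely named items."""
    items: list[DirectoryItem] = field(default_factory=list)

    def add(self, name: str, content: Union[Directory, FileContent]) -> None:
        # A name must be unique within its directory.
        if any(item.name == name for item in self.items):
            raise ValueError(f"name already registered: {name!r}")
        self.items.append(DirectoryItem(name=name, content=content))


# Two names, one content object (placeholder values).
blob = FileContent(checksum="placeholder-checksum", byte_size=3)
root = Directory()
root.add("a.txt", blob)
root.add("copy-of-a.txt", blob)
```

In this sketch, two files with different names and identical content are simply two DirectoryItem records pointing at the same FileContent, which is one way to read the third question raised at the top of this issue.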

@jsheunis
Contributor

Thanks for the pointer. I looked at DirectoryItem and FileContent and I think these classes cover all bases, at least from the perspective of a ResearchDataset from #58 which only defines md5sum, url, path and size. I can refactor my code to make use of these existing classes.

One thought I had was whether these concepts of DirectoryItem and FileContent etc fit specifically into the context of datalad-datasets (where they are currently located), or whether they are generic enough to be defined together with the generic Dataset and DatasetVersion concepts?

@mih
Contributor Author

mih commented Dec 21, 2023

The location of any of these is temporary. All classes are drafts. If we find a second use case for them now, it makes sense to move them elsewhere.

@mih mih changed the title Concept of a File necessary Concept of a File necessary? Dec 22, 2023
@jsheunis
Contributor

jsheunis commented Dec 22, 2023

There's something that I don't quite grasp how to map onto existing concepts and procedures yet. I also don't know yet exactly what my question is, so I'm putting down a progression of thoughts.

Let's look at the concept of a file (and a dataset being a collection of files) from the perspective of a user generating metadata from local files or entering metadata into some GUI (web-based or not). The ideal is to make them do the least amount of work necessary to generate the maximum amount of useful metadata. Let's say they want the metadata to include the complete file list. What they could conceivably do would be to:

  1. run a script that generates a complete file list (using basic command line tools or something like status2tabby), or
  2. point the GUI to a local directory where the GUI would run a similar script, or
  3. hand-edit a sheet with a list of files

The important part is that the resulting file list is in a format that will validate against the dataset schema. But also, such a format should not be too complicated for users to generate. A flat file list would be the simplest, with files as rows and a column each for properties such as path (relative to root), file_size, checksum, access_url, etc.
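
A minimal sketch of such a script, using only the Python standard library, might look like the following. It is not status2tabby; the function names `flat_file_list` and `to_csv` are hypothetical, MD5 is picked only because #58 mentions md5sum, and `access_url` is left empty because it cannot be derived from a local directory.

```python
import csv
import hashlib
import io
from pathlib import Path

COLUMNS = ["path", "file_size", "checksum", "access_url"]


def flat_file_list(root: Path) -> list[dict]:
    """Return one row per file below `root`, with paths relative to `root`."""
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rows.append({
            "path": path.relative_to(root).as_posix(),
            "file_size": path.stat().st_size,
            # Reads the whole file into memory; fine for a sketch.
            "checksum": hashlib.md5(path.read_bytes()).hexdigest(),
            "access_url": "",  # unknown for purely local files
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize the rows as a flat table: files as rows, properties as columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(flat_file_list(Path(".")))` would produce a table for the current directory that a user could also edit by hand.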

So the question becomes, is the format that the users provide their file lists in (with help from a machine or not) the exact same as the format that the schema defines? Or is there some translation layer in between? Or can the schema be defined in such a way, using classes that inherit from superclasses, that the translation of a complicated structure to a flat list is implicitly dealt with inside the schema?

Using our existing work, can Directory and DirectoryItem and FileContent somehow be brought into a single class File used by the research-dataset-schema? Or do we need a translation step?

Something else to keep in mind is the high likelihood that automated processes will run on top of the schema to generate e.g. online forms. And a form that asks you to enter a flat list of files is much more desirable than a form that asks you to enter a directory, then several directory items, etc, etc.

@mih
Contributor Author

mih commented Dec 22, 2023

From my POV the needs and solutions you describe are "front-end". To put it bluntly, the input convenience is bought by ignoring the true nature of the underlying concepts.

If a tool facilitates the entry of metadata on an unversioned data "archive", it can take shortcuts and it can use a simplified schema (geared towards simplicity and usage such as form generation). But this would be different from a structure and terminology used for a generic data model (which also must be able to capture more complex cases, such as nesting, versioning, redundant availability), yet still yield a sensible, homogeneous representation.

In short: yes, translation/mapping needed.

This should not be an uncommon need, hence it needs to be, and will be, supported well.
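
To make the idea of a translation layer concrete, here is a rough sketch that maps a flat, front-end file list onto a nested directory structure and back. The dictionary layout (`type`, `items`, `properties`) and the function names `nest` and `flatten` are assumptions for illustration and do not reflect the actual schema. The sketch also assumes consistent input: no path uses a file name as a parent directory.

```python
def _new_directory() -> dict:
    return {"type": "directory", "items": {}}


def nest(rows: list[dict]) -> dict:
    """Translate flat rows (each with a '/'-separated 'path') into nested directories."""
    root = _new_directory()
    for row in rows:
        *parents, name = row["path"].split("/")
        node = root
        for part in parents:
            # Create intermediate directories on demand.
            node = node["items"].setdefault(part, _new_directory())
        properties = {k: v for k, v in row.items() if k != "path"}
        node["items"][name] = {"type": "file", "properties": properties}
    return root


def flatten(node: dict, prefix: str = "") -> list[dict]:
    """Inverse of `nest`: walk the directories and emit one row per file."""
    rows = []
    for name, child in sorted(node["items"].items()):
        path = f"{prefix}{name}"
        if child["type"] == "directory":
            rows.extend(flatten(child, prefix=f"{path}/"))
        else:
            rows.append({"path": path, **child["properties"]})
    return rows
```

With a mapping like this, a form or spreadsheet can stay flat while the stored representation keeps the directory structure.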

@mih
Contributor Author

mih commented Dec 24, 2023

psychoinformatics-de/datalad-schema#15 brings another case like this: a model of a Git commit. From the Git data model perspective things are simple. A commit is

  • a tree
  • a user record (plus timestamp) for the commit
  • a second user record (plus timestamp) for the authorship
  • a list of any parent commits

A fairly sensible model could be a flat set of properties for each of these aspects. However, those would have quite complex (or narrow) semantics.

psychoinformatics-de/datalad-schema#15 uses a PROV-inspired approach. Rather than direct properties, it records the provenance of a commit as two activities (the authoring of the new state vs the committing). This yields a more complex data structure, but each element has simpler (more generically understood) semantics.
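
The contrast between the two shapes can be sketched as follows. This is not the model from psychoinformatics-de/datalad-schema#15; all class and field names here (`FlatCommit`, `Activity`, `ProvCommit`, `ended_at`, and so on) are hypothetical and chosen only to show the difference in structure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FlatCommit:
    """Flat variant: one narrowly defined property per aspect of a commit."""
    tree: str
    author: str
    author_time: str
    committer: str
    committer_time: str
    parents: tuple[str, ...]


@dataclass(frozen=True)
class Activity:
    """Generic record: some agent did something, finished at some time."""
    agent: str
    ended_at: str


@dataclass(frozen=True)
class ProvCommit:
    """PROV-inspired variant: the same facts as two generic activities."""
    tree: str
    authoring: Activity
    committing: Activity
    parents: tuple[str, ...]


def to_prov(commit: FlatCommit) -> ProvCommit:
    """Map the flat properties onto the two activities."""
    return ProvCommit(
        tree=commit.tree,
        authoring=Activity(commit.author, commit.author_time),
        committing=Activity(commit.committer, commit.committer_time),
        parents=commit.parents,
    )
```

The flat class needs four properties with commit-specific meaning (`author`, `author_time`, `committer`, `committer_time`), whereas the second variant reuses one generic `Activity` type twice, at the cost of an extra level of nesting.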

@mih
Contributor Author

mih commented Jan 3, 2024

#31 brings some changes in this regard. It follows the model of DCAT that distinguishes abstract/conceptual resources that are realized with concrete distributions.

For DataLad we can keep that distinction to express how one and the same file can be available from multiple remotes. The DCAT notion is more flexible: it allows for a resource's nature to change considerably (file formats, etc.) between distributions.

For DataLad we do not need this flexibility, but it does not hurt to have the base model offer this expressiveness.
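
A small sketch of the resource/distribution split, under the assumption that each distribution carries an access URL, a media type, and a checksum. The class and field names are illustrative (loosely following DCAT terminology) and are not taken from #31; the URLs and checksums are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Distribution:
    """One concrete way to obtain a resource."""
    access_url: str
    media_type: str
    checksum: str


@dataclass
class Resource:
    """The abstract/conceptual thing that distributions realize."""
    name: str
    distributions: list = field(default_factory=list)

    def has_single_content(self) -> bool:
        """True if every distribution delivers byte-identical content."""
        return len({d.checksum for d in self.distributions}) <= 1


# DataLad-like case: one and the same file, available from two remotes.
same_file = Resource("table", [
    Distribution("https://example.org/remote-a/table.csv", "text/csv", "placeholder-1"),
    Distribution("https://example.org/remote-b/table.csv", "text/csv", "placeholder-1"),
])

# More general DCAT-like case: the format differs between distributions.
converted = Resource("table", [
    Distribution("https://example.org/table.csv", "text/csv", "placeholder-1"),
    Distribution("https://example.org/table.json", "application/json", "placeholder-2"),
])
```

Both cases fit the same base model; the DataLad use case would just be the subset where `has_single_content()` holds.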

mih added a commit that referenced this issue Feb 26, 2024
It is not necessary, as far as I can see now.

Closes #14
@mih mih closed this as completed in #54 Feb 26, 2024