Update `qualified_part` pattern for datasets #49

mih · 2024-02-25T17:48:18Z

This relates to #14

Shower-thoughts:

I believe we should make the file tree of a dataset/commit an explicit property, rather than a mere qualified_part. These are more specific parts: a subset of Git-tracked items (blobs or trees).
Likely, a dedicated Tree/Directory class will be the range of such a slot.
With such a class available (derived from GitTracked, with a corresponding *SE class that features an identifier), Git-like trees can be built. Hence we can shed the notion of PosixRelPath from the GitTrackedQualifiedPart: all that is left are direct children of a tree/directory, and those would have a simple name (maybe with a dedicated type that ensures a valid POSIX filename, i.e. no /).

This change should result in a more Git-like data model. With trees being Git-tracked, we should also harvest the metadata volume savings that come from being able to re-reference unchanged trees in subsequent version records.
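To make the re-referencing idea concrete, here is a minimal Python sketch, not the actual data model. The names `Blob`, `Tree`, and `valid_name` are hypothetical, and the hashing scheme is a simplified stand-in for Git's real object format. It only illustrates that a tree with direct, simply named children gets an identifier that stays stable when its content is unchanged, so a later version record can point at the same tree again.

```python
import hashlib
from dataclasses import dataclass


def valid_name(name: str) -> str:
    """Accept only a single POSIX path component (hypothetical helper)."""
    if not name or "/" in name or "\0" in name or name in (".", ".."):
        raise ValueError(f"not a valid POSIX filename: {name!r}")
    return name


@dataclass(frozen=True)
class Blob:
    content: bytes

    @property
    def identifier(self) -> str:
        # Simplified content hash; NOT Git's actual blob hashing.
        return hashlib.sha1(b"blob:" + self.content).hexdigest()


@dataclass(frozen=True)
class Tree:
    # Direct children only, as (name, Blob-or-Tree) pairs; no relative paths.
    children: tuple

    @property
    def identifier(self) -> str:
        # Identifier is derived from child names and child identifiers,
        # so an unchanged subtree keeps its identifier across versions.
        h = hashlib.sha1(b"tree:")
        for name, child in sorted(self.children, key=lambda c: c[0]):
            h.update(valid_name(name).encode() + b"\0" + child.identifier.encode())
        return h.hexdigest()


docs = Tree((("README", Blob(b"hello")),))
v1 = Tree((("docs", docs), ("data.csv", Blob(b"1,2"))))
v2 = Tree((("docs", docs), ("data.csv", Blob(b"1,2,3"))))
```

Here `v1` and `v2` have different identifiers, but both reference `docs` under the same identifier, so the second version record would not need to describe that subtree again.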


mih · 2024-02-25T21:38:59Z

It may be that this file tree is simply the .distribution of a Dataset...

This could lead to a situation where we only list subdatasets as "parts", and their respective trees as part of the superdataset distribution.

mih · 2024-02-26T07:01:05Z

OK, after some more thinking, I believe an update in this direction has the potential to simplify things. Concretely:

Create subclasses of Distribution for GitBlob, AnnexKey, and Directory (with a DirectoryItem companion).
Remove the qualified_part property from DataladDatasetVersion.
Support a distribution property for DataladDatasetVersion and set its range to Directory(Distribution).
Limit DataladDatasetVersion.has_part to a range of DataladDatasetVersion.
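The four steps above could look roughly like the following Python sketch. This is an illustration only: the schema itself is not written in Python, the class names are taken from the list above, and all attribute names other than `distribution` and `has_part` (`gitsha`, `key`, `name`, `content`, `items`) are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Distribution:
    """Base class, loosely in the sense of a DCAT distribution."""


@dataclass
class GitBlob(Distribution):
    gitsha: str  # hypothetical attribute name


@dataclass
class AnnexKey(Distribution):
    key: str  # hypothetical attribute name


@dataclass
class DirectoryItem:
    name: str  # simple filename of a direct child, no "/"
    content: Distribution


@dataclass
class Directory(Distribution):
    items: list[DirectoryItem] = field(default_factory=list)


@dataclass
class DataladDatasetVersion:
    # No qualified_part anymore: files live in the distribution only.
    distribution: Directory
    # has_part is limited to other dataset versions (subdatasets).
    has_part: list[DataladDatasetVersion] = field(default_factory=list)


dsv = DataladDatasetVersion(
    distribution=Directory(
        items=[
            DirectoryItem("README", GitBlob(gitsha="<sha>")),
            DirectoryItem("big.dat", AnnexKey(key="<annex-key>")),
        ]
    )
)
```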

Conceptually, this introduces a number of changes.

From a DCAT model perspective, files in a dataset are no longer individually catalogued resources. They are merely parts of the distribution of a dataset. This is in line with our real-world concept of a DataLad dataset, and also with that of datalad-catalog.
A tree/directory becomes a recognized entity and enables a more efficient representation of a multi-version dataset.
The technical concepts of an AnnexedFile or a GitBlob are further disentangled from the catalog-level dataset records (they are now confined to the realm of distributions).
Within a DataladDatasetVersion.distribution, the subtree of a nested DataladDatasetVersion is represented as a Directory or Tree. The has_part relationship to another DataladDatasetVersion is not bound to a particular "mountpoint"; instead, that other dataset object references the same tree as its own distribution.
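The last point can be illustrated with a small sketch, again under assumptions: the classes are reduced to the bare minimum, directory children are kept in a plain dict, and `mountpoints` is a hypothetical helper. It shows that the location of a subdataset need not be stored on `has_part`, because it can be recovered by finding where the subdataset's own distribution tree occurs inside the superdataset's tree.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity: "same tree" means same object
class Directory:
    items: dict[str, object] = field(default_factory=dict)


@dataclass(eq=False)
class DataladDatasetVersion:
    distribution: Directory
    has_part: list[DataladDatasetVersion] = field(default_factory=list)


def mountpoints(tree: Directory, target: Directory, prefix: str = "") -> list[str]:
    """Return every path in `tree` at which `target` appears."""
    found = []
    for name, child in tree.items.items():
        path = f"{prefix}{name}"
        if child is target:
            found.append(path)
        elif isinstance(child, Directory):
            found.extend(mountpoints(child, target, path + "/"))
    return found


sub = DataladDatasetVersion(Directory({"data.csv": "blob-a"}))
sup = DataladDatasetVersion(
    Directory({"README": "blob-b", "inputs": Directory({"raw": sub.distribution})}),
    has_part=[sub],  # no mountpoint recorded here
)

paths = mountpoints(sup.distribution, sub.distribution)
```

In a real record the match would be made on the tree's identifier rather than on Python object identity.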

…stribution)` It also introduces the concept of a `GitTree(FilesystemDirectory)`, and makes such an instance a required property of a `Commit`. Work towards #49

mih mentioned this issue Feb 26, 2024

Various concept simplifications #54

Merged

mih closed this as completed in #54 Feb 26, 2024
