Update qualified_part pattern for datasets #49

Closed
mih opened this issue Feb 25, 2024 · 2 comments · Fixed by #54

Comments

@mih
Contributor

mih commented Feb 25, 2024

This relates to #14

Shower-thoughts:

  • I believe we should make the file-tree of a dataset/commit an explicit property, rather than a mere qualified_part. These are more specific parts -- a subset of git-tracked items: blobs or trees.
  • likely a dedicated Tree/Directory class will be the range of such a slot
  • with such a class available (and derived from GitTracked, with a corresponding *SE class that features an identifier), Git-like trees can be built. Hence we can shed the notion of PosixRelPath from the GitTrackedQualifiedPart: all that is left are direct children of a tree/directory, and those would have a simple name (maybe with a dedicated type that ensures a valid POSIX filename, i.e. no /)

This change should result in a more Git-like data model. With trees being Git-tracked, we should also harvest the metadata volume savings that come from being able to re-reference unchanged trees in subsequent version records.
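A minimal Python sketch of the idea, for illustration only. The names `Blob`, `Tree`, `valid_name`, and `collect_records` are hypothetical and not part of the schema, and the hashing scheme is a simplified stand-in rather than Git's actual object format. It shows trees that only know their direct children by simple name, and how content-derived identifiers let an unchanged tree be recorded once across versions.

```python
import hashlib


def valid_name(name: str) -> str:
    """Accept only a single path component (hypothetical filename type)."""
    if not name or "/" in name or "\0" in name or name in (".", ".."):
        raise ValueError(f"not a valid filename: {name!r}")
    return name


class Blob:
    def __init__(self, content: bytes):
        self.content = content

    @property
    def identifier(self) -> str:
        # Simplified content hash; NOT Git's real blob hashing.
        return hashlib.sha1(b"blob:" + self.content).hexdigest()


class Tree:
    def __init__(self, children: dict):
        # Only direct children, keyed by simple name (no relative paths).
        self.children = {valid_name(n): c for n, c in children.items()}

    @property
    def identifier(self) -> str:
        # Derived from child names and child identifiers, so an unchanged
        # subtree keeps the same identifier in every version.
        h = hashlib.sha1(b"tree:")
        for name in sorted(self.children):
            child_id = self.children[name].identifier
            h.update(name.encode() + b"\0" + child_id.encode() + b"\n")
        return h.hexdigest()


def collect_records(node, records: dict) -> dict:
    """Store one record per identifier; shared subtrees are stored once."""
    records[node.identifier] = node
    if isinstance(node, Tree):
        for child in node.children.values():
            collect_records(child, records)
    return records


docs = Tree({"README": Blob(b"hello")})
v1 = Tree({"docs": docs, "data.csv": Blob(b"1,2\n")})
v2 = Tree({"docs": docs, "data.csv": Blob(b"1,2\n3,4\n")})

records = {}
collect_records(v1, records)
collect_records(v2, records)
```

Here `v2` changes only `data.csv`, so the `docs` tree and its blob are re-referenced rather than recorded again.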

@mih
Contributor Author

mih commented Feb 25, 2024

It may be that this file tree is simply the .distribution of a Dataset...

This could lead to a situation where we only list subdatasets as "parts", and their respective trees as part of the superdataset distribution.

@mih
Contributor Author

mih commented Feb 26, 2024

OK, after some more thinking, I believe an update in this direction has the potential to simplify things. Concretely:

  • Create subclasses of Distribution for GitBlob, AnnexKey, and Directory (with a DirectoryItem companion).
  • Remove the qualified_part property from DataladDatasetVersion
  • Support a distribution property for DataladDatasetVersion and set its range to Directory(Distribution)
  • Limit DataladDatasetVersion.has_part to a range of DataladDatasetVersion
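The proposed layout can be sketched with Python dataclasses. This is an illustration of the class relationships only, not the schema definition itself: the class names come from the list above, while the fields `identifier`, `name`, `target`, and `items` and the runtime range checks are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Distribution:
    identifier: Optional[str] = None  # hypothetical field


@dataclass
class GitBlob(Distribution):
    pass


@dataclass
class AnnexKey(Distribution):
    pass


@dataclass
class DirectoryItem:
    name: str             # simple name of a direct child (assumed field)
    target: Distribution  # GitBlob, AnnexKey, or nested Directory


@dataclass
class Directory(Distribution):
    items: List[DirectoryItem] = field(default_factory=list)


@dataclass
class DataladDatasetVersion:
    # No qualified_part: files are reachable only via the distribution.
    distribution: Directory
    has_part: List["DataladDatasetVersion"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Emulate the proposed slot ranges with runtime checks.
        if not isinstance(self.distribution, Directory):
            raise TypeError("distribution must be a Directory")
        for part in self.has_part:
            if not isinstance(part, DataladDatasetVersion):
                raise TypeError("has_part only accepts DataladDatasetVersion")


tree = Directory(items=[
    DirectoryItem("README.md", GitBlob()),
    DirectoryItem("big.dat", AnnexKey()),
])
sub = DataladDatasetVersion(distribution=Directory())
ds = DataladDatasetVersion(distribution=tree, has_part=[sub])
```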

Conceptually, this introduces a number of changes.

  • From a DCAT model perspective, files in a dataset are no longer individually catalogued resources. They are merely parts of the distribution of a dataset. This is in line with our real-world concept of a DataLad dataset and also that of datalad-catalog.
  • A tree/directory becomes a recognized entity and enables a more efficient representation of a multi-version dataset
  • The technical concepts of an AnnexedFile or a GitBlob are further disentangled from the catalog-level dataset records (now confined to the realm of distributions).
  • Within a DataladDatasetVersion.distribution, the subtree of a nested DataladDatasetVersion is represented as a Directory or Tree. The has_part relationship to another DataladDatasetVersion is not bound to a particular "mountpoint"; instead, that other dataset object references the same tree as its own distribution.
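One possible reading of the last point, sketched in Python: if a subdataset references the same tree that appears inside the superdataset's distribution, the mountpoint need not be stored on has_part and could be recovered by matching tree identifiers. The classes below are minimal stand-ins, and the `mountpoints` helper and `children` field are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Directory:
    identifier: str
    # name -> nested Directory; files are omitted for brevity
    children: Dict[str, "Directory"] = field(default_factory=dict)


@dataclass
class DataladDatasetVersion:
    distribution: Directory
    has_part: List["DataladDatasetVersion"] = field(default_factory=list)


def mountpoints(ds: DataladDatasetVersion):
    """Yield (path, part) wherever a part's tree occurs in ds.distribution."""
    by_tree = {p.distribution.identifier: p for p in ds.has_part}

    def walk(directory: Directory, prefix: tuple):
        for name, child in directory.children.items():
            path = prefix + (name,)
            if child.identifier in by_tree:
                yield "/".join(path), by_tree[child.identifier]
            else:
                yield from walk(child, path)

    yield from walk(ds.distribution, ())


sub_tree = Directory("tree-sub")
sub = DataladDatasetVersion(distribution=sub_tree)
super_tree = Directory(
    "tree-super",
    {"inputs": Directory("tree-inputs", {"raw": sub_tree})},
)
superds = DataladDatasetVersion(distribution=super_tree, has_part=[sub])
found = list(mountpoints(superds))
```

In this sketch the path `inputs/raw` is derived from the superdataset's tree, while `has_part` itself carries no location information.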

mih added a commit that referenced this issue Feb 26, 2024
…stribution)`

It also introduces the concept of a `GitTree(FilesystemDirectory)`, and
makes such an instance a required property of a `Commit`.

Work towards #49
@mih mih closed this as completed in #54 Feb 26, 2024