Dataset serialization to tabby-records, and deserialization of tabby-records to datasets #48

christian-monch commented Jul 10, 2023

We would like to add the feature of serializing a dataset version, i.e. a git commit, to a tabby-record, and of deserializing a tabby-record into a dataset. This issue is meant as a discussion hub to track open questions and answers, and to link to implementation issues.

There are a number of open questions w.r.t. serialization (in no particular order):

  1. How is tabby-stored dataset metadata, e.g. "dataset description", represented in the deserialized form, i.e. in the dataset?
    1.1. Metadata could be stored in a location inside the created dataset, e.g. in .git/datalad/tabby/dataset-metadata.json. That would lead to different dataset versions if the metadata changes.
    1.2. Metadata could be ignored, but then re-serializing a deserialized dataset would not yield a correct tabby-record, because mandatory fields could not be filled in automatically.
  2. Should the serialization/deserialization map directly from tabby to git, or should it map from tabby to JSON-LD and from JSON-LD to git, i.e. should it be "tabby -> git" or "tabby -> JSON-LD -> git"? The serialization would then be the corresponding reverse function, i.e. either "git -> tabby" or "git -> JSON-LD -> tabby". Using JSON-LD as an intermediary format would increase the versatility of the approach, but requires more tooling. NB: both approaches should be identical in their computational power, i.e. both should be able to fulfill the requirements. What should be done?
    2.1. Since we are already converting (or planning to convert) tabby-records to JSON-LD structures, it seems sensible to base the final solution on that model, especially because it could allow exposing information about the datasets in other contexts, e.g. in external search engines. (The implementation could still start with a direct mapping from datalad-tabby to datasets and vice versa; a sketch of the two-stage variant follows this list.)
  3. How should content be referenced in the serialized form?
    3.1. Content name and content location could be stored in the serialized form, and deserialization could retrieve the content. That would require extended records in the tabby-file-record, i.e. a "location" column could be added. If JSON-LD is used, the respective columns would also have to exist in the JSON-LD documents.
    3.2. Content could just be referenced, and a custom git remote helper could be used to access it.
  4. Should annexed content be supported?
    4.1. It would be useful to support annexed content, in order to work with the large number of datasets that use git-annex, especially DataLad datasets. That would require extended records in the tabby-file-record and their representation in an intermediary JSON-LD format.
  5. How do we handle sub-datasets?
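
To make question 2 more concrete, here is a minimal sketch of the two-stage variant ("tabby -> JSON-LD -> git"). Both helper names are placeholders, not existing code: stage 1 corresponds to the tabby-to-JSON-LD conversion that is already planned, stage 2 to a generic, format-agnostic dataset generator.

```python
from pathlib import Path


def tabby_to_jsonld(tabby_root: Path) -> dict:
    """Stage 1: load a tabby record and expand it into a JSON-LD document.

    Hypothetical stub; this is the conversion datalad-tabby already plans
    to provide.
    """
    raise NotImplementedError


def jsonld_to_dataset(record: dict, dataset_path: Path) -> None:
    """Stage 2: materialize a git/datalad dataset from a JSON-LD document.

    Hypothetical stub; this part is format-agnostic and not tabby-specific.
    """
    raise NotImplementedError


def deserialize(tabby_root: Path, dataset_path: Path) -> None:
    # composing the two stages yields the tabby -> dataset mapping;
    # the reverse functions would compose the same way for serialization
    jsonld_to_dataset(tabby_to_jsonld(tabby_root), dataset_path)
```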

I would suggest doing the following two small implementations to map out the problem space (an alternative would be to implement JSON-LD <-> dataset operations, with the idea that the tabby <-> JSON-LD conversion is done in parallel):

  • implement deserialization, i.e. tabby-record to dataset
  • implement serialization, i.e. dataset to tabby-record

I expect this issue to become quite active in the near future. Let the discussion begin! ;-)

christian-monch commented

I am currently experimenting with serialization and deserialization in this branch: https://github.com/christian-monch/datalad-tabby/tree/serialize.

The branch adds two Python scripts (not yet datalad commands):

  • datalad_tabby/commands/tabby-serlalize.py
  • datalad_tabby/commands/tabby-deserialize.py

While tabby-serlalize.py is working, tabby-deserialize.py is still under construction.

Both are meant to become datalad commands, maybe delivered with the tabby extension. They use a "dataset_iterator" that is currently borrowed from metalad, but should ultimately go into datalad_next (a sketch of the idea follows).
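
For illustration, a minimal sketch of what such a dataset iterator could look like, assuming it only needs to enumerate the entries of a committed tree via `git ls-tree` (the metalad implementation is more elaborate):

```python
import subprocess
from pathlib import Path


def dataset_iterator(repo: Path, treeish: str = 'HEAD'):
    """Yield one record per entry in the given tree."""
    out = subprocess.run(
        ['git', '-C', str(repo), 'ls-tree', '-r', '--long', treeish],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        meta, path = line.split('\t', 1)
        mode, otype, gitsha, size = meta.split()
        yield {
            'path': path,
            'type': otype,  # 'blob' for files, 'commit' for submodules
            'gitsha': gitsha,
            'bytesize': None if size == '-' else int(size),
        }
```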

mih commented Jul 29, 2023

Leaving a few notes on the developments that happened between when this issue was last updated and now.

TL;DR: We should minimize the tabby-specific aspects in any implementation of the functionalities outlined above.

Re deserialization (from a metadata record to a repo):

  • this is essentially "Implement tabby-clone" (#4), and the same reasons for not doing it tabby-specifically apply here
  • we already have means to go from a plain metadata record to a repository (e.g. https://github.com/datalad/datalad-ebrains/blob/main/datalad_ebrains/fairgraph_query.py), and having another implementation for a specific metadata format/terminology is wasteful
  • mih/datalad-mihextras@59ea085 shows a viable alternative: homogenize/standardize/normalize a metadata record on a dataset, and feed it to a dataset generator that understands that one normalized document structure
  • IOW: we rely on JSON-LD to yield a normalized (JSON) data structure, and have generic code that makes the git/datalad calls based on it. This structure is likely a dict (per dataset version) with dataset metadata, where one property contains a list of the file (versions) contained in the dataset (version); see the sketch after this list
  • an implementation of that should live in a place where the pyld dependency is not needlessly heavy. metalad would make sense, but it is presently a heavy drag re dependencies (it unconditionally pulls in datalad-deprecated, click, etc.)
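
As a sketch of those last two points: generic code that consumes one such normalized dict and makes plain git calls. All key names ('files', 'path', 'description') are assumptions for illustration, not a settled schema.

```python
import subprocess
from pathlib import Path


def build_dataset(record: dict, path: Path) -> None:
    """Create a git repository from one normalized metadata record."""
    path.mkdir(parents=True, exist_ok=True)
    subprocess.run(['git', 'init', str(path)], check=True)
    for f in record.get('files', []):
        target = path / f['path']
        target.parent.mkdir(parents=True, exist_ok=True)
        # placeholder content; a real implementation would obtain it from
        # a recorded location, or register it via a special remote
        target.touch()
    subprocess.run(['git', '-C', str(path), 'add', '-A'], check=True)
    subprocess.run(
        ['git', '-C', str(path), 'commit', '-m',
         record.get('description', 'Import from metadata record')],
        check=True,
    )
```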

Re serialization (from a repo to a metadata record):

  • this is pretty similar to DEserialization (from a what-functionality-is-needed viewpoint)
  • we already have the concept of metadata extraction (provided by metalad), which produces metadata documents from a given repository. Some aim to be JSON-LD (but mostly are not), some are just structured data.
  • a tabby metadata extractor must be available and must provide actual JSON-LD
  • with "Provide guidelines on use of metadata record identifiers" (datalad/datalad-metalad#389) resolved, it would be almost trivial to consolidate multi-extractor output into a single metadata document on a dataset (incl. file records)
  • a consolidated document can be subjected to homogenization/standardization/normalization
  • a normalized record can be broken down into a tabby format by framing it appropriately, using a defined tabby convention for terminology (this could be a dedicated convention for this very purpose, as long as it is shipped with datalad-tabby); a framing sketch follows this list
  • there is no need to write out TSV files, tabby supports JSON "tables" now too
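
For the framing step, a minimal sketch using pyld; the context and terms are placeholders, not the actual tabby convention:

```python
from pyld import jsonld

# stand-in for a normalized record, e.g. consolidated extractor output
normalized = {
    "@context": {"schema": "https://schema.org/"},
    "@graph": [
        {"@type": "schema:Dataset", "schema:name": "example-dataset"},
    ],
}

# a frame selecting dataset nodes in the shape a tabby record would need
frame = {
    "@context": {"schema": "https://schema.org/"},
    "@type": "schema:Dataset",
}

framed = jsonld.frame(normalized, frame)
print(framed)
```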

What is tabby-specific about this issue?

  • There should be a tabby convention to be used for serialization. The existing tby-ds1 convention could be amended/extended for that purpose
  • There should be a term normalization map available for use with deserialization that recognizes the terminology promoted by tabby
  • there should be a tabby metadata extractor
  • there should be code to convert a JSON-LD document (e.g. produced by tabby_load()) back into an on-disk tabby record. This could be as simple as a dump into a dataset.json (and maybe a separate dataset.ctx.jsonld), or a more intelligent split into more fine-grained tables; the latter is not a requirement at all (see the sketch below)
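
A sketch of that simple dump, assuming the document carries its context inline (the helper name is hypothetical):

```python
import json
from pathlib import Path


def dump_tabby_record(doc: dict, outdir: Path) -> None:
    """Write a JSON-LD document as dataset.json plus a context sidecar."""
    outdir.mkdir(parents=True, exist_ok=True)
    ctx = doc.pop('@context', None)
    (outdir / 'dataset.json').write_text(json.dumps(doc, indent=2))
    if ctx is not None:
        (outdir / 'dataset.ctx.jsonld').write_text(json.dumps(ctx, indent=2))
```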
