Dataset serialization to tabby-records, and deserialization of tabby-records to datasets #48
Comments
I am currently experimenting with serialization and deserialization in this branch: https://github.com/christian-monch/datalad-tabby/tree/serialize. The branch adds two Python scripts, not yet DataLad commands.
Both are meant to become DataLad commands, possibly delivered with the tabby extension. They use a "dataset_iterator" that is currently borrowed from metalad, but should ultimately go into datalad_next.
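To make the role of the dataset iterator concrete, here is a minimal sketch of what such an iterator could look like. This is not metalad's actual implementation; the `FileRecord` type and `iter_dataset` function are hypothetical stand-ins that simply walk a worktree and skip `.git`:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class FileRecord:
    """One file in a dataset worktree (hypothetical record type)."""
    path: str   # path relative to the dataset root
    size: int   # size in bytes


def iter_dataset(root: Path) -> Iterator[FileRecord]:
    """Yield one record per file under ``root``, skipping ``.git``.

    A simplified stand-in for the dataset iterator borrowed from metalad;
    the real iterator would consult git/git-annex rather than the filesystem.
    """
    for p in sorted(root.rglob("*")):
        if p.is_file() and ".git" not in p.parts:
            yield FileRecord(str(p.relative_to(root)), p.stat().st_size)
```

A serializer built on top of this would consume the records and emit tabby rows; a deserializer would do the inverse.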
Leaving a few notes after the developments that happened between when this issue was last updated and now. TL;DR: We should minimize the …

Re deserialization (from a metadata record to a repo): …

Re serialization (from a repo to a metadata record): …
We would like to add the feature of serializing a dataset version, i.e. a git commit, to a tabby-record, and of deserializing a tabby-record into a dataset. This issue is meant as a discussion hub to track open questions and answers, and to link to implementation issues.
There are a number of open questions w.r.t. serialization (in no particular order):
1.1. Metadata could be stored in a location inside the created dataset, e.g. in `.git/datalad/tabby/dataset-metadata.json`. That would lead to different dataset versions if the metadata changes.

1.2. Metadata could be ignored, but then deserialized datasets would not lead to correct tabby-records because mandatory fields could not be filled automatically.
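Option 1.1 amounts to a pair of read/write helpers for a well-known path inside the dataset. A minimal sketch, assuming the `.git/datalad/tabby/dataset-metadata.json` location from above (the function names are hypothetical, not existing datalad API):

```python
import json
from pathlib import Path

# Location proposed in option 1.1 above
METADATA_RELPATH = Path(".git/datalad/tabby/dataset-metadata.json")


def store_dataset_metadata(dataset_root: Path, metadata: dict) -> Path:
    """Write dataset-level tabby metadata into the dataset (hypothetical helper)."""
    target = dataset_root / METADATA_RELPATH
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metadata, indent=2))
    return target


def load_dataset_metadata(dataset_root: Path) -> dict:
    """Read the stored metadata back, e.g. during re-serialization."""
    return json.loads((dataset_root / METADATA_RELPATH).read_text())
```

Note that content under `.git/` is not itself version-controlled, so whether a metadata change actually produces a new dataset version depends on where exactly the file is placed.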
2.1. Since we are already converting (planning to convert) tabby-records to JSON-LD structures, it seems to make sense to base the final solution on that model. Especially because it could allow to expose information about the datasets in multiple other contexts, i.e. in external search engines. (The implementation could still start with a direct mapping from datalad-tabby to datasets and vice-versa)
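The tabby-to-JSON-LD direction in 2.1 could start as small as attaching an `@context` to a flat record. The context mapping below is a hypothetical, minimal example (the real mapping would be defined by the tabby conventions), but it shows the shape of the conversion:

```python
# Hypothetical, minimal @context; the real vocabulary mapping is
# defined by the tabby conventions, not by this sketch.
TABBY_CONTEXT = {
    "name": "https://schema.org/name",
    "description": "https://schema.org/description",
}


def tabby_record_to_jsonld(record: dict) -> dict:
    """Wrap a flat tabby dataset record into a JSON-LD document.

    Keys without a context mapping are dropped here; a real converter
    would have to decide how to handle them.
    """
    doc = {"@context": TABBY_CONTEXT, "@type": "https://schema.org/Dataset"}
    doc.update({k: v for k, v in record.items() if k in TABBY_CONTEXT})
    return doc
```

Exposing such documents would let external search engines pick up the dataset descriptions, as noted above.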
3.1. Content name and content location could be stored in the serialized form, and deserialization could retrieve the content. That would require extended records in the tabby-file-record, i.e. a column with "location" could be added. If JSON-LD is used, the respective columns also have to exist in the JSON-LD documents.
3.2. Content could just be referenced, and a custom git remote helper could be used to access it.
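The extended file record from 3.1 is essentially one extra column in the tab-separated file table. A sketch of what parsing such rows could look like; the column names here are assumptions for illustration, not the actual tabby column set:

```python
import csv
import io

# Hypothetical column set: standard file-record columns plus the
# "location" column proposed in 3.1.
FILE_COLUMNS = ["path", "size", "checksum", "location"]


def parse_file_records(tsv_text: str) -> list:
    """Parse extended tabby file records from tab-separated text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=FILE_COLUMNS, delimiter="\t"
    )
    return [dict(row) for row in reader]
```

During deserialization, the `location` value of each row would be handed to a downloader (or to the remote helper from 3.2) to obtain the actual content.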
4.1. It would be useful to support annexed content in order to work with the large number of datasets that use git-annex, especially DataLad datasets. That would require extended records in the tabby-file-record and their representation in an intermediary JSON-LD format.
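For 4.1, if the extended file record carries a size and an MD5 checksum, a git-annex key can be derived from it directly, since MD5E keys have the documented form `MD5E-s<size>--<md5><extension>`. A sketch, assuming those two fields are part of the extended record:

```python
from pathlib import PurePosixPath


def annex_key_from_record(path: str, size: int, md5: str) -> str:
    """Build a git-annex MD5E key from extended tabby file-record fields.

    Assumes the extended record provides the file size in bytes and an
    MD5 checksum; MD5E keys keep the file extension so that extension-
    sensitive tools keep working.
    """
    ext = PurePosixPath(path).suffix
    return f"MD5E-s{size}--{md5}{ext}"
```

A deserializer could then register such keys with git-annex and record the `location` URL as a web source for each key.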
I would suggest the following small implementations to map out the problem space (an alternative would be to implement JSON-LD <-> dataset operations, with the idea that tabby <-> JSON-LD conversion is done in parallel):
I expect this issue to become quite active in the near future. Let the discussion begin! ;-)