Dedicated instructions for DataLad (metalad) users #23
Do you mean e.g. automating the process of generating tabby tables from metalad-extracted metadata?
A tabby metadata record is meant to be a comprehensive description of a dataset (even comprehensive enough to actually generate a DataLad dataset from it). This means that it should be possible to generate a tabby record from an actual DataLad dataset. Some of the necessary information for such a meaningful record will need to come from external sources (e.g. human input), such as bibliographic info, description, etc. But some information can be readily provided by DataLad, such as a file(-version) table (

What would this look like in practice? Which pieces are needed, how do they connect, what would a user need to do in order to employ this functionality?
I'm assuming:
Let's say there's a generic process describing "how to serialize a datalad dataset". It could involve the following:
I think these are the questions, indeed. This use case is interesting, because it is very similar to the catalog ingestion use case -- but even more thorough in that we are aiming for a full serialization of a dataset version. This could (but does not have to) involve a merge and homogenization across multiple metadata sources (like the catalog use case).
As per yesterday's discussion (outside any issue) it makes sense to remove metalad from the dependency chain. It offers no support for homogenization, and the storage target here is a specific file format. The only thing to be done is to pull the code for the
As @jsheunis wrote: one option to traverse datasets and subdatasets would be to use the dataset-traverser module from metalad. That could be done in a simple script, or with a trivial pipeline definition that just contains a
The output would be a JSON record for the top-level dataset and each file in it. With an additional argument, subdatasets and their files would also be emitted.
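For illustration, a minimal sketch of producing such per-file JSON records, using plain datalad status from the Python API rather than the metalad dataset-traverser; the selected keys mirror the record examples further down in this thread, and which of them are actually present depends on the installed DataLad version:

import json

import datalad.api as dl


def emit_file_records(dataset_path, recursive=True):
    # Simplified stand-in for the dataset-traverser: one JSON record per file.
    # Key names follow `datalad status` result records; "bytesize"/"key" are
    # only reported for annexed content when annex reporting is enabled.
    for res in dl.status(
            dataset=dataset_path,
            annex="basic",              # include annex info such as key and bytesize
            recursive=recursive,        # also descend into installed subdatasets
            result_renderer="disabled",
            return_type="generator"):
        if res.get("type") != "file":
            continue
        print(json.dumps({k: res.get(k) for k in (
            "path", "type", "state", "gitshasum", "bytesize", "key")},
            ensure_ascii=False))


if __name__ == "__main__":
    emit_file_records(".")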
W.r.t. special commands (as suggested by @mih): a few weeks ago, I created a small iterator-command script (55 lines, https://github.com/christian-monch/datalad-metalad/blob/test-iterator-2/iter_test/iterate_dataset.py) that uses the

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "annexed": true,
  "has_content": true,
  "bytesize": 368640,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

That could be used as input for further processing.
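As a toy example of such further processing (assuming the script emits one JSON object per line rather than the pretty-printed form shown above), the records could be aggregated per dataset version:

import json
import sys
from collections import defaultdict

# Read the iterator output from stdin and summarize it per dataset version.
# Record fields ("type", "dataset_id", "dataset_version", "bytesize") are
# taken from the examples in this thread.
per_dataset = defaultdict(lambda: {"files": 0, "bytes": 0})

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    if record.get("type") != "file":
        continue
    key = (record["dataset_id"], record["dataset_version"])
    per_dataset[key]["files"] += 1
    per_dataset[key]["bytes"] += record.get("bytesize", 0)

for (ds_id, ds_version), stats in per_dataset.items():
    print(f"{ds_id}@{ds_version}: {stats['files']} files, {stats['bytes']} bytes")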
In addition, IIUC, to recreate a dataset with an annex, we probably need to report at least one location for an annexed file. That is, if we want to recreate a "shallow" version of the dataset, i.e. annexed files are just referenced, but not fetched. Or did I misunderstand the intention?
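For reference, the location information for an annexed file can be queried from git-annex directly; a small sketch, assuming the current git annex whereis --json output layout (a "whereis" list with per-remote "uuid" entries), which may differ across git-annex versions:

import json
import subprocess


def annex_locations(repo_path, file_path):
    # Ask git-annex which remotes hold the content of `file_path` and return
    # their UUIDs (comparable to the "locations" field in the records below).
    out = subprocess.run(
        ["git", "-C", repo_path, "annex", "whereis", "--json", file_path],
        capture_output=True, text=True, check=True,
    ).stdout
    record = json.loads(out.splitlines()[0])
    return [remote["uuid"] for remote in record.get("whereis", [])]


# e.g. annex_locations("/home/cristian/datalad/longnow-podcasts", "a öäx.bin")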
I updated the traversal code to support annex locations. The output of the iter-command that I mentioned above for annexed files is now something like:

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "bytesize": 368640,
  "annexed": true,
  "key": "MD5E-s368640--4946a4cc7b130d210b4b6625d7a343aa.bin",
  "locations": [
    "bd689d8a-2c11-4a3d-80ef-13e41335f9f4"
  ],
  "has_content": true,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

For un-annexed files it is something like:

{
  "type": "file",
  "gitshasum": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a b öäc",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a b öäc",
  "bytesize": 0,
  "annexed": false,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}
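To sketch how such records could drive a "shallow" recreation: assuming one JSON record per line and an already initialized git-annex repository as the target (and that siblings/remotes matching the location UUIDs get configured separately), annexed files could be re-registered by key without fetching any content:

import json
import subprocess


def shallow_restore(records_file, repo_path):
    # Re-create annexed files from the records; content stays un-fetched.
    with open(records_file, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("type") != "file" or not rec.get("annexed"):
                continue
            path, key = rec["intra_dataset_path"], rec["key"]
            # link the path to its annex key; --force skips the check that
            # the key's content is locally present
            subprocess.run(
                ["git", "-C", repo_path, "annex", "fromkey", "--force", key, path],
                check=True)
            # record that the listed remotes (by UUID) hold the content
            for uuid in rec.get("locations", []):
                subprocess.run(
                    ["git", "-C", repo_path, "annex", "setpresentkey", key, uuid, "1"],
                    check=True)


# shallow_restore("records.jsonl", "/tmp/recreated-dataset")  # hypothetical paths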
This issue is now tracked as one aspect of dataset descriptions with
Some parts of the specification can be addressed by automated procedures. This is in particular the files table (see also #20). DataLad users will want to know what exactly they need to do when looking into adopting tabby, such that they can avoid duplicate efforts and optimally benefit from their established DataLad workflows.
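As one possible starting point for automating the files table, a minimal sketch that derives per-file path/blob/size information from plain git plumbing; the column names here are placeholders rather than the tabby specification, and annexed files would additionally need key/size information from git-annex (as in the records above):

import csv
import subprocess
import sys


def file_version_rows(repo_path, revision="HEAD"):
    # Yield one (path, blob-sha, size) tuple per file in the given revision.
    # Note: for annexed files this reports the pointer/symlink, not the content.
    out = subprocess.run(
        ["git", "-C", repo_path, "ls-tree", "-r", "--long", revision],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        meta, path = line.split("\t", 1)      # "<mode> <type> <sha> <size>\t<path>"
        _mode, _otype, sha, size = meta.split()
        yield path, sha, size


if __name__ == "__main__":
    writer = csv.writer(sys.stdout, delimiter="\t")
    writer.writerow(["path", "gitshasum", "bytesize"])   # placeholder header
    for row in file_version_rows(sys.argv[1] if len(sys.argv) > 1 else "."):
        writer.writerow(row)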