Dedicated instructions for DataLad (metalad) users #23

Closed
Tracked by #102
mih opened this issue Jul 3, 2023 · 10 comments

@mih
Contributor

mih commented Jul 3, 2023

Some parts of the specification can be addressed by automated procedures. This is in particular the case for the files table (see also #20). DataLad users will want to know exactly what they need to do when looking into adopting tabby, so that they can avoid duplicate effort and optimally benefit from their established DataLad workflows.

@jsheunis
Contributor

jsheunis commented Jul 3, 2023

Do you mean e.g. automating the process of generating tabby tables from metalad-extracted metadata?

@mih
Contributor Author

mih commented Jul 3, 2023

A tabby metadata record is meant to be a comprehensive description of a dataset (even comprehensive enough to actually generate a DataLad dataset from it). This means that it should be possible to generate a tabby record from an actual DataLad dataset.

Some of the necessary information for such a meaningful record will need to come from external sources (e.g. human input), such as bibliographic info, description, etc.

Some information, however, can be readily provided by DataLad, such as a file(-version) table (path, checksum, size, url(s)), but also a dataset identifier, a list of "parts" (i.e., subdatasets), provenance information, etc.
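
For illustration, a minimal sketch of writing such a file table as a TSV; the column names, example path, and URL are placeholders, not the actual tabby spec:

import csv

# In practice these records would come from DataLad (e.g. an iterator over
# the git/git-annex worktree, or metalad); hard-coded here for illustration.
records = [
    {
        "path": "data/file_1.txt",
        "checksum": "2a3a0d4267e48c789be1ea98784b535a",
        "size": 20226,
        "url": "https://example.org/store/file_1.txt",
    },
]

with open("files.tsv", "w", newline="") as tsv:
    writer = csv.DictWriter(
        tsv, fieldnames=["path", "checksum", "size", "url"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(records)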

What would this look like in practice? Which pieces are needed, how do they connect, and what would a user need to do in order to employ this functionality?

@jsheunis
Contributor

jsheunis commented Jul 3, 2023

I'm assuming:

  • datalad-metalad functionality remains a constant for the purpose of this description, and that its components usefully relate to steps required in this process
  • the precise tabby spec is still in flux regarding definitions of properties of datasets and files, although some would be a given

Let's say there's a generic process describing "how to serialize a datalad dataset". It could involve the following:

  1. Traverse the whole dataset tree in order to collect information on the dataset itself and all files:
    • this could be done with an iterator from datalad-next (e.g. gitworktree), or by metalad's meta-conduct functionality (see the sketch after this list)
  2. Collect information on the dataset:
    • the metalad_core extractor already reports a JSON-LD record with useful properties like the dataset id (field @id of the graph item with @type == Dataset), version (field identifier of the graph item with @type == Dataset), subdatasets (field hasPart of the graph item with @type == Dataset), and contributing agents
  3. Collect information on the files:
    • if using a datalad-next iterator, some information relevant for files might already be provided (I'm not familiar enough with the options and outputs at the moment)
    • if using metalad and the metalad_core extractor, it currently reports:
      • path (relative to root of parent dataset)
      • dataset_id
      • dataset_version
      • what I think is the annex key? e.g. "@id": "datalad:MD5E-s20226--2a3a0d4267e48c789be1ea98784b535a.txt"
      • md5 checksum inferred from the key
      • contentbytesize
      • availability, under distribution: [{"url": ""}]
  4. Figure out what to do with collected information:
    • store in dataset metadata store using metalad?
    • write to disk?
    • keep in memory
  5. All of the collected info would have to be transformed into the tabby-relevant tables and properties (likely first into JSON-LD). This implies some sort of datalad/metalad-metadata-to-tabby translation step, which is yet to be implemented and depends on:
    • where to grab the metadata from
    • what format the metadata is in (was it generated by metalad / some iterator / another process?)
    • a mapping to the exact tabby spec
  6. Transformation of the resulting JSON-LD to the native tabby file format, with associated context and frame documents.
  7. Updates with additional information not included in datalad datasets. This would be fields like name, keywords, description, and whatever else. It could take two approaches:
    • users should include this information in a minimum-spec (to be defined) metadata file and save it as part of the dataset, so that it can be extracted as part of step 2 above and then processed as part of the rest of the pipeline. The benefit of this option is that it can be automated and validated, so that the resulting tabby documents are guaranteed to be valid.
    • users can hand-edit the output tabby files to add the required fields, according to some description of what these fields are.
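
To make this more concrete, below is a rough, hedged sketch of steps 1 and 3, with plain git used as a stand-in for the datalad-next gitworktree iterator or metalad's traverser. Everything in it is illustrative; the field names follow this discussion rather than the tabby spec, and unlocked annexed files are not handled:

# Rough stand-in for steps 1 and 3, using plain `git ls-tree` instead of the
# datalad-next gitworktree iterator or metalad's DatasetTraverser. All field
# names are illustrative; only locked (symlinked) annexed files are handled.
import os
import re
import subprocess
from pathlib import Path

KEY_RE = re.compile(r"MD5E?-s(\d+)--([0-9a-f]{32})")

def iter_file_records(dataset_root):
    """Yield one dict per committed file in the dataset worktree."""
    root = Path(dataset_root)
    lines = subprocess.run(
        ["git", "ls-tree", "-r", "-l", "HEAD"],
        cwd=root, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for line in lines:
        meta, path = line.split("\t", 1)
        mode, otype, sha, size = meta.split()
        if otype != "blob":
            # submodules (subdatasets) show up as 'commit' entries
            continue
        rec = {"path": path, "gitmode": mode, "gitshasum": sha}
        target = root / path
        if target.is_symlink():
            # locked annexed file: the symlink target contains the annex key,
            # which encodes byte size and MD5 checksum, e.g.
            # MD5E-s20226--2a3a0d4267e48c789be1ea98784b535a.txt
            match = KEY_RE.search(os.readlink(target))
            if match:
                rec.update(annexed=True, bytesize=int(match.group(1)),
                           md5=match.group(2))
        else:
            rec.update(annexed=False, bytesize=int(size))
        yield rec

if __name__ == "__main__":
    for record in iter_file_records("."):
        print(record)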

@mih
Contributor Author

mih commented Jul 3, 2023

I think these are the questions, indeed.

This use case is interesting, because it is very similar to the catalog ingestion use case -- but even more thorough in that we are aiming for a full serialization of a dataset version.

This could (but does not have to) involve a merge and homogenization across multiple metadata sources (like the catalog use case).

@mih
Contributor Author

mih commented Jul 4, 2023

As per yesterday's discussion (outside any issue), it makes sense to remove metalad from the dependency chain. It offers no support for homogenization, and the storage target here is a specific file format.

The only thing to be done is to pull the code for the metalad_core extractor and expose it as a command (datalad/datalad-metalad#184)

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

As @jsheunis wrote: one option to traverse datasets and subdatasets would be to use the dataset-traverser module from metalad. That could be done in a simple script, or with a trivial pipeline definition that just contains a DatasetTraverser as provider. That could then be called via:

datalad -f json meta-conduct traverse_dataset traverser.top_level_dir=/home/cristian/datalad/longnow-podcasts traverser.item_type=both

The output would be a JSON record for the top-level dataset and each file in it. With an additional argument, subdatasets and their files would also be emitted.
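
A minimal sketch of consuming that output from a script (this assumes one JSON record per output line when using -f json; the exact record layout depends on the configured pipeline):

# Minimal sketch: run the traversal shown above and collect the emitted
# records. Assumes `datalad -f json` prints one JSON object per result line;
# the record layout depends on the pipeline, so no fields are interpreted.
import json
import subprocess

cmd = [
    "datalad", "-f", "json", "meta-conduct", "traverse_dataset",
    "traverser.top_level_dir=/home/cristian/datalad/longnow-podcasts",
    "traverser.item_type=both",
]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)

records = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
print(f"collected {len(records)} records")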

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

W.r.t. special commands (as suggested by @mih): a few weeks ago, I created a small iterator-command script (55 lines, https://github.com/christian-monch/datalad-metalad/blob/test-iterator-2/iter_test/iterate_dataset.py) that uses the DatasetTraverser class to traverse datasets and outputs a simplified JSON object for each file/dataset. It looks something like this (for a file):

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "annexed": true,
  "has_content": true,
  "bytesize": 368640,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

That could be used as input for further processing.
One attribute that is still missing for faithfully recreating a dataset is the file type, i.e. the git object mode.
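
For what it's worth, the object mode could also be looked up separately with plain git (a sketch, not something the iterator currently reports):

# Sketch: look up the git object mode (file type) for a path with plain git,
# since the iterator output above does not report it.
import subprocess

def git_object_mode(dataset_root, intra_dataset_path):
    out = subprocess.run(
        ["git", "ls-tree", "HEAD", "--", intra_dataset_path],
        cwd=dataset_root, capture_output=True, text=True, check=True,
    ).stdout
    # ls-tree lines look like: "<mode> <type> <sha>\t<path>"
    return out.split()[0] if out else None

# e.g. "100644" (regular file), "100755" (executable), "120000" (symlink)
print(git_object_mode("/home/cristian/datalad/longnow-podcasts", "a öäx.bin"))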

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

In addition, IIUC, to recreate a dataset with an annex, we probably need to report at least one location for an annexed file.

That is, if we want to recreate a "shallow" version of the dataset, i.e. annexed files are just referenced, but not fetched. Or did I misunderstand the intention?

@christian-monch
Contributor

I updated the traversal code to support annex locations. The output of the iterator command mentioned above is now, for annexed files, something like:

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "bytesize": 368640,
  "annexed": true,
  "key": "MD5E-s368640--4946a4cc7b130d210b4b6625d7a343aa.bin",
  "locations": [
    "bd689d8a-2c11-4a3d-80ef-13e41335f9f4"
  ],
  "has_content": true,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

For un-annexed files it is something like:

{
  "type": "file",
  "gitshasum": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a b öäc",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a b öäc",
  "bytesize": 0,
  "annexed": false,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}
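
To illustrate the "shallow" re-creation mentioned above, here is a hedged sketch of how a single annexed file could be re-established from such a record. It relies on git-annex's fromkey plumbing command and leaves mapping the location UUIDs to actual remotes/URLs unaddressed:

# Sketch: "shallow" re-creation of an annexed file from a record like the one
# above, i.e. linking the path to its key without fetching content. Assumes a
# git repository with an initialized annex in the current directory; the
# location UUIDs would still need to be mapped to reachable remotes.
import subprocess

record = {
    "intra_dataset_path": "a öäx.bin",
    "key": "MD5E-s368640--4946a4cc7b130d210b4b6625d7a343aa.bin",
}

# `git annex fromkey --force` sets up the file to point at the key even
# though the content is not present locally
subprocess.run(
    ["git", "annex", "fromkey", "--force",
     record["key"], record["intra_dataset_path"]],
    check=True,
)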

@mih
Contributor Author

mih commented Jul 27, 2023

This issue is now tracked as one aspect of dataset descriptions with tabby in #102

@mih mih closed this as completed Jul 27, 2023