Dedicated instructions for DataLad (metalad) users #23

Closed
Tracked by #102
mih opened this issue Jul 3, 2023 · 10 comments

@mih
Contributor

mih commented Jul 3, 2023

Some parts of the specification can be addressed by automated procedures. This is in particular the case for the files table (see also #20). DataLad users will want to know exactly what they need to do when looking into adopting tabby, so that they can avoid duplicate effort and optimally benefit from their established DataLad workflows.

@jsheunis
Contributor

jsheunis commented Jul 3, 2023

Do you mean e.g. automating the process of generating tabby tables from metalad-extracted metadata?

@mih
Contributor Author

mih commented Jul 3, 2023

A tabby metadata record is meant to be a comprehensive description of a dataset (even comprehensive enough to actually generate a DataLad dataset from it). This means that it should be possible to generate a tabby record from an actual DataLad dataset.

Some of the necessary information for such a meaningful record will need to come from external sources (e.g. human input), such as bibliographic info, description, etc.

Some information, however, can be readily provided by DataLad, such as a file(-version) table (path, checksum, size, url(s)), but also a dataset identifier, a list of "parts" (i.e., subdatasets), provenance information, etc.
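
For illustration, a minimal sketch of writing such a file table as a TSV; the column names, example path, and URL are placeholders, not the actual tabby spec:

import csv

# In practice these records would come from DataLad (e.g. an iterator over
# the git/git-annex worktree, or metalad); hard-coded here for illustration.
records = [
    {
        "path": "data/file_1.txt",
        "checksum": "2a3a0d4267e48c789be1ea98784b535a",
        "size": 20226,
        "url": "https://example.org/store/file_1.txt",
    },
]

with open("files.tsv", "w", newline="") as tsv:
    writer = csv.DictWriter(
        tsv, fieldnames=["path", "checksum", "size", "url"], delimiter="\t"
    )
    writer.writeheader()
    writer.writerows(records)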

What would this look like in practice? Which pieces are needed, how do they connect, and what would a user need to do in order to employ this functionality?

@jsheunis
Contributor

jsheunis commented Jul 3, 2023

I'm assuming:

  • datalad-metalad functionality remains a constant for the purpose of this description, and that its components usefully relate to steps required in this process
  • the precise tabby spec is still in flux regarding definitions of properties of datasets and files, although some would be a given

Let's say there's a generic process describing "how to serialize a datalad dataset". It could involve the following:

  1. Traverse the whole dataset tree in order to collect information on the dataset itself and all files:
    • this could be done with an iterator from datalad-next (e.g. gitworktree), or by metalad's meta-conduct functionality (see the sketch after this list)
  2. Collect information on the dataset:
    • the metalad_core extractor already reports a JSON-LD record with useful properties like the dataset id (field @id of the graph item with @type == Dataset), version (field identifier of the graph item with @type == Dataset), subdatasets (field hasPart of the graph item with @type == Dataset), and contributing agents
  3. Collect information on the files:
    • if using a datalad-next iterator, some information relevant for files might already be provided (I'm not familiar enough with the options and outputs at the moment)
    • if using metalad and the metalad_core extractor, it currently reports:
      • path (relative to root of parent dataset)
      • dataset_id
      • dataset_version
      • what I think is the annex key? e.g. "@id": "datalad:MD5E-s20226--2a3a0d4267e48c789be1ea98784b535a.txt"
      • md5 checksum inferred from the key
      • contentbytesize
      • availability, under distribution: [{"url": ""}]
  4. Figure out what to do with collected information:
    • store in dataset metadata store using metalad?
    • write to disk?
    • keep in memory
  5. All of the collected info would have to be transformed into the tabby-relevant tables and properties (likely first into JSON-LD). This implies some sort of datalad/metalad-metadata-to-tabby translation step, which is yet to be implemented and depends on:
    • where to grab the metadata from
    • what format the metadata is in (was it generated by metalad / some iterator / another process?)
    • a mapping to the exact tabby spec
  6. Transformation of the resulting JSON-LD to the native tabby file format, with associated context and frame documents.
  7. Updates with additional information not included in datalad datasets. This would be fields like name, keywords, description, and whatever else. It could take two approaches:
    • users should include this information in a minimum-spec (to be defined) metadata file and save it as part of the dataset, so that it can be extracted as part of step 2 above and then processed as part of the rest of the pipeline. The benefit of this option is that it can be automated and validated, so that the resulting tabby documents are guaranteed to be valid.
    • users can hand-edit the output tabby files to add the required fields, according to some description of what these fields are.
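
To make this more concrete, below is a rough, hedged sketch of steps 1 and 3, with plain git used as a stand-in for the datalad-next gitworktree iterator or metalad's traverser. Everything in it is illustrative; the field names follow this discussion rather than the tabby spec, and unlocked annexed files are not handled:

# Rough stand-in for steps 1 and 3, using plain `git ls-tree` instead of the
# datalad-next gitworktree iterator or metalad's DatasetTraverser. All field
# names are illustrative; only locked (symlinked) annexed files are handled.
import os
import re
import subprocess
from pathlib import Path

KEY_RE = re.compile(r"MD5E?-s(\d+)--([0-9a-f]{32})")

def iter_file_records(dataset_root):
    """Yield one dict per committed file in the dataset worktree."""
    root = Path(dataset_root)
    lines = subprocess.run(
        ["git", "ls-tree", "-r", "-l", "HEAD"],
        cwd=root, capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    for line in lines:
        meta, path = line.split("\t", 1)
        mode, otype, sha, size = meta.split()
        if otype != "blob":
            # submodules (subdatasets) show up as 'commit' entries
            continue
        rec = {"path": path, "gitmode": mode, "gitshasum": sha}
        target = root / path
        if target.is_symlink():
            # locked annexed file: the symlink target contains the annex key,
            # which encodes byte size and MD5 checksum, e.g.
            # MD5E-s20226--2a3a0d4267e48c789be1ea98784b535a.txt
            match = KEY_RE.search(os.readlink(target))
            if match:
                rec.update(annexed=True, bytesize=int(match.group(1)),
                           md5=match.group(2))
        else:
            rec.update(annexed=False, bytesize=int(size))
        yield rec

if __name__ == "__main__":
    for record in iter_file_records("."):
        print(record)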

@mih
Contributor Author

mih commented Jul 3, 2023

I think these are the questions, indeed.

This use case is interesting, because it is very similar to the catalog ingestion use case -- but even more thorough in that we are aiming for a full serialization of a dataset version.

This could (but does not have to) involve a merge and homogenization across multiple metadata sources (like the catalog use case).

@mih
Contributor Author

mih commented Jul 4, 2023

As per yesterday's discussion (outside any issue), it makes sense to remove metalad from the dependency chain. It offers no support for homogenization, and the storage target here is a specific file format.

The only thing to be done is to pull the code for the metalad_core extractor and expose it as a command (datalad/datalad-metalad#184)

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

As @jsheunis wrote: one option to traverse datasets and subdatasets would be to use the dataset-traverser module from metalad. That could be done in a simple script, or with a trivial pipeline definition that just contains a DatasetTraverser as provider. That could then be called via:

datalad -f json meta-conduct traverse_dataset traverser.top_level_dir=/home/cristian/datalad/longnow-podcasts traverser.item_type=both

The output would be a JSON record for the top-level dataset and each file in it. With an additional argument, subdatasets and their files would also be emitted.
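
A minimal sketch of consuming that output from a script (this assumes one JSON record per output line when using -f json; the exact record layout depends on the configured pipeline):

# Minimal sketch: run the traversal shown above and collect the emitted
# records. Assumes `datalad -f json` prints one JSON object per result line;
# the record layout depends on the pipeline, so no fields are interpreted.
import json
import subprocess

cmd = [
    "datalad", "-f", "json", "meta-conduct", "traverse_dataset",
    "traverser.top_level_dir=/home/cristian/datalad/longnow-podcasts",
    "traverser.item_type=both",
]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)

records = [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
print(f"collected {len(records)} records")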

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

W.r.t. special commands (as suggested by @mih): a few weeks ago, I created a small iterator-command script (55 lines, https://github.com/christian-monch/datalad-metalad/blob/test-iterator-2/iter_test/iterate_dataset.py) that uses the DatasetTraverser class to traverse datasets and outputs a simplified JSON object for each file/dataset. It looks something like this (for a file):

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "annexed": true,
  "has_content": true,
  "bytesize": 368640,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

That could be used as input for further processing.
One attribute that is still missing for faithfully recreating a dataset is the file type, i.e. the git object mode.
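
For what it's worth, the object mode could also be looked up separately with plain git (a sketch, not something the iterator currently reports):

# Sketch: look up the git object mode (file type) for a path with plain git,
# since the iterator output above does not report it.
import subprocess

def git_object_mode(dataset_root, intra_dataset_path):
    out = subprocess.run(
        ["git", "ls-tree", "HEAD", "--", intra_dataset_path],
        cwd=dataset_root, capture_output=True, text=True, check=True,
    ).stdout
    # ls-tree lines look like: "<mode> <type> <sha>\t<path>"
    return out.split()[0] if out else None

# e.g. "100644" (regular file), "100755" (executable), "120000" (symlink)
print(git_object_mode("/home/cristian/datalad/longnow-podcasts", "a öäx.bin"))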

@christian-monch
Contributor

christian-monch commented Jul 4, 2023

In addition, IIUC, to recreate a dataset with an annex, we probably need to report at least one location for an annexed file.

That is, if we want to recreate a "shallow" version of the dataset, i.e. annexed files are just referenced, but not fetched. Or did I misunderstand the intention?

@christian-monch
Contributor

I updated the traversal code to support annex locations. The output of the iterator command mentioned above is now, for annexed files, something like:

{
  "type": "file",
  "gitshasum": "bc228b7fbb9b70f1529de529a21214709fd4b31f",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a öäx.bin",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a öäx.bin",
  "bytesize": 368640,
  "annexed": true,
  "key": "MD5E-s368640--4946a4cc7b130d210b4b6625d7a343aa.bin",
  "locations": [
    "bd689d8a-2c11-4a3d-80ef-13e41335f9f4"
  ],
  "has_content": true,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}

For un-annexed files it is something like:

{
  "type": "file",
  "gitshasum": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
  "state": "clean",
  "path": "/home/cristian/datalad/longnow-podcasts/a b öäc",
  "dataset_path": ".",
  "executable": false,
  "fs_base_path": "/home/cristian/datalad/longnow-podcasts",
  "intra_dataset_path": "a b öäc",
  "bytesize": 0,
  "annexed": false,
  "dataset_id": "b3ca2718-8901-11e8-99aa-a0369f7c647e",
  "dataset_version": "4eaeaef68249ce25b0b497a571ddedd2d19d0cae",
  "status": "ok"
}
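
To illustrate the "shallow" re-creation mentioned above, here is a hedged sketch of how a single annexed file could be re-established from such a record. It relies on git-annex's fromkey plumbing command and leaves mapping the location UUIDs to actual remotes/URLs unaddressed:

# Sketch: "shallow" re-creation of an annexed file from a record like the one
# above, i.e. linking the path to its key without fetching content. Assumes a
# git repository with an initialized annex in the current directory; the
# location UUIDs would still need to be mapped to reachable remotes.
import subprocess

record = {
    "intra_dataset_path": "a öäx.bin",
    "key": "MD5E-s368640--4946a4cc7b130d210b4b6625d7a343aa.bin",
}

# `git annex fromkey --force` sets up the file to point at the key even
# though the content is not present locally
subprocess.run(
    ["git", "annex", "fromkey", "--force",
     record["key"], record["intra_dataset_path"]],
    check=True,
)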

@mih
Contributor Author

mih commented Jul 27, 2023

This issue is now tracked as one aspect of dataset descriptions with tabby in #102

@mih mih closed this as completed Jul 27, 2023