First commit of a script to transform ISA-tab-like metadata to catalog #2

Open
wants to merge 9 commits into
base: main

@jsheunis jsheunis commented Jun 6, 2023

Instructions for running are located in README.md

The script takes a TSV file as input and transforms the provided fields and values into JSON output that should validate successfully against the catalog schema, i.e. ready for 'catalog add'.

The current implementation includes dataset metadata transformation and excludes file metadata transformation (todo).
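
As a rough illustration of the TSV-to-JSON step described above, here is a minimal Python sketch. It is not the script from this PR: it assumes a plain header-row TSV layout, the function name `tsv_to_records` is hypothetical, and the `name` column is invented for the example. It makes no attempt to produce output that validates against the catalog schema.

```python
import csv
import json


def tsv_to_records(lines):
    """Turn an iterable of TSV lines (header row first) into a list of dicts.

    Assumption: one record per row, column names taken from the header.
    Empty cells are dropped so they do not appear as empty JSON strings.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [{key: value for key, value in row.items() if value} for row in reader]


# Hypothetical demo input; only 'dataset_id' is a field name mentioned in this PR.
demo_lines = ["dataset_id\tname\n", "ds001\tDemo dataset\n"]
records = tsv_to_records(demo_lines)
print(json.dumps(records, indent=2))
```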

jsheunis and others added 9 commits June 7, 2023 00:12
The script reads from file(s) or stdin (default), and always writes to stdout.
This removes needless path-handling complexity and leaves that to the
caller.

Equivalent calls now:

```
% bin/tubby2catalog -t dataset < data/dataset_metadata.tsv

% bin/tubby2catalog -t dataset data/dataset_metadata.tsv
```
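
The file-or-stdin pattern behind these two calls can be sketched in Python roughly as follows. This is an assumption-laden sketch rather than the code in this PR: `parse_args` and `read_lines` are hypothetical names, and only the `-t` option and the optional positional file arguments are taken from the calls shown above.

```python
import argparse
import sys


def parse_args(argv):
    """Parse a command line with a required -t option and optional input files."""
    parser = argparse.ArgumentParser(prog="tubby2catalog")
    parser.add_argument("-t", dest="type", required=True)
    # Zero positional arguments means "read from stdin".
    parser.add_argument("files", nargs="*")
    return parser.parse_args(argv)


def read_lines(paths, stdin=sys.stdin):
    """Yield input lines from the given files, or from stdin if none are given."""
    if not paths:
        yield from stdin
        return
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            yield from handle


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    for line in read_lines(args.files):
        # Output always goes to stdout; redirecting it is left to the caller.
        sys.stdout.write(line)
```

Keeping all output on stdout means the caller decides where results go, for example by shell redirection, so the script itself needs no output-path option.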
In any unsupervised execution, log messages alone are simply not enough.
…operty categories

This is done by slightly changing the internal data structure for
`additional_display` items during aggregation.

Adjust demo data to show feature
This is needed for the catalog tooling/schema-compliance.

To make this flexible, introduce a formatting mechanism that can yield
a raw ID string as input for the UUID5 generation. It can be provided
via a new/simple config option. Example:

```
bin/tubby2catalog \
  -t dataset \
  -c 'dataset_id_fmt={additional_display[sfb1451][project]}{dataset_id}'\
  data/dataset_metadata.tsv
```

This configures the UUID5 to be generated from the concatenation of the
'sfb1451'-'project' custom property, plus the 'dataset_id' field.

To make this work, this uses an intermediate metadata representation
that is comprised of mostly dicts, and supports direct addressing of
most properties.
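
A minimal Python sketch of how such a format string could address a dict-based representation and feed UUID5 generation. This is an illustration under stated assumptions, not the PR's implementation: `make_dataset_uuid` is a hypothetical name, the sample values are invented, and the UUID namespace is not stated in this PR, so `uuid.NAMESPACE_DNS` is used purely as a placeholder.

```python
import uuid

# Assumption: the actual namespace used by the script is not given in this PR.
ID_NAMESPACE = uuid.NAMESPACE_DNS


def make_dataset_uuid(metadata, id_fmt="{dataset_id}"):
    """Build a raw ID string from the metadata dict and derive a UUID5 from it.

    str.format supports nested lookups such as {a[b][c]}, which is what
    allows direct addressing of properties in a dict-based representation.
    """
    raw_id = id_fmt.format(**metadata)
    return str(uuid.uuid5(ID_NAMESPACE, raw_id))


# Invented sample record; the key names follow the example command above.
metadata = {
    "dataset_id": "ds001",
    "additional_display": {"sfb1451": {"project": "A01"}},
}
fmt = "{additional_display[sfb1451][project]}{dataset_id}"
print(make_dataset_uuid(metadata, fmt))
```

Because UUID5 is deterministic, the same raw ID string always maps to the same identifier, while prefixing the project property keeps equal `dataset_id` values from different projects apart.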