Adds research-dataset-schema.yaml and testing data #12

Closed
jsheunis wants to merge 8 commits

jsheunis
Contributor

This adds a new schema that aims to model a generic research dataset.

This schema draft models a generic research dataset for
the purpose of making it findable and understandable for
humans, for example via an online data catalog.

A (supposedly) valid JSON data document is added, along with an invalid document, for testing purposes. The Makefile is updated to add testing and document/output generation for the new schema.

TODO 1: solve errors from make validate-examples-research-dataset-schema:

/Library/Developer/CommandLineTools/usr/bin/make validate-valid-examples-research-dataset-schema validate-invalid-examples-research-dataset-schema
linkml-validate -s src/linkml/research-dataset-schema.yaml src/examples/research-dataset-schema/*
INFO:root:Importing datasets as datasets from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
INFO:root:Importing linkml:types as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=None
INFO:root:Importing types as types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
INFO:root:Using SchemaView with im=None
INFO:root:Importing datasets as datasets from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
INFO:root:Importing linkml:types as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=None
INFO:root:Importing types as types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
[ERROR] [src/examples/research-dataset-schema/ResearchDataset-penguins.json/0] Additional properties are not allowed ('date_modified', 'version' were unexpected) in /
[ERROR] [src/examples/research-dataset-schema/ResearchDataset-penguins.json/0] {'checksum_md5': 'e7e2be6b203a221949f05e02fcefd853', 'content_url': 'https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff', 'path_posix': 'raw/adelie.csv', 'size_in_bytes': 23755} is not of type 'string' in /has_part/0
[ERROR] [src/examples/research-dataset-schema/ResearchDataset-penguins.json/0] {'checksum_md5': '1549566fb97afa879dc9446edcf2015f', 'content_url': 'https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381', 'path_posix': 'raw/gentoo.csv', 'size_in_bytes': 11263} is not of type 'string' in /has_part/1
[ERROR] [src/examples/research-dataset-schema/ResearchDataset-penguins.json/0] {'checksum_md5': 'e4b0710c69297031d63866ce8b888f25', 'content_url': 'https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462', 'path_posix': 'raw/chinstrap.csv', 'size_in_bytes': 18872} is not of type 'string' in /has_part/2
[ERROR] [src/examples/research-dataset-schema/ResearchDataset-penguins.json/0] {'email': '[email protected]', 'name': 'Allison Horst'} is not of type 'string' in /data_controller
make[1]: *** [validate-valid-examples-research-dataset-schema] Error 1
make: *** [validate-examples-research-dataset-schema] Error 2

TODO 2: address warnings in make check-research-dataset-schema:

[Check src/linkml/research-dataset-schema.yaml]
Run linter
INFO:root:Importing linkml:units as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/units from source /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/meta.yaml; base_dir=None
INFO:root:Importing linkml:mappings as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/mappings from source /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/meta.yaml; base_dir=None
INFO:root:Importing linkml:types as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/types from source /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/meta.yaml; base_dir=None
INFO:root:Importing linkml:annotations as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/annotations from source /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/meta.yaml; base_dir=None
INFO:root:Importing linkml:extensions as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/extensions from source /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/meta.yaml; base_dir=None
INFO:root:Importing datasets as datasets from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
INFO:root:Importing linkml:types as /Users/jsheunis/opt/miniconda3/envs/linkml/lib/python3.11/site-packages/linkml_runtime/linkml_model/model/schema/types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=None
INFO:root:Importing types as types from source /Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml; base_dir=/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml
/Users/jsheunis/Documents/psyinf/datalad-concepts/src/linkml/research-dataset-schema.yaml
  warning  Enum 'authorType' does not have recommended slot 'description'  (recommended)
  warning  Enum has name 'authorType'  (standard_naming)
  warning  Permissible value of Enum 'authorType' has name 'Person'  (standard_naming)
  warning  Permissible value of Enum 'authorType' has name 'Organization'  (standard_naming)
  warning  Schema maps prefix 'afo' to namespace 'http://purl.allotrope.org/ontologies/result#' instead of using prefix 'afr'  (canonical_prefixes)
  warning  Schema maps prefix 'bibo' to namespace 'https://purl.org/ontology/bibo/' instead of namespace 'http://purl.org/ontology/bibo/'  (canonical_prefixes)
  warning  Schema maps prefix 'obo' to namespace 'https://purl.obolibrary.org/obo/' instead of namespace 'http://purl.obolibrary.org/obo/'  (canonical_prefixes)

✖ Found 7 problems in 1 schema
make: *** [check-research-dataset-schema] Error 1

@mih
Contributor

mih commented Dec 20, 2023

Thanks! Looks like the linter run produced some good recommendations already!

@jsheunis
Contributor Author

Indeed :) Now I have to figure out why the error "[something that is an object] is not of type 'string'" occurs.

Comment on lines 36 to 37
inlined_as_list: true
range: string
Contributor

I believe this does not work, and it is exactly the case that I outlined previously. The instruction says that objects will be inlined. But at the same time the range is set to string, a basic type. One of the two needs to change, I think.
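
The conflict described here can be illustrated with a small stdlib-only sketch. It is not LinkML's validator: `BASIC_TYPES` and `check_multivalued_slot` are hypothetical names invented for this illustration, and the sketch only mimics the shape of the error messages in the log above. It shows that when a multivalued slot has a basic type such as string as its range, every inlined object in the list is rejected, while plain strings pass.

```python
# Hypothetical, stdlib-only illustration; this is NOT LinkML's validator.
from typing import Any

# Assumed mapping from a basic range name to the Python type it accepts.
BASIC_TYPES = {"string": str, "integer": int}


def check_multivalued_slot(values: list[Any], range_name: str) -> list[str]:
    """Return one message per list item that does not match the basic range."""
    expected = BASIC_TYPES[range_name]
    errors = []
    for index, value in enumerate(values):
        if not isinstance(value, expected):
            errors.append(
                f"{value!r} is not of type {range_name!r} in /has_part/{index}"
            )
    return errors


# An inlined object (shortened from the penguins example) fails a string range.
parts = [{"path_posix": "raw/adelie.csv", "size_in_bytes": 23755}]
errors = check_multivalued_slot(parts, "string")
print(errors)

# A list of plain strings (for example, identifiers) passes the same check.
print(check_multivalued_slot(["raw/adelie.csv"], "string"))
```

Under these assumptions, either the data has to become strings or the range has to become a class, which matches the "one of the two needs to change" point.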

Contributor Author

Thanks, this seems like it could be the issue. My original expectation was that the range referred to whatever was inside the list (when using inlined_as_list), so that might be the mistake. Will report back soon.

Contributor Author

It actually looks like this is not the problem. Rather, it looks like the specification of a union of ranges via any_of, together with possibly faulty data, is what's causing the issue. I have:

slots:
  has_part:
    slot_uri: dcterms:hasPart
    multivalued: true
    inlined_as_list: true
    any_of:
      - range: File
      - range: ResearchDataset
    description: >-
      Linked entities that form part of a dataset
      such as files or other (sub)datasets

and then data:

"has_part": [
        {
            "checksum_md5": "e7e2be6b203a221949f05e02fcefd853",
            "content_url": "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff",
            "path_posix": "raw/adelie.csv",
            "size_in_bytes": 23755
        },
        {
            "checksum_md5": "1549566fb97afa879dc9446edcf2015f",
            "content_url": "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.220.3&entityid=e03b43c924f226486f2f0ab6709d2381",
            "path_posix": "raw/gentoo.csv",
            "size_in_bytes": 11263
        },
        {
            "checksum_md5": "e4b0710c69297031d63866ce8b888f25",
            "content_url": "https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.221.2&entityid=fe853aa8f7a59aa84cdd3197619ef462",
            "path_posix": "raw/chinstrap.csv",
            "size_in_bytes": 18872
        }
    ]

When I set this up I was wondering how the validator would know whether an object in the list is a File or a ResearchDataset, and I'm guessing this is the issue. Maybe it's expecting an ID as a string? Will investigate this further.
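
One way a validator can handle a union without an explicit type marker is structural matching: try each candidate class and accept the object if at least one fits, which is how JSON Schema's anyOf keyword behaves. The sketch below is a stdlib-only illustration of that idea, not a description of what linkml-validate does internally. The property sets in `CANDIDATES` are assumptions (the File keys are taken from the data above, the ResearchDataset keys are guessed), and `matching_classes` and `check_union` are hypothetical helper names.

```python
# Hypothetical illustration of anyOf-style structural matching.
from typing import Any

# Assumed property sets, for illustration only; the real classes in
# research-dataset-schema.yaml may define different slots.
CANDIDATES = {
    "File": {"checksum_md5", "content_url", "path_posix", "size_in_bytes"},
    "ResearchDataset": {"name", "author", "has_part", "data_controller"},
}


def matching_classes(obj: dict[str, Any]) -> list[str]:
    """Return every candidate class whose allowed properties cover obj's keys."""
    return [name for name, allowed in CANDIDATES.items() if set(obj) <= allowed]


def check_union(values: list[Any]) -> list[str]:
    """Return one message per item that is not an object matching any class."""
    errors = []
    for index, value in enumerate(values):
        if not isinstance(value, dict) or not matching_classes(value):
            errors.append(f"/has_part/{index} matches none of {sorted(CANDIDATES)}")
    return errors


file_obj = {
    "checksum_md5": "e7e2be6b203a221949f05e02fcefd853",
    "path_posix": "raw/adelie.csv",
    "size_in_bytes": 23755,
}
print(matching_classes(file_obj))         # the object fits only the File class
print(check_union([file_obj]))            # no errors for a matching object
print(check_union(["raw/adelie.csv"]))    # a bare string matches no class here
```

With this approach no discriminator field is needed in the data, at the cost of ambiguity when an object fits more than one class.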

Contributor Author

When I change the schema to:

slots:
  has_part:
    slot_uri: dcterms:hasPart
    multivalued: true
    inlined_as_list: true
    range: File
    description: >-
      Linked entities that form part of a dataset
      such as files or other (sub)datasets

and keep the same data, validation succeeds.

Contributor

Yeah, but the difference is that File is a class, not a type, which makes inlining valid. So you resolved that particular flaw from the other end.

],
"author": [
{
"author_type": "Person",

@mih mih mentioned this pull request Dec 21, 2023
@mih
Contributor

mih commented Mar 10, 2024

With #87 I finally caught up with this PR. The mainline now has a schema that can do all of this, and is not constrained to research applications.

The example matching the penguins data record in here can be found at https://github.com/psychoinformatics-de/datalad-concepts/blob/main/src/examples/dataset-version/DatasetVersionObject-penguins.yaml

Closing... Thanks for setting the mark. It helped a lot.

@mih mih closed this Mar 10, 2024