Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Implement dataset description for CAPS datasets #1158

Conversation

NicolasGensollen
Copy link
Member

@NicolasGensollen NicolasGensollen commented Apr 25, 2024

Closes #1101

After a little bit of thinking, I made some changes to the structure of the dataset_description.json file for CAPS datasets compared to what I originally described in #1101.

Having the "Name" key used as a way to encode the different processing wasn't a good idea (i.e. something like "t1-linear + pet-linear"). Actually it was a very bad idea as it will probably lead to very complicated logic to understand which pipelines were run on a given CAPS.

The dynamic nature of CAPS datasets where additional processing pipelines can be run means that the dataset_description.json file is also going to change to incorporate the necessary metadata describing these processings.

This PR proposes to have a field named "Processing" which is a list of objects describing the different processing pipelines that were run. Each processing has a name, a date, an author, a machine on which it was executed, and a path to an input dataset.

Here is an example of a dataset_description.json file obtained when running t1-linear and pet-linear on a CI machine:

{
    "Name": "e6719ef6-2411-4ad2-8abd-da1fd8fbdf32",
    "BIDSVersion": "1.6.0",
    "CAPSVersion": "1.0.0",
    "DatasetType": "derivative",
    "Processing": [
        {
            "Name": "t1-linear",
            "Date": "2024-08-06T10:28:21.848950",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/T1Linear/in/bids"
        },
        {
            "Name": "pet-linear",
            "Date": "2024-08-06T10:36:27.403373",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/PETLinear/in/bids"
        }
    ]
}

The name can be any user-provided string, and defaults to a random identifier (as the one shown above).
It cannot change, meaning that re-running a pipeline with the same CAPS dataset will not change the name of the dataset (even if the user explicitly asks for another name). It will only update the "processing" entry if needed.

A processing is identified by default by its name and its input path. This means that:

  • if you run the same pipeline (ex: t1-linear) twice on the same BIDS input, the corresponding processing metadata will be replaced with a new one.
  • if you run the same pipeline on different BIDS inputs (not super recommended but possible...), there will be a processing metadata for each.

The two version fields "BIDSVersion" and "CAPSVersion" are delicate because they are supposed to version the metadata models used. Theoretically, when using an existing CAPS dataset as an output for a pipeline, the versions of BIDS and CAPS specifications of the file should match the ones used by Clinica, otherwise there is no guarantee that things will not break.
For this reason, this PR proposes to raise an error when it happens.
I'm still debating on this as it could easily be perceived as annoying (for example in the CI data we have tons of BIDS dataset_description.json with old versions that were never updated...), but it will force us to have meaningful metadata with our datasets.

Finally, this PR also proposes to impose the presence of the dataset_description.json file in CAPS folders. When trying to run a pipeline with a new folder for CAPS, the file will automatically be generated. When trying to run a pipeline with an existing folder without a dataset_description.json, Clinica will raise an error with a suggestion of a minimum file that should be added.

Link to the documentation page: https://aramislab.paris.inria.fr/clinica/docs/public/PR-1158/CAPS/Specifications/

Requires data PR: https://github.com/aramis-lab/clinica_data_ci/pull/68

@NicolasGensollen NicolasGensollen self-assigned this Apr 25, 2024
@NicolasGensollen NicolasGensollen added the enhancement New feature or request label Apr 25, 2024
@NicolasGensollen NicolasGensollen added this to the v0.9.0 milestone Apr 25, 2024
@NicolasGensollen NicolasGensollen force-pushed the implement-caps-dataset-description branch from ca502fb to 7fad00e Compare May 27, 2024 07:41
@NicolasGensollen NicolasGensollen force-pushed the implement-caps-dataset-description branch 2 times, most recently from 7a497bd to d1540dc Compare July 11, 2024 08:54
@NicolasGensollen NicolasGensollen force-pushed the implement-caps-dataset-description branch 5 times, most recently from a562714 to 2adbc8f Compare August 2, 2024 15:39
@NicolasGensollen NicolasGensollen force-pushed the implement-caps-dataset-description branch from 1964953 to ad77d3b Compare August 6, 2024 14:16
@NicolasGensollen NicolasGensollen marked this pull request as ready for review August 7, 2024 14:20
Copy link
Contributor

@AliceJoubert AliceJoubert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ! Thanks @NicolasGensollen

- `Date`: This date is in iso-format and indicates when the processing was run.
- `Author`: This indicates the user name which triggered the processing.
- `Machine`: This indicates the name of the machine on which the processing was run.
- `ProcessingPath`: This is a path regex (relative to the CAPS dataset root) that can be used to get all sub-folders having data for this processing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not exactly true in case you have several processings with different names but similar pipelines, right ?
For example : if I choose to run T1-linear twice on different BIDS subjects going into the same CAPS output, and naming the processing two different names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I may be missing something here)

@AliceJoubert
Copy link
Contributor

Unless I missed it the documentation would need changes too but otherwise LGTM ! Thanks @NicolasGensollen :)

@NicolasGensollen
Copy link
Member Author

Unless I missed it the documentation would need changes too but otherwise LGTM ! Thanks @NicolasGensollen :)

Absolutely ! I couldn't find the time to do it this afternoon, but will have a go at it tomorrow.
Thanks for the review @AliceJoubert !

@NicolasGensollen NicolasGensollen merged commit 8bd07fc into aramis-lab:dev Aug 19, 2024
14 of 15 checks passed
@NicolasGensollen NicolasGensollen deleted the implement-caps-dataset-description branch August 19, 2024 11:31
AliceJoubert pushed a commit to AliceJoubert/clinica that referenced this pull request Aug 22, 2024
* add caps module with basic logic

* try linking to the engine

* add unit tests for caps module

* add unit tests for anat pipeline (might change to more general later...)

* fix broken unit tests

* trigger CI

* post rebase fixes

* add suggestion for basic dataset_description.json in error

* add some doc

* fix permission errors for non regression tests

* update documentation

* rework CAPS dataset_description.json

* write additional processing

* fix input dir

* fix permission errors

* use log_and_warn function

* permission issues on CI machines

* improvements

* update documentation

* provide more flexibility for comparing different versions of the specs

* remove processing_path attribute

* allow multiple processing with same name if input paths are different

* allow users to specify the name of the CAPS dataset for pipelines that create a CAPS

* update documentation

* small modification to the docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add version and possibly other metadata to CAPS datasets
2 participants