[ENH] Implement dataset description for CAPS datasets #1158

NicolasGensollen · 2024-04-25T16:13:05Z

After a little bit of thinking, I made some changes to the structure of the dataset_description.json file for CAPS datasets compared to what I originally described in #1101.

Using the "Name" key to encode the different processing pipelines (e.g. "t1-linear + pet-linear") wasn't a good idea. It was actually a very bad one, as it would probably lead to very complicated logic to work out which pipelines were run on a given CAPS.

Because CAPS datasets are dynamic (additional processing pipelines can be run on them at any time), the dataset_description.json file also has to change to incorporate the metadata describing these processings.

This PR proposes to have a field named "Processing" which is a list of objects describing the different processing pipelines that were run. Each processing has a name, a date, an author, a machine on which it was executed, and a path to an input dataset.

Here is an example of a dataset_description.json file obtained when running t1-linear and pet-linear on a CI machine:

{
    "Name": "e6719ef6-2411-4ad2-8abd-da1fd8fbdf32",
    "BIDSVersion": "1.6.0",
    "CAPSVersion": "1.0.0",
    "DatasetType": "derivative",
    "Processing": [
        {
            "Name": "t1-linear",
            "Date": "2024-08-06T10:28:21.848950",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/T1Linear/in/bids"
        },
        {
            "Name": "pet-linear",
            "Date": "2024-08-06T10:36:27.403373",
            "Author": "ci",
            "Machine": "ubuntu",
            "InputPath": "/mnt/data_ci/PETLinear/in/bids"
        }
    ]
}
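To make the structure concrete, here is a rough sketch of how such a file could be loaded into typed objects. The names `Processing`, `CAPSDescription` and `parse_description` are illustrative assumptions, not the classes added by this PR.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Processing:
    """One entry of the "Processing" list (hypothetical model)."""

    name: str
    date: datetime
    author: str
    machine: str
    input_path: str


@dataclass
class CAPSDescription:
    """Top-level content of a CAPS dataset_description.json (hypothetical model)."""

    name: str
    bids_version: str
    caps_version: str
    dataset_type: str = "derivative"
    processing: list[Processing] = field(default_factory=list)


def parse_description(text: str) -> CAPSDescription:
    """Build a CAPSDescription from the JSON text of the file."""
    raw = json.loads(text)
    return CAPSDescription(
        name=raw["Name"],
        bids_version=raw["BIDSVersion"],
        caps_version=raw["CAPSVersion"],
        dataset_type=raw.get("DatasetType", "derivative"),
        processing=[
            Processing(
                name=p["Name"],
                # Dates are stored in ISO format, e.g. 2024-08-06T10:28:21.848950
                date=datetime.fromisoformat(p["Date"]),
                author=p["Author"],
                machine=p["Machine"],
                input_path=p["InputPath"],
            )
            for p in raw.get("Processing", [])
        ],
    )
```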

The name can be any user-provided string, and defaults to a random identifier (like the one shown above).
Once set, it cannot change: re-running a pipeline on the same CAPS dataset will not change the name of the dataset (even if the user explicitly asks for another name). It will only update the "Processing" entry if needed.
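The naming rule can be sketched as follows. The helper `resolve_dataset_name` is hypothetical, and using a UUID4 as the random identifier is an assumption based on the shape of the example name.

```python
import uuid
from typing import Optional


def resolve_dataset_name(existing: Optional[str], requested: Optional[str]) -> str:
    """Pick the dataset name (hypothetical helper, not Clinica's API)."""
    if existing is not None:
        # An existing CAPS keeps its name, whatever the user requests.
        return existing
    if requested:
        return requested
    # Assumption: the random default is a UUID4 string.
    return str(uuid.uuid4())
```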

A processing is identified by default by its name and its input path. This means that:

- if you run the same pipeline (e.g. t1-linear) twice on the same BIDS input, the corresponding processing metadata will be replaced with a new one.
- if you run the same pipeline on different BIDS inputs (not really recommended, but possible...), there will be one processing metadata entry for each input.
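A minimal sketch of this replace-or-append behaviour, working on plain dictionaries. `upsert_processing` is a hypothetical name used for illustration only.

```python
def upsert_processing(processings: list[dict], new: dict) -> list[dict]:
    """Return the processing list with `new` added.

    An existing entry with the same (Name, InputPath) pair is replaced;
    entries differing in either field are kept.
    """
    key = (new["Name"], new["InputPath"])
    kept = [p for p in processings if (p["Name"], p["InputPath"]) != key]
    return kept + [new]
```

Running t1-linear twice on the same input leaves a single entry carrying the latest date, while running it on a second input yields two entries.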

The two version fields "BIDSVersion" and "CAPSVersion" are delicate because they are supposed to version the metadata models used. Theoretically, when using an existing CAPS dataset as an output for a pipeline, the versions of BIDS and CAPS specifications of the file should match the ones used by Clinica, otherwise there is no guarantee that things will not break.
For this reason, this PR proposes to raise an error when the versions do not match.
I'm still undecided about this, as it could easily be perceived as annoying (for example, in the CI data we have tons of BIDS dataset_description.json files with old versions that were never updated...), but it will force us to keep meaningful metadata with our datasets.
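As an illustration, a strict version check could look like the sketch below. `VersionMismatchError` and `check_versions` are hypothetical names, and strict string equality is an assumption about how the comparison is done.

```python
class VersionMismatchError(Exception):
    """Raised when the file and the running software disagree on a spec version."""


def check_versions(description: dict, expected_bids: str, expected_caps: str) -> None:
    """Compare the versions stored in the file with the expected ones."""
    expected = {"BIDSVersion": expected_bids, "CAPSVersion": expected_caps}
    for key, wanted in expected.items():
        found = description.get(key)
        if found != wanted:
            raise VersionMismatchError(
                f"{key} in dataset_description.json is {found!r}, expected {wanted!r}."
            )
```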

Finally, this PR also proposes to impose the presence of the dataset_description.json file in CAPS folders. When trying to run a pipeline with a new folder for CAPS, the file will automatically be generated. When trying to run a pipeline with an existing folder without a dataset_description.json, Clinica will raise an error with a suggestion of a minimum file that should be added.
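The intended behaviour could be sketched like this. `ensure_description` is a hypothetical helper, and treating a missing or empty directory as a "new" CAPS folder is an assumption made for the example.

```python
import json
from pathlib import Path


def ensure_description(caps_dir: Path, default: dict) -> dict:
    """Return the dataset description, creating it for a new CAPS folder."""
    path = caps_dir / "dataset_description.json"
    if path.exists():
        return json.loads(path.read_text())
    # Assumption: a folder that already holds content is an existing CAPS.
    if caps_dir.exists() and any(caps_dir.iterdir()):
        suggestion = json.dumps(default, indent=4)
        raise FileNotFoundError(
            f"{path} is missing. Add a minimal file such as:\n{suggestion}"
        )
    # New (missing or empty) folder: generate the file automatically.
    caps_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(default, indent=4))
    return default
```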

Link to the documentation page: https://aramislab.paris.inria.fr/clinica/docs/public/PR-1158/CAPS/Specifications/

Requires data PR: https://github.com/aramis-lab/clinica_data_ci/pull/68

AliceJoubert

LGTM ! Thanks @NicolasGensollen

AliceJoubert · 2024-08-08T15:17:53Z

docs/CAPS/Specifications.md

+- `Date`: This date is in iso-format and indicates when the processing was run.
+- `Author`: This indicates the user name which triggered the processing.
+- `Machine`: This indicates the name of the machine on which the processing was run.
+- `ProcessingPath`: This is a path regex (relative to the CAPS dataset root) that can be used to get all sub-folders having data for this processing.


That is not exactly true if you have several processings with different names but similar pipelines, right?
For example: if I run T1-linear twice on different BIDS subjects, writing to the same CAPS output, and give the two processings different names.

(I may be missing something here)

AliceJoubert · 2024-08-12T15:13:05Z

Unless I missed it the documentation would need changes too but otherwise LGTM ! Thanks @NicolasGensollen :)

NicolasGensollen · 2024-08-12T16:09:44Z

> Unless I missed it the documentation would need changes too but otherwise LGTM ! Thanks @NicolasGensollen :)

Absolutely ! I couldn't find the time to do it this afternoon, but will have a go at it tomorrow.
Thanks for the review @AliceJoubert !

* add caps module with basic logic
* try linking to the engine
* add unit tests for caps module
* add unit tests for anat pipeline (might change to more general later...)
* fix broken unit tests
* trigger CI
* post rebase fixes
* add suggestion for basic dataset_description.json in error
* add some doc
* fix permission errors for non regression tests
* update documentation
* rework CAPS dataset_description.json
* write additional processing
* fix input dir
* fix permission errors
* use log_and_warn function
* permission issues on CI machines
* improvements
* update documentation
* provide more flexibility for comparing different versions of the specs
* remove processing_path attribute
* allow multiple processing with same name if input paths are different
* allow users to specify the name of the CAPS dataset for pipelines that create a CAPS
* update documentation
* small modification to the docs

NicolasGensollen self-assigned this Apr 25, 2024

NicolasGensollen added the enhancement New feature or request label Apr 25, 2024

NicolasGensollen added this to the v0.9.0 milestone Apr 25, 2024

NicolasGensollen force-pushed the implement-caps-dataset-description branch from ca502fb to 7fad00e Compare May 27, 2024 07:41

NicolasGensollen force-pushed the implement-caps-dataset-description branch 2 times, most recently from 7a497bd to d1540dc Compare July 11, 2024 08:54

NicolasGensollen force-pushed the implement-caps-dataset-description branch 5 times, most recently from a562714 to 2adbc8f Compare August 2, 2024 15:39

NicolasGensollen mentioned this pull request Aug 6, 2024

[MAINT] Remove upper bound constraint on cattrs version #1250

Merged

NicolasGensollen added 17 commits August 6, 2024 14:51

add caps module with basic logic

09bbd8f

try linking to the engine

68b2fec

add unit tests for caps module

cea5f68

add unit tests for anat pipeline (might change to more general later...)

de102d3

fix broken unit tests

501151a

trigger CI

881ec84

post rebase fixes

bc625a7

add suggestion for basic dataset_description.json in error

1dd2318

add some doc

661a22c

fix permission errors for non regression tests

4bf9cbf

update documentation

59af136

rework CAPS dataset_description.json

4c3a18e

write additional processing

394e81c

fix input dir

4e2f1ca

fix permission errors

2d489ef

use log_and_warn function

a81b4bb

permission issues on CI machines

ad77d3b

NicolasGensollen force-pushed the implement-caps-dataset-description branch from 1964953 to ad77d3b Compare August 6, 2024 14:16

NicolasGensollen added 3 commits August 7, 2024 09:20

improvements

8d76987

update documentation

77fea0f

provide more flexibility for comparing different versions of the specs

4e9492f

NicolasGensollen marked this pull request as ready for review August 7, 2024 14:20

NicolasGensollen requested a review from AliceJoubert August 8, 2024 07:10

AliceJoubert approved these changes Aug 8, 2024

NicolasGensollen added 3 commits August 12, 2024 09:13

remove processing_path attribute

5a974bf

allow multiple processing with same name if input paths are different

9ef04e7

allow users to specify the name of the CAPS dataset for pipelines that create a CAPS

0eef866

NicolasGensollen added 2 commits August 13, 2024 08:17

update documentation

29071af

small modification to the docs

1acc1f3

NicolasGensollen merged commit 8bd07fc into aramis-lab:dev Aug 19, 2024
14 of 15 checks passed

NicolasGensollen deleted the implement-caps-dataset-description branch August 19, 2024 11:31

AliceJoubert mentioned this pull request Aug 19, 2024

Running T1-linear raises an error related to CAPS #1255

Closed
