[ENH] Implement dataset description for CAPS datasets #1158
Conversation
LGTM ! Thanks @NicolasGensollen
docs/CAPS/Specifications.md
Outdated
- `Date`: This date is in ISO format and indicates when the processing was run.
- `Author`: This indicates the user name of the user who triggered the processing.
- `Machine`: This indicates the name of the machine on which the processing was run.
- `ProcessingPath`: This is a path regex (relative to the CAPS dataset root) that can be used to retrieve all sub-folders containing data for this processing.
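The fields listed in this (since reworked, as marked "Outdated") revision of the spec could be assembled as in the sketch below. This is purely illustrative, not Clinica's actual implementation; the helper name is hypothetical.

```python
import datetime
import getpass
import socket

def make_processing_entry(processing_path: str) -> dict:
    """Hypothetical builder for one processing metadata entry (illustrative only)."""
    return {
        "Date": datetime.datetime.now().isoformat(),  # ISO-format run date
        "Author": getpass.getuser(),                  # user who triggered the run
        "Machine": socket.gethostname(),              # machine the run executed on
        "ProcessingPath": processing_path,            # regex relative to the CAPS root
    }

entry = make_processing_entry(r"subjects/sub-.*/ses-.*/t1_linear")
print(sorted(entry))
```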
That is not exactly true when you have several processings with different names but similar pipelines, right?
For example: if I choose to run T1-linear twice on different BIDS subjects going into the same CAPS output, giving the processings two different names.
(I may be missing something here)
Unless I missed it, the documentation would need changes too, but otherwise LGTM! Thanks @NicolasGensollen :)
Absolutely! I couldn't find the time to do it this afternoon, but I will have a go at it tomorrow.
* add caps module with basic logic
* try linking to the engine
* add unit tests for caps module
* add unit tests for anat pipeline (might change to more general later...)
* fix broken unit tests
* trigger CI
* post rebase fixes
* add suggestion for basic dataset_description.json in error
* add some doc
* fix permission errors for non regression tests
* update documentation
* rework CAPS dataset_description.json
* write additional processing
* fix input dir
* fix permission errors
* use log_and_warn function
* permission issues on CI machines
* improvements
* update documentation
* provide more flexibility for comparing different versions of the specs
* remove processing_path attribute
* allow multiple processing with same name if input paths are different
* allow users to specify the name of the CAPS dataset for pipelines that create a CAPS
* update documentation
* small modification to the docs
Closes #1101
After a little bit of thinking, I made some changes to the structure of the `dataset_description.json` file for CAPS datasets compared to what I originally described in #1101. Using the "Name" key as a way to encode the different processings (i.e. something like "t1-linear + pet-linear") wasn't a good idea. Actually, it was a very bad idea, as it would probably lead to very complicated logic for understanding which pipelines were run on a given CAPS.
The dynamic nature of CAPS datasets, where additional processing pipelines can be run at any time, means that the `dataset_description.json` file also needs to change to incorporate the metadata describing these processings. This PR proposes a field named "Processing", which is a list of objects describing the different processing pipelines that were run. Each processing has a name, a date, an author, the machine on which it was executed, and the path to its input dataset.
Here is an example of a `dataset_description.json` file obtained when running `t1-linear` and `pet-linear` on a CI machine:
The name can be any user-provided string and defaults to a random identifier (as in the example above).
It cannot change: re-running a pipeline with the same CAPS dataset will not change the name of the dataset (even if the user explicitly asks for another name); it will only update the "Processing" entry if needed.
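To make the structure concrete, here is a hypothetical sketch of such a file built as a Python dict. Every value below is a placeholder (the field name `InputPath` included), not real CI output:

```python
import json

# Hypothetical dataset_description.json for a CAPS on which t1-linear and
# pet-linear were both run. All values are placeholders for illustration.
description = {
    "Name": "e34c5d7a",        # default name: a random identifier (placeholder)
    "BIDSVersion": "1.7.0",    # placeholder versions
    "CAPSVersion": "1.0.0",
    "Processing": [
        {
            "Name": "t1-linear",
            "Date": "2024-01-01T10:00:00",
            "Author": "ci-user",
            "Machine": "ci-machine",
            "InputPath": "/data/bids",  # hypothetical field name for the input dataset
        },
        {
            "Name": "pet-linear",
            "Date": "2024-01-01T11:00:00",
            "Author": "ci-user",
            "Machine": "ci-machine",
            "InputPath": "/data/bids",
        },
    ],
}
print(json.dumps(description, indent=4))
```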
By default, a processing is identified by its name and its input path.
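The "identified by name and input path" rule described above can be sketched as an update-or-append helper: re-running with the same (name, input path) pair updates the existing entry, while the same name with a different input path adds a new one. This is a hypothetical sketch, not Clinica's actual code:

```python
# Upsert a processing entry keyed by (Name, InputPath); field names are
# placeholders matching the hypothetical example above.
def upsert_processing(processings: list, new: dict) -> None:
    for i, existing in enumerate(processings):
        if (existing["Name"], existing["InputPath"]) == (new["Name"], new["InputPath"]):
            processings[i] = new   # same processing re-run: update in place
            return
    processings.append(new)        # same name, different input path: new entry

runs = []
upsert_processing(runs, {"Name": "t1-linear", "InputPath": "/bids-a", "Date": "d1"})
upsert_processing(runs, {"Name": "t1-linear", "InputPath": "/bids-a", "Date": "d2"})  # updates d1
upsert_processing(runs, {"Name": "t1-linear", "InputPath": "/bids-b", "Date": "d3"})  # appends
print(len(runs))  # 2
```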
The two version fields, "BIDSVersion" and "CAPSVersion", are delicate because they are supposed to version the metadata models used. Theoretically, when using an existing CAPS dataset as an output for a pipeline, the BIDS and CAPS specification versions in the file should match the ones used by Clinica; otherwise, there is no guarantee that things will not break.
For this reason, this PR proposes to raise an error when they do not match.
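The version check described above could look like the following sketch; the error class and message are hypothetical, not Clinica's actual API:

```python
class CapsVersionError(ValueError):
    """Hypothetical error raised when stored versions differ from Clinica's."""

def check_versions(description: dict, expected_bids: str, expected_caps: str) -> None:
    # Compare the versions stored in an existing dataset_description.json
    # against the ones the current Clinica uses; raise on any mismatch.
    for key, expected in (("BIDSVersion", expected_bids), ("CAPSVersion", expected_caps)):
        found = description.get(key)
        if found != expected:
            raise CapsVersionError(
                f"{key} mismatch: dataset has {found!r}, Clinica expects {expected!r}."
            )

check_versions({"BIDSVersion": "1.7.0", "CAPSVersion": "1.0.0"}, "1.7.0", "1.0.0")  # passes
try:
    check_versions({"BIDSVersion": "1.2.0", "CAPSVersion": "1.0.0"}, "1.7.0", "1.0.0")
except CapsVersionError as err:
    print(err)
```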
I'm still debating this, as it could easily be perceived as annoying (for example, our CI data contains tons of BIDS `dataset_description.json` files with old versions that were never updated...), but it will force us to have meaningful metadata with our datasets.
Finally, this PR also proposes to impose the presence of the `dataset_description.json` file in CAPS folders. When running a pipeline with a new CAPS folder, the file will be generated automatically. When running a pipeline with an existing folder that lacks a `dataset_description.json`, Clinica will raise an error suggesting a minimal file to add.
Link to the documentation page: https://aramislab.paris.inria.fr/clinica/docs/public/PR-1158/CAPS/Specifications/
Requires data PR: https://github.com/aramis-lab/clinica_data_ci/pull/68
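The mandatory-file check described above could be sketched as follows: if an existing CAPS folder has no `dataset_description.json`, raise an error that embeds a suggested minimal file. Helper and field values are hypothetical, not Clinica's actual implementation:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical minimal dataset_description.json suggested in the error message.
MINIMAL_DESCRIPTION = {"Name": "my-caps", "BIDSVersion": "1.7.0", "CAPSVersion": "1.0.0"}

def check_caps_folder(caps_dir: Path) -> None:
    """Raise if the CAPS folder lacks a dataset_description.json (sketch only)."""
    if not (caps_dir / "dataset_description.json").exists():
        suggestion = json.dumps(MINIMAL_DESCRIPTION, indent=4)
        raise FileNotFoundError(
            f"{caps_dir} does not contain a dataset_description.json file.\n"
            f"Consider adding one, for example:\n{suggestion}"
        )

with tempfile.TemporaryDirectory() as tmp:
    try:
        check_caps_folder(Path(tmp))  # empty folder: the check fails
    except FileNotFoundError as err:
        print("error raised as expected")
```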