Establish studyforrest-data-raw #34

Open
mih opened this issue Apr 26, 2021 · 10 comments

@mih
Contributor

mih commented Apr 26, 2021

Aiming to be a superdataset for targeted subdatasets for each "study". These studies were internally called

  • 7T_ad
  • pandora
  • anatomy
  • fg_eyegaze_raw
  • 3T_av_et
  • 3T_visloc

These names correspond to folders in the original data structure on the cluster. They contain the pristine data artifacts and can never be made public due to data protection regulations.

There are at least two more "raw" datasets (multires3T and multires7T), but their DICOM data are not readily accessible ATM.

@bpoldrack

I'm working on building this dataset with the subdatasets 7T_ad, pandora, and anatomy for now.
It is not entirely clear whether and how we want to reflect the notion of phase1. Three options:

  • within this dataset (just a directory?), so it's clear from the hierarchy that 7T_ad, pandora, anatomy are its parts
  • at the level of converted, anonymized BIDS datasets only as a partial conversion of studyforrest-data-raw (BIDS dataset would then have those three subdatasets under sourcedata)
  • intermediate dataset that would then be converted and would be at the "same level" as studyforrest-data-raw, referencing a subset of its subdatasets
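
To make the three options easier to compare, here is a rough sketch that renders each one as a directory tree. Only 7T_ad, pandora, anatomy, sourcedata and studyforrest-data-raw come from this thread; `phase1`, `phase1-bids` and `phase1-raw` are placeholder names I made up for illustration, and the nesting is my reading of the options, not a settled layout.

```python
# Sketch of the three layout options as nested dicts.
# "phase1", "phase1-bids" and "phase1-raw" are hypothetical placeholder names.
STUDIES = ["7T_ad", "pandora", "anatomy"]


def subdatasets():
    """Return a fresh mapping of the three study subdatasets."""
    return {name: {} for name in STUDIES}


LAYOUTS = {
    # Option 1: phase1 is a plain directory inside the raw superdataset.
    "option 1": {"studyforrest-data-raw": {"phase1": subdatasets()}},
    # Option 2: phase1 exists only as a BIDS dataset with the raw
    # subdatasets registered under sourcedata.
    "option 2": {"phase1-bids": {"sourcedata": subdatasets()}},
    # Option 3: an intermediate raw dataset next to studyforrest-data-raw
    # that references a subset of the same subdatasets.
    "option 3": {
        "studyforrest-data-raw": subdatasets(),
        "phase1-raw": subdatasets(),
    },
}


def render(tree, indent=0):
    """Render a nested dict as indented directory lines."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * indent + name + "/")
        lines.extend(render(children, indent + 1))
    return lines


if __name__ == "__main__":
    for label, tree in LAYOUTS.items():
        print(label)
        print("\n".join(render(tree, 1)))
```

In option 3 the same three subdatasets show up under two superdatasets, which is what would let other, possibly overlapping subsamples be handled the same way.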

@loj

loj commented Apr 26, 2021

I would lean towards option 2

at the level of converted, anonymized BIDS datasets only as a partial conversion of studyforrest-data-raw (BIDS dataset would then have those three subdatasets under sourcedata)

At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

@bpoldrack

At this level, then, we could also maintain the data representation as described in papers in a separate branch. #5

True, but that is independent of how we reference the raw data at the level of a notion like phase1.

The "issue" with 2) would be dataset-level files like README, dataset_description.json and so on. The current approach would be to have them in the raw dataset and use a "copy-converter" for the respective BIDS dataset. If we don't have a phase1-raw location (1 or 3), where would those things live? They could, of course, be created/added at the BIDS level only. I'm not sure whether there are things at the phase1 abstraction where this wouldn't work (b/c of anonymization or whatever), though.

Approach 1 would be a special case for phase1, since other, possibly overlapping superdatasets can't be addressed the same way. So, I lean towards 3) as the most flexible thing that seems likely to generalize as an approach for other subsamples of studyforrest-data-raw. WDYT, @mih ?

@adswa adswa added the data label Apr 26, 2021
@bpoldrack

bpoldrack commented Apr 26, 2021

Adapted the scripts/approach to build this.

First trial of building the (sub)datasets finished:
/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad

Initial setup of them was done by /data/project/studyforrest_phase1/build-forrest/studyforrest-data-raw-sh.
Actual data import + spec editing was done by their respective build script in each dataset's code/creation.

@bpoldrack

The three datasets pandora, 7T_ad and anatomy need to be verified as being what we want them to be. That is: they are supposed to capture all relevant raw data of those "studies" (independent of what should be converted in what context). This requires knowledge of what exactly that means. How do we approach this, @mih?

@bpoldrack

bpoldrack commented Apr 27, 2021

Additionally, I have now created /data/project/studyforrest_phase1/scientific-data-2014-raw, which contains those three as subdatasets, since we wanted to aim for publications being the targets for converted datasets. The first conversion run based on this dataset is currently running in /data/project/studyforrest_phase1/scientific-data-2014-bids.
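
For reference, a dry-run sketch of how a superdataset like this could be assembled with the DataLad command line. It only prints the commands and runs nothing. That `datalad create` plus `datalad clone -d` were the steps actually used here is an assumption; the build scripts mentioned above may do it differently. `plan_superdataset` is a helper name made up for this example.

```python
import shlex


def plan_superdataset(super_path, sub_sources):
    """Return datalad commands (as argv lists) that create a superdataset
    and register each source dataset as a subdataset under its own name."""
    plan = [["datalad", "create", super_path]]
    for source in sub_sources:
        # Use the last path component of the source as the subdataset name.
        name = source.rstrip("/").rsplit("/", 1)[-1]
        plan.append(
            ["datalad", "clone", "-d", super_path, source, f"{super_path}/{name}"]
        )
    return plan


BASE = "/data/project/studyforrest_phase1"
SUBS = [f"{BASE}/{name}" for name in ("pandora", "anatomy", "7T_ad")]

if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in plan_superdataset(f"{BASE}/scientific-data-2014-raw", SUBS):
        print(shlex.join(argv))
```

Keeping the plan as data makes it easy to review before anything touches the cluster.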

Adjusting the specs and checking what may be missing from the converted dataset will require some kind of target definition to compare to. Is this supposed to be the release_openfmri1 tag in anondata, or is there something else to base the adjustments on, @mih?

@bpoldrack

bpoldrack commented Apr 27, 2021

Re raw data capturing:

  • anatomy looks good as far as I can tell, except for two directories:
    Under /data/project/studyforrest/anatomy/data two subjects have an orig folder in addition to raw/dicom. Content looks like a conversion result, but I'm not sure. Does this need to be captured, @mih ?

  • As for pandora:
    /data/project/studyforrest/pandora shows logs, pmc.tar.gz and swaroop, which aren't currently captured. What are those, @mih, and are they in any way associated with certain acquisitions?
    I have an old TODO note, claiming I need logs and logs/raw somehow. Not sure what to make of this distinction.

  • 7T_ad:

    • The data folder in /data/project/studyforrest/7T_ad has behav subdirectories. I guess they need to be pulled in.
      Do they require some kind of conversion? Are they just copied into the converted dataset? If so, where?
      Old note on the issue that I can't fully decode ATM:

      import behav data into first acq per subject
      from /data/project/studyforrest/7T_ad/ad_data/${sub}*
      => the same as behav/; Two files are copied to behav/ + two more per subject.

    • Additionally there's ad_data. What about that?
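
A small helper along these lines could make the "what is not captured yet" check repeatable. It is a sketch only: it looks at top-level entries by name, and the set of names considered captured has to be supplied by hand. `uncaptured_entries` and the example `captured` set are made up for illustration; they are not part of the build scripts.

```python
from pathlib import Path


def uncaptured_entries(source_dir, captured):
    """List top-level entries of source_dir whose names are not in captured."""
    captured = set(captured)
    return sorted(
        entry.name
        for entry in Path(source_dir).iterdir()
        if entry.name not in captured
    )


if __name__ == "__main__":
    # Hypothetical call: "data" stands in for whatever is already imported.
    source = Path("/data/project/studyforrest/pandora")
    if source.is_dir():
        for name in uncaptured_entries(source, captured={"data"}):
            print(name)
```

This only flags candidates by name. Whether something like logs or ad_data belongs to a particular acquisition still needs someone who knows the study.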

@mih
Contributor Author

mih commented Apr 28, 2021

OK, I made a first push into this project. It contains the majority of the pieces that are needed to build studyforrest-data-raw or hirni or whatever the name will be -- in the artifact/ directory.

@mih
Contributor Author

mih commented May 6, 2021

@bpoldrack can you please post the link to the generated raw datasets?

@bpoldrack

@mih

/data/project/studyforrest_phase1/pandora
/data/project/studyforrest_phase1/anatomy
/data/project/studyforrest_phase1/7T_ad
