Current state of the OpenNeuro dataset #35

Open
m-wierzba opened this issue Apr 26, 2021 · 12 comments

Comments

@m-wierzba

m-wierzba commented Apr 26, 2021

This issue is meant to record all aspects related to the current state of the dataset available at OpenNeuro:
https://openneuro.org/datasets/ds000113/versions/1.3.0

The dataset has been obtained with:

datalad install https://github.com/OpenNeuroDatasets/ds000113.git

The local copy of the dataset is stored at: /data/project/studyforrest/openneuro.

@bpoldrack

FTR: With respect to re-converting phase1 and keeping it as close as possible to openneuro, it would be good to double-check the task labels, for example. /data/project/studyforrest/anondata/task_key.txt lists numbered tasks, and previous attempts to re-build phase1 use the labels aomovie and pandora. Probably good to settle on common terms?

@m-wierzba
Author

Our local copy of the OpenNeuro dataset lives here:
/data/project/studyforrest/openneuro/ds000113

Based on the disk usage inspection, I assume that the download was successful:

cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ du -h -s
423G	.

To check whether the "S3 bucket error" resulted in any missing content, I ran datalad get again, but this produced no output (suggesting that no content is missing). The "S3 bucket error" is still present and has been reported to the OpenNeuro people (thx, @adswa!).

@m-wierzba
Author

m-wierzba commented Apr 27, 2021

The goal now is to compare the current state of the OpenNeuro dataset against the two datasets that we are primarily interested in putting back into shape (i.e. the so-called "phase1" and "phase2" datasets).

The location of the OpenNeuro data:
/data/project/studyforrest/openneuro/ds000113

The location of the "phase1" data:
/data/project/studyforrest/anondata

The location of the "phase2" data:
/data/project/studyforrest/phase2

We want to generate lists of what's common and lists of what's unique:

  • openneuro and anondata common content
  • openneuro and phase2 common content
  • content that is in anondata, but not in openneuro (and vice versa)
  • content that is in phase2, but not in openneuro (and vice versa)

My approach to the problem would be to compare the sha signatures:

  1. For each of the three datasets, generate a sorted sha list, e.g.:
cynamon@juseless in /data/project/studyforrest/openneuro/ds000113 on git:master
❱ find . -type f -print0 | xargs -0 sha1sum | sort > /home/cynamon/openneuro-sha-sorted
  2. Make sure that there are no sha duplicates (unlikely) in any of the three files.

  3. To obtain the common content for any two compared datasets, I would use join, e.g.:

join openneuro-sha-sorted anondata-sha-sorted

That would result in a list of the following structure:

sha1 <location1> <location2>

0007c0b19bfafa1c2a731b56b60821b8d3c857b7 ./.git/annex/objects/3K/GF/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt/MD5E-s14734--dbbd13a005405696670bda336f96fc99.txt ./.git/annex/objects/2z/0w/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt/SHA256E-s14734--0204eaef07f3b1658af1fdc7427e1d1863bd50d164dc07a678220bfaa2338ed2.txt
004f0b7de99745faf9a8feecce336df579ec47a7 ./.git/annex/objects/Xw/G1/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz/MD5E-s176248186--31fc064ca9db4b4205d6b0dcc5d98ad4.nii.gz ./.git/annex/objects/5z/xp/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz/SHA256E-s176248186--9dc89b01eaf861b077f3aa3bfa5718114ba6286aa853d39fc18059f3b0fc971e.nii.gz
0067621cd20d3579f5a472dfa5acff0a9648ab32 ./.git/annex/objects/V6/qF/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz/MD5E-s625777885--f8fbb1b93a4b808fe43a7c788e60e345.nii.gz ./.git/annex/objects/w9/f5/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz/SHA256E-s625777885--8c69b107ed9ff6a3bf710d050fa34c47ef6a6412a2e0bd3d8e8bcd6fcde1f863.nii.gz
  4. To obtain the unique content for any dataset compared against another dataset, I would use comm. I would probably need to cut the file locations first, compare only the sha signatures, and then glue the file locations back.
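
For reference, the comparison outlined above could also be sketched in Python instead of join/comm, which avoids cutting and re-gluing the file locations. This is only a rough sketch, not what was actually run: the function names (sha1_of, index_tree, compare, duplicates) are made up for illustration, and it assumes that only regular files should be hashed, as with find . -type f.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tree(root):
    """Map each SHA-1 to the list of relative paths with that content."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Mirror `find -type f`: regular files only, symlinks skipped.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            index[sha1_of(full)].append(os.path.relpath(full, root))
    return index


def compare(index_a, index_b):
    """Split two indexes into common, only-in-a and only-in-b checksums."""
    keys_a, keys_b = set(index_a), set(index_b)
    common = {k: (index_a[k], index_b[k]) for k in keys_a & keys_b}
    only_a = {k: index_a[k] for k in keys_a - keys_b}
    only_b = {k: index_b[k] for k in keys_b - keys_a}
    return common, only_a, only_b


def duplicates(index):
    """Checksums that occur at more than one path within one dataset."""
    return {k: v for k, v in index.items() if len(v) > 1}
```

Because every checksum maps to a list of paths, duplicates within one dataset do not break the comparison; they are simply reported by duplicates().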

@bpoldrack

Looks sane to me :)

And I'm very much interested in this:

That would result in a list of the following structure:
sha1 <location1> <location2>

for comparing to fresh conversion.

@bpoldrack

I noticed something confusing to me:

❱ cat /data/project/studyforrest/openneuro/ds000113/recording-cardresp_physio.json
{
    "SamplingFrequency": 500.0, 
    "StartTime": 0.0, 
    "Columns": [
        "trigger", 
        "cardiac", 
        "respiratory"
    ], 
    "ContentDescription": "Activity recorded with a pulse oximeter (cardiac) and respiration belt."
}

This is the only sidecar file I could find that refers to recording-cardresp_physio. cardresp is the label "we" used for phase1. However, all I have there are sampling frequencies of either 100 or 200, not 500. What's up with that, @mih?

@m-wierzba
Author

m-wierzba commented Apr 27, 2021

The final approach (proposed by @mih) to compare the datasets in a more human-readable fashion is the following.

  1. We first generate a list of md5 sums in the following way:
    md5sum $(git annex find) | tee /tmp/openneuro.md5
    md5sum $(git annex find --branch release_openfmri1) | tee /tmp/openfmri.md5

  2. We then compare the resulting output files with a python script /data/project/studyforrest/openneuro/match.py, created by @mih:

import sys

d = {}

for f in sys.argv[1:]:
    for line in open(f):
        line = line.rstrip('\n')
        checksum, path = line.split(" ", 1)
        rec = d.get(checksum, [])
        rec.append('{}::{}'.format(f, path))
        d[checksum] = rec

for k, v in d.items():
    print('{}: {}'.format(k, v))

To run the script:
python match.py openneuro.md5 openfmri.md5

  3. The resulting output looks as follows:
cynamon@juseless in /data/project/studyforrest/openneuro
❱ head -3 latestopenneuro_vs_openfmrirelease1.txt
a4ca2772ab82a4d422afac2898f34af6: ['/tmp/openfmri.md5:: acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/04-sT1W_3D_TFE_TR2300_TI900_0_7iso_FS.txt']
fc4bee45d4ae95fe65e7add53413eda4: ['/tmp/openfmri.md5:: acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/05-sT2W_3D_TSE_32chSHC_0_7iso.txt']
d54a120f35975605801f8a1d0dddca2d: ['/tmp/openfmri.md5:: acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt', '/tmp/openneuro.md5:: sourcedata/acquisition_protocols/06-VEN_BOLD_HR_32chSHC.txt']
  4. It can then be searched for md5 entries that:
  • are present in both compared datasets:
❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep openneuro | wc -l
1391
  • are present only in one dataset (e.g. openfmri), but not in the other (e.g. openneuro):
❱ cat latestopenneuro_vs_openfmrirelease1.txt | grep openfmri | grep -v openneuro | wc -l   
7035
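
As an alternative to grepping the printed output (where a match on "openfmri" or "openneuro" could in principle also come from a file path), the counts could be derived directly from the checksum table. The following is a hedged sketch, not part of match.py: load and count_by_sources are hypothetical names, and the input files are assumed to be in plain md5sum format (checksum, whitespace, path).

```python
from collections import Counter


def load(listings):
    """Build {checksum: [(listing, path), ...]} from md5sum-style files."""
    table = {}
    for listing in listings:
        with open(listing) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if not line:
                    continue
                # md5sum separates checksum and path with two characters;
                # splitting on any whitespace run drops the leading space.
                checksum, path = line.split(None, 1)
                table.setdefault(checksum, []).append((listing, path))
    return table


def count_by_sources(table):
    """Count checksums by the exact set of listings they occur in."""
    counts = Counter()
    for records in table.values():
        counts[frozenset(listing for listing, _ in records)] += 1
    return counts
```

With two listings, the count stored under the frozenset of both file names corresponds to the "present in both" number, and the counts stored under the single-name frozensets correspond to the "only in one dataset" numbers.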

@adswa
Contributor

adswa commented Apr 29, 2021

Some files are missing from the OpenNeuro dataset: subject 5 has only 2 (instead of 8) physio files for the auditoryperception/pandora task:

/data/project/studyforrest/openneuro/ds000113/sub-05/ses-auditoryperception/func on git:master
❱ ls
sub-05_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-05_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-01_events.tsv     sub-05_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-01_physio.tsv.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-05_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-05_ses-auditoryperception_task-auditoryperception_run-02_events.tsv     sub-05_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   
sub-05_ses-auditoryperception_task-auditoryperception_run-02_physio.tsv.gz  sub-05_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz
sub-05_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz    sub-05_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

There are no physio files for subject 7:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-07/ses-auditoryperception/func on git:master
❱ ls
sub-07_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-07_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-07_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-07_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-07_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-07_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

There are no physio files for subject 18:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-18/ses-auditoryperception/func on git:master
❱ ls
sub-18_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-18_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-18_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-18_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-18_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-18_ses-auditoryperception_task-auditoryperception_run-06_events.tsv

and none for subject 19:

adina@juseless in /data/project/studyforrest/openneuro/ds000113/sub-19/ses-auditoryperception/func on git:master
❱ ls
sub-19_ses-auditoryperception_task-auditoryperception_run-01_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-04_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-07_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-01_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-04_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-07_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-02_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-05_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-08_bold.nii.gz
sub-19_ses-auditoryperception_task-auditoryperception_run-02_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-05_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-08_events.tsv
sub-19_ses-auditoryperception_task-auditoryperception_run-03_bold.nii.gz  sub-19_ses-auditoryperception_task-auditoryperception_run-06_bold.nii.gz 
sub-19_ses-auditoryperception_task-auditoryperception_run-03_events.tsv   sub-19_ses-auditoryperception_task-auditoryperception_run-06_events.tsv
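
To check all subjects at once instead of listing directories by hand, something along the following lines might help. It is a sketch under assumptions: runs_without_physio is a hypothetical helper, the file naming is taken from the listings above, and a run counts as incomplete when a bold file exists without a matching physio file.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches e.g. sub-05_ses-..._task-..._run-01_bold.nii.gz
RUN_RE = re.compile(r"^(sub-\d+)_.*_run-(\d+)_(bold\.nii\.gz|physio\.tsv\.gz)$")


def runs_without_physio(dataset_root, session="ses-auditoryperception"):
    """Return {subject: [run labels]} where a bold file has no physio file."""
    seen = defaultdict(lambda: defaultdict(set))  # subject -> run -> suffixes
    for path in Path(dataset_root).glob(f"sub-*/{session}/func/*"):
        match = RUN_RE.match(path.name)
        if match:
            subject, run, suffix = match.groups()
            seen[subject][run].add(suffix)
    missing = {}
    for subject, runs in sorted(seen.items()):
        gaps = sorted(r for r, s in runs.items() if "physio.tsv.gz" not in s)
        if gaps:
            missing[subject] = gaps
    return missing
```

Calling runs_without_physio("/data/project/studyforrest/openneuro/ds000113") would return a dictionary keyed by subject, with the run labels that lack a physio file as values.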

@adswa
Contributor

adswa commented Apr 29, 2021

The pandora/auditoryperception session on OpenNeuro has wrong stimulus file names and does not ship the audio stimulus files.

@adswa
Contributor

adswa commented Apr 29, 2021

The events.tsv files of the pandora session on OpenNeuro also messed up the run and run_id association:

from OpenNeuro:

cat sub-01_ses-auditoryperception_task-auditoryperception_run-01_events.tsv 
onset	duration	trial_type	run	run_id	volume	run_volume	stim	genre	delay	catch	sound_soa	trigger_ts
0.01	6.0	rocknroll	1	6	0	0	rocknroll_002.wav	rocknroll	6	0	0.007200000000011642	1233.5005
12.0	6.0	symphonic	1	6	6	6	symphonic_003.wav	symphonic	6	0	0.002899999999954161	1245.4996
24.0	6.0	rocknroll	1	6	12	12	rocknroll_001.wav	rocknroll	6	0	0.002499999999827196	1257.4997
36.0	6.0	metal	1	6	18	18	metal_004.wav	metal	6	0	0.002600000000029468	1269.5
48.01	6.0	symphonic	1	6	24	24	symphonic_002.wav	symphonic	8	1	0.013300000000072032	1281.5003
62.0	6.0	country	1	6	31	31	country_003.wav	country	6	0	0.0027000000000043656	1295.4996
74.0	6.0	country	1	6	37	37	country_002.wav	country	6	0	0.0025000000000545697	1307.4993
86.0	6.0	ambient	1	6	43	43	ambient_001.wav	ambient	6	0	0.002900000000181535	1319.4994
98.01	6.0	ambient	1	6	49	49	ambient_004.wav	ambient	8	1	0.007800000000088403	1331.499
112.0	6.0	country	1	6	56	56	country_000.wav	country	4	0	0.004100000000107684	1345.4985
122.0	6.0	symphonic	1	6	61	61	symphonic_001.wav	symphonic	6	0	0.0025000000000545697	1355.499
134.0	6.0	symphonic	1	6	67	67	symphonic_004.wav	symphonic	4	0	0.003600000000005821	1367.4987
144.0	6.0	ambient	1	6	72	72	ambient_003.wav	ambient	6	0	0.0035000000000309233	1377.4986
156.0	6.0	metal	1	6	78	78	metal_003.wav	metal	4	0	0.0025000000000545697	1389.4985
166.0	6.0	metal	1	6	83	83	metal_000.wav	metal	6	0	0.0025000000000545697	1399.4991
178.0	6.0	ambient	1	6	89	89	ambient_000.wav	ambient	6	0	0.002600000000029468	1411.4989
190.02	6.0	rocknroll	1	6	95	95	rocknroll_003.wav	rocknroll	8	1	0.01510000000007494	1423.4995
204.0	6.0	country	1	6	102	102	country_004.wav	country	6	0	0.0027000000000043656	1437.4986
216.0	6.0	rocknroll	1	6	108	108	rocknroll_004.wav	rocknroll	4	0	0.002600000000029468	1449.4994
226.0	6.0	ambient	1	6	113	113	ambient_002.wav	ambient	4	0	0.002600000000029468	1459.4995
236.0	6.0	symphonic	1	6	118	118	symphonic_000.wav	symphonic	6	0	0.0025000000000545697	1469.4986
248.01	6.0	metal	1	6	124	124	metal_002.wav	metal	8	1	0.00920000000019172	1481.4986
262.01	6.0	country	1	6	131	131	country_001.wav	country	8	1	0.011099999999942156	1495.4987
276.0	6.0	metal	1	6	138	138	metal_001.wav	metal	6	0	0.003600000000005821	1509.4989
288.0	6.0	rocknroll	1	6	144	144	rocknroll_000.wav	rocknroll	6	0	0.0025000000000545697	1521.4994

from us:

cat /data/project/studyforrest/anondata/sub001/behav/task002_run001/behavdata.txt
"run","run_id","volume","run_volume","stim","genre","delay","catch","sound_soa","trigger_ts"
1,6,0,0,"rocknroll_002.wav","rocknroll",6,0,0.0072000000000116415,1233.5005
1,6,6,6,"symphonic_003.wav","symphonic",6,0,0.0028999999999541615,1245.4996
1,6,12,12,"rocknroll_001.wav","rocknroll",6,0,0.002499999999827196,1257.4997
1,6,18,18,"metal_004.wav","metal",6,0,0.0026000000000294676,1269.5
1,6,24,24,"symphonic_002.wav","symphonic",8,1,0.013300000000072032,1281.5003
1,6,31,31,"country_003.wav","country",6,0,0.0027000000000043656,1295.4996
1,6,37,37,"country_002.wav","country",6,0,0.0025000000000545697,1307.4993
1,6,43,43,"ambient_001.wav","ambient",6,0,0.002900000000181535,1319.4994
1,6,49,49,"ambient_004.wav","ambient",8,1,0.007800000000088403,1331.499
1,6,56,56,"country_000.wav","country",4,0,0.004100000000107684,1345.4985
1,6,61,61,"symphonic_001.wav","symphonic",6,0,0.0025000000000545697,1355.499
1,6,67,67,"symphonic_004.wav","symphonic",4,0,0.0036000000000058208,1367.4987
1,6,72,72,"ambient_003.wav","ambient",6,0,0.003500000000030923,1377.4986
1,6,78,78,"metal_003.wav","metal",4,0,0.0025000000000545697,1389.4985
1,6,83,83,"metal_000.wav","metal",6,0,0.0025000000000545697,1399.4991
1,6,89,89,"ambient_000.wav","ambient",6,0,0.0026000000000294676,1411.4989
1,6,95,95,"rocknroll_003.wav","rocknroll",8,1,0.015100000000074942,1423.4995
1,6,102,102,"country_004.wav","country",6,0,0.0027000000000043656,1437.4986
1,6,108,108,"rocknroll_004.wav","rocknroll",4,0,0.0026000000000294676,1449.4994
1,6,113,113,"ambient_002.wav","ambient",4,0,0.0026000000000294676,1459.4995
1,6,118,118,"symphonic_000.wav","symphonic",6,0,0.0025000000000545697,1469.4986
1,6,124,124,"metal_002.wav","metal",8,1,0.009200000000191721,1481.4986
1,6,131,131,"country_001.wav","country",8,1,0.011099999999942156,1495.4987
1,6,138,138,"metal_001.wav","metal",6,0,0.0036000000000058208,1509.4989
1,6,144,144,"rocknroll_000.wav","rocknroll",6,0,0.0025000000000545697,1521.4994
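
For cross-checking the two formats cell by cell, a small script could compare the columns that both files share. The sketch below makes a few assumptions: mismatches, read_rows and same_value are hypothetical names, rows are assumed to be in the same order in both files, and numeric cells are compared with a tolerance because the two files print floats with different precision.

```python
import csv
import io
import math


def read_rows(text, delimiter):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def same_value(a, b, rel_tol=1e-9):
    """Compare two cells numerically when possible, else as strings."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        return a == b


def mismatches(events_text, behav_text):
    """List (row, column, events value, behav value) for differing cells."""
    events = read_rows(events_text, "\t")  # BIDS events.tsv
    behav = read_rows(behav_text, ",")     # behavdata.txt (quoted CSV)
    if len(events) != len(behav):
        raise ValueError(
            "row counts differ: %d vs %d" % (len(events), len(behav)))
    found = []
    for i, (ev, bh) in enumerate(zip(events, behav)):
        # Only the behavdata columns are checked; onset, duration and
        # trial_type exist in events.tsv only.
        for column in bh:
            if column not in ev or not same_value(ev[column], bh[column]):
                found.append((i, column, ev.get(column), bh[column]))
    return found
```

An empty list means that all shared columns agree row by row; each entry otherwise names the row index, the column, and the two differing values.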

@m-wierzba
Author

@bpoldrack, I'm not sure if this is going to be useful in any way, but you can find some very simple scripts that use nibabel to compare two NIfTI files here: /data/project/studyforrest/openneuro/unittests-comparison

  • the nibabel-header-compare.py compares the headers:

python3 nibabel-header-compare.py <nii1> <nii2>

  • the nibabel-img-compare.py compares the image data:

python3 nibabel-img-compare.py <nii1> <nii2>

Obviously, you need to know what to compare against what. If you have any thoughts on what could be improved for this to be useful, just let me know (pinging @mih).

@m-wierzba
Author

Just FTR: with regard to the presumed fslreorient2std issue, I have run and checked that for the subject we've been looking at previously.

anondata path:
/data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz

openneuro path:
/data/project/studyforrest/openneuro/ds000113/sub-01/ses-forrestgump/func/sub-01_ses-forrestgump_task-forrestgump_acq-raw_run-04_bold.nii.gz

I have run fslreorient2std again on the anondata file:

cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ fslreorient2std /data/project/studyforrest/anondata/sub001/BOLD/task001_run004/bold.nii.gz ./reoriented.nii.gz

Now, the FSL header information has been obtained with fslhd for all three files (anondata, openneuro, and the newly created reoriented.nii.gz):

cynamon@juseless in /data/project/studyforrest/openneuro/fslhd-comparison
❱ ls *hd.txt
newhd.txt  oldhd.txt  reorientedhd.txt

The conclusion is that fslreorient2std doesn't seem to have caused the problem:

❱ diff oldhd.txt reorientedhd.txt                      
1c1
< filename       ../anondata/sub001/BOLD/task001_run004/bold.nii.gz
---
> filename       reoriented.nii.gz
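
If more file pairs need to be checked this way, the filename line could be ignored automatically. The following is a minimal sketch that assumes the fslhd text dumps have one field per line, with the field name as the first whitespace-separated token; parse_fslhd and header_diff are hypothetical names.

```python
def parse_fslhd(text):
    """Parse fslhd text output into {field: value}; blank lines are skipped."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts:
            fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def header_diff(text_a, text_b, ignore=("filename",)):
    """Return {field: (value_a, value_b)} for fields that differ."""
    a, b = parse_fslhd(text_a), parse_fslhd(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if key not in ignore and a.get(key) != b.get(key)
    }
```

An empty result for the contents of oldhd.txt and reorientedhd.txt would correspond to the diff above, where only the filename differs.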

@mih
Contributor

mih commented Apr 30, 2021

Cool, that is good to know. Hence there is no point in trying to implement something like this in the conversion. Also, the response to my original issues was along the lines of "unclear why". I'd say we stick with the output of the more modern converter that @bpoldrack is using.
