-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADNI] Handle reading new format of clinical csv #1016
[ADNI] Handle reading new format of clinical csv #1016
Conversation
Hello @MatthieuJoulot! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2023-11-17 13:42:50 UTC |
adni_merge_path = path.join(clinical_data_dir, "ADNIMERGE.csv") | ||
participants = set( | ||
read_csv(adni_merge_path, sep=",", usecols=["PTID"], squeeze=True).unique() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @MatthieuJoulot !
I think this is looking good. I made some suggestions to have clearer error messages.
I think it would be great to add one or two subject/session with the new format to the CI data such that we can test that we are able to handle both formats. WDYT ?
import re | ||
from pathlib import Path | ||
|
||
pattern = filename + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we expecting to have exactly one file matching this pattern ?
If yes, I think we should verify this and raise if more than one file was found. Currently the function returns the last one found. Maybe something like this:
files_matching_pattern = [
f for f in Path(clinical_dir).rglob("*.csv") if re.search(pattern, (z.name))
]
if len(files_matching_pattern) != 1:
raise IOError(
f"Expecting to find exactly one file in folder {clinical_dir} "
f"matching pattern {pattern}. {len(files_matching_pattern)} "
f"files were found instead : {'\n- '.join(files_matching_pattern)}"
)
try:
return ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you download the data twice, yes, we expect only one file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done !
try: | ||
return pd.read_csv(adni_merge_path, sep=",", low_memory=False) | ||
except: | ||
raise ValueError(f"{filename}.csv was not found. Please check your data.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be a bit more specific here. If you get to this line, the file exists but the reading as a CSV file failed:
raise ValueError(
f"File {adni_merge_path} was found but could not "
"be loaded as a DataFrame. Please check your data."
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not what happens. The file is actually not found (because of the name change), and it errors because adni_merge_path
is therefore not defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant, if you implement the suggestion above, the file has to exist otherwise an error would have been raised before. So you would get to this line only if the CSV file wasn't formatted as expected. Right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I have been reading from bottom to top, I'll come back afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done as well, works better reading from the top
|
||
file_to_read_path = path.join(clinical_data_dir, location) | ||
cprint(f"\tReading clinical data file: {location}") | ||
pattern = location.split(".")[0] + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to use your load_clinical_csv()
function here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must have missed that one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha yes, actually there was. The way this was built before I changed things was different and I did not see how to use load_clinical_csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something, but I don't understand why you couldn't do:
for location in files:
location = location.split("/")[0]
df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
df_filtered = filter_subj_bids(df_file, location, bids_ids).copy()
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it looks like this should work. I did not use it, because this part of the code needs to be able not to find a file and this suggestion would make the code crash. Maybe we can work something out though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I nested this part in a try. I'm not fond of it, but it works. WDYT
|
||
file_to_read_path = path.join(clinical_data_dir, location) | ||
cprint(f"\tReading clinical data file: {location}") | ||
pattern = location.split(".")[0] + "(_\d{1,2}[A-Za-z]{3}\d{4})?.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably missing something, but I don't understand why you couldn't do:
for location in files:
location = location.split("/")[0]
df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
df_filtered = filter_subj_bids(df_file, location, bids_ids).copy()
...
cprint(f"\tReading clinical data file: {location}") | ||
|
||
df_file = pd.read_csv(file_to_read_path, dtype=str) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put as few lines as possible in the try-catch and catch explicit errors. In this case, we need to catch the IOError which happens when the number of found files isn't exactly one. It could be a good idea to give a warning and continue the loop. WDYT ?
try:
df_file = load_clinical_csv(clinical_data_dir, location.split(".")[0])
except IOError as e:
warnings.warn(e)
continue
df_filtered = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @MatthieuJoulot !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @MatthieuJoulot !
Just a small suggestion on removing parametrization for tests using a single value.
LGTM otherwise !
assert_frame_equal(load_clinical_csv(tmp_path, csv_to_look_for), input_df) | ||
|
||
|
||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the parametrization if you have only one value.
remove csv_to_look_for
from the arguments of the test function and simply use "adnimerge" in the test function's body.
load_clinical_csv(tmp_path, csv_to_look_for) | ||
|
||
|
||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here
with open(tmp_path / "adnimerge.csv", "w") as fp: | ||
fp.write("col1,col2,col3\n1,2,3\n1,2,3,4") | ||
|
||
# input_df.to_csv(tmp_path / csv_name, sep="\t", index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# input_df.to_csv(tmp_path / csv_name, sep="\t", index=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the linter is unhappy... Could you format ?
input_df.to_csv(tmp_path / csv_name, index=False) | ||
assert_frame_equal(load_clinical_csv(tmp_path, csv_to_look_for), input_df) | ||
|
||
def test_load_clinical_csv_error(tmp_path, ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_load_clinical_csv_error(tmp_path, ): | |
def test_load_clinical_csv_error(tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange I thought I did. I will again then
Closes #1002
ADNI has changed the format of the clinical csv, thus breaking the converter. This PR aims at fixing this problem.