David/fig5 #42
base: main
Conversation
Most comments do not affect functionality, but it would be nice if they were fixed.
These do affect functionality:
- The argument names are not consistent between where they are defined and where they are used, for all scripts.
- In the precision_per_sensitivity.py script, I don't think the precision is being computed properly.
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
    'CNN_Manakov_full': 'Manakov_2,524,246',
    "CNN_Manakov_subset_200": 'Manakov_subset_200',
    "CNN_Manakov_subset_2k": 'Manakov_subset_2k',
    "CNN_Manakov_subset_7720": 'Manakov_subset_7720',
    "CNN_Manakov_subset_20k": 'Manakov_subset_20k',
    "CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
    "random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
    "Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
I think you could use the values in the manakov-rename-map dictionary instead of passing them again as a separate argument in --method-names? Maybe you could order the dictionary entries in the same order you want your labels?
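Something like this sketch, assuming the dict entries are reordered to match the label order and 'random' stays first:

# hypothetical: build the plot order from the rename map's values (dicts keep
# insertion order in Python 3.7+) instead of repeating them in --method-names
method_names = ["random"] + list(args.manakov_rename_map.values())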
This applies to all 3 scripts.
def main(args):
    dataset_path = args.dataset_dir + args.dataset_name
You could avoid this and just pass the path to the file as an argument, so you have one argument instead of two, no?
This applies to all 3 scripts.
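A sketch with a hypothetical --dataset-path argument replacing the two:

# hypothetical single-argument version
parser.add_argument('--dataset-path', type=str,
                    default='predictions/AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Path to the dataset file')

def main(args):
    dataset_path = args.dataset_path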
def generate_random_predictions(length):
    return [random.uniform(0, 1) for _ in range(length)]
We do not need random predictions here, so I guess this function, and wherever else it is called, is extra, right? Including mentions in the md file for this script.
'Manakov_subset_200': np.log10(200),
'Manakov_subset_2k': np.log10(2000),
'Manakov_subset_7720': np.log10(7720),
'Manakov_subset_20k': np.log10(20000),
'Manakov_subset_200k': np.log10(200000),
'Manakov_2,524,246': np.log10(2524246)
Ideally the subset sizes would not be hard-coded, in case we want to add more sizes.
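For example, a sketch that derives the size from the column name, assuming names always end in the count with an optional k/M suffix like the defaults above:

import re
import numpy as np

# hypothetical helper: 'Manakov_subset_2k' -> log10(2000), 'Manakov_2,524,246' -> log10(2524246)
SUFFIX = {'k': 1_000, 'M': 1_000_000}

def log_subset_size(name):
    count, suffix = re.search(r'([\d,]+)([kM])?$', name).groups()
    return np.log10(int(count.replace(',', '')) * SUFFIX.get(suffix, 1))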
parser.add_argument('--dataset-dir', type=str, default='predictions/',
                    help='Directory containing the dataset')
parser.add_argument('--dataset-name', type=str,
                    default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Name of the dataset file')
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
    'CNN_Manakov_full': 'Manakov_2,524,246',
    "CNN_Manakov_subset_200": 'Manakov_subset_200',
    "CNN_Manakov_subset_2k": 'Manakov_subset_2k',
    "CNN_Manakov_subset_7720": 'Manakov_subset_7720',
    "CNN_Manakov_subset_20k": 'Manakov_subset_20k',
    "CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
    "random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
    "Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
Decide whether you want your arguments hyphenated or underscored, because you define them with hyphens (e.g. --dataset-dir in line 79) but throughout your script you use underscores (e.g. --dataset_dir in line 47).
This applies to all 3 scripts.
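For what it's worth, argparse itself maps hyphens to underscores for attribute access, so args.dataset_dir is the expected way to read --dataset-dir inside the script; the mismatch only bites if the flag is spelled --dataset_dir on the command line:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-dir', type=str, default='predictions/')
args = parser.parse_args(['--dataset-dir', 'predictions/'])
print(args.dataset_dir)  # hyphenated flag, underscored attribute
# parser.parse_args(['--dataset_dir', 'x'])  # would fail: unrecognized argument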
fi

download_dir="predictions"
plot_output_dir="output"
I think this plot_output_dir variable is defined and used to make the dir, but then never used?
Ideally the dirs, the fileID, and the name given to the file would be passed as arguments, not hard-coded.
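A bash sketch, with hypothetical positional arguments and defaults:

# hypothetical: take the dirs as arguments instead of hard-coding them
download_dir="${1:-predictions}"
plot_output_dir="${2:-output}"
mkdir -p "$download_dir" "$plot_output_dir"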
plt.savefig(f"output/{args.dataset_name}{title_suffix}.svg", format='svg') | ||
plt.savefig(f"output/{args.dataset_name}{title_suffix}.png", format='png') |
The output dir is hardcoded here; you could use your output dir argument here instead.
The same applies to all 3 scripts.
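Assuming a hypothetical --output-dir argument, something like:

import os

# hypothetical: route both figures through the output-dir argument
out_base = os.path.join(args.output_dir, f"{args.dataset_name}{title_suffix}")
plt.savefig(f"{out_base}.svg", format='svg')
plt.savefig(f"{out_base}.png", format='png')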
precision, recall, thresholds = precision_recall_curve(true_labels, predictions)

for threshold in sensitivity_thresholds:
    idx_closest = np.argmin(np.abs(recall - threshold))
I don't think this is how we should be computing this.
thresholds in line 30 refers to decision thresholds, not recall/sensitivity thresholds, i.e. the prediction value that is the cutoff for rounding scores to binary (0 or 1) so that precision and recall can be computed. @evaklimentova @katarinagresova please correct me if I am wrong.
We should be looking for the precision value that corresponds to recall 0.5 (and 0.33 if we are doing that too), for each model.
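A sketch of what I have in mind, assuming desired_sensitivity is the target recall (e.g. 0.5):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(true_labels, predictions)
# recall decreases along the curve scikit-learn returns, so the last index still
# meeting the target is the operating point whose recall just reaches it
indices = np.where(recall >= desired_sensitivity)[0]
precision_at_sensitivity = precision[indices[-1]]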
I think this might be a problem with variable names. On line 30, decision thresholds are stored in the variable thresholds. But on line 32, the iteration is over sensitivity_thresholds, which is a user-defined variable.
In the previous script, to get the index of the desired sensitivity, the code is as follows:
indices = np.where(tpr >= desired_sensitivity)[0]
Why is it different now?
@katarinagresova wanted to see how this was being computed.
Can we include the script that was used to create the plots in the current version of the manuscript?
"""Calculates AUC-PR for specified method columns in the predictions DataFrame.""" | ||
auc_values = {} | ||
for method in method_names: | ||
if method in predictions.columns and pd.api.types.is_numeric_dtype(predictions[method]): |
I wouldn't exclude columns that do not pass pd.api.types.is_numeric_dtype(predictions[method]).
I would try to convert the column to numeric first. If that fails, raise an exception or at least inform the user that this column cannot be used.
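For example (a sketch):

import pandas as pd

# hypothetical: convert first, fail loudly if the column is unusable
try:
    predictions[method] = pd.to_numeric(predictions[method])
except (ValueError, TypeError) as err:
    raise ValueError(f"Column '{method}' cannot be used for AUC-PR") from err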
parser.add_argument('--dataset-name', type=str,
                    default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Name of the dataset file')
parser.add_argument('--thresholds', type=float, nargs='+',
I think we should enforce the same sensitivity for each method. Also, the name of the argument is confusing.
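E.g. a single shared value with a more explicit (hypothetical) name:

# hypothetical rename: one recall level applied to every method
parser.add_argument('--desired-sensitivity', type=float, default=0.5,
                    help='Recall (sensitivity) at which precision is reported for all methods')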
#!/bin/bash
Could you incorporate this script into https://github.com/BioGeMT/miRBench_paper/blob/main/code/plots/process_datasets_for_plots.py for consistency?
Or coordinate dirs with #38.
Scripts to compute evaluation of predictions from miRNA_CNN_Hejret2023 models retrained on Manakov2022 train set subsets and tested on the Manakov2022 1:100 test set, and to produce the Fig5 plots.