
David/fig5 #42

Open

davidcechak wants to merge 6 commits into main
Conversation

davidcechak commented:

Scripts to evaluate predictions of miRNA_CNN_Hejret2023 models retrained on Manakov2022 training-set subsets and tested on the Manakov2022 1:100 test set, and to produce the Fig5 plots.

stephaniesamm (Contributor) left a comment:

Most comments do not affect functionality, but it would be nice if they were fixed.

These do affect functionality:

  1. Argument names are inconsistent between where they are defined and where they are used, in all scripts.
  2. In the precision_per_sensitivity.py script, I don't think the precision is being computed properly.

Comment on lines +84 to +95
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
'CNN_Manakov_full': 'Manakov_2,524,246',
"CNN_Manakov_subset_200": 'Manakov_subset_200',
"CNN_Manakov_subset_2k": 'Manakov_subset_2k',
"CNN_Manakov_subset_7720": 'Manakov_subset_7720',
"CNN_Manakov_subset_20k": 'Manakov_subset_20k',
"CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
"random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
"Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
Contributor:

I think you could use the values in the manakov-rename-map dictionary instead of passing them again as a separate --method-names argument? Maybe you could order the dictionary entries in the same order you want your labels?
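
For illustration, a minimal sketch of this idea (assuming Python 3.7+, where dict insertion order is preserved):

# Derive the plotting labels from the rename map instead of passing them twice;
# ordering the entries in --manakov-rename-map then also orders the labels.
# "random" is prepended as the baseline.
method_names = ['random'] + list(args.manakov_rename_map.values())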

Contributor:

This applies to all 3 scripts.



def main(args):
dataset_path = args.dataset_dir + args.dataset_name
Contributor:

You could avoid this by passing the path to the file as a single argument instead of two, no?
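
For example (a sketch; --dataset-path is a hypothetical replacement for the two arguments):

# One argument instead of two: the full path to the predictions file.
parser.add_argument('--dataset-path', type=str,
                    default='predictions/AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Path to the dataset file')

def main(args):
    dataset_path = args.dataset_path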

Contributor:

This applies to all 3 scripts.

Comment on lines +10 to +11
def generate_random_predictions(length):
return [random.uniform(0, 1) for _ in range(length)]
Contributor:

We do not need random predictions here, so I guess this function, and wherever else it is called, is extra, right? Including the mentions in the md file for this script.

Comment on lines +62 to +67
'Manakov_subset_200': np.log10(200),
'Manakov_subset_2k': np.log10(2000),
'Manakov_subset_7720': np.log10(7720),
'Manakov_subset_20k': np.log10(20000),
'Manakov_subset_200k': np.log10(200000),
'Manakov_2,524,246': np.log10(2524246)
Contributor:

Ideally, the subset sizes would not be hard-coded, in case we want to add more sizes.
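
One possible way to derive them instead (a sketch; subset_size is a hypothetical helper that parses the size out of the label itself, and labels stands for the renamed column names):

import re
import numpy as np

def subset_size(label):
    # Parse the trailing size from labels such as 'Manakov_subset_2k',
    # 'Manakov_subset_7720', or 'Manakov_2,524,246'.
    match = re.search(r'([\d,]+)(k?)$', label)
    if match is None:
        raise ValueError(f'No subset size found in label: {label}')
    size = int(match.group(1).replace(',', ''))
    return size * 1000 if match.group(2) else size

# log10-scaled positions derived from the labels rather than hard-coded
positions = {label: np.log10(subset_size(label)) for label in labels}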

Comment on lines +79 to +95
parser.add_argument('--dataset-dir', type=str, default='predictions/',
help='Directory containing the dataset')
parser.add_argument('--dataset-name', type=str,
default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
help='Name of the dataset file')
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
'CNN_Manakov_full': 'Manakov_2,524,246',
"CNN_Manakov_subset_200": 'Manakov_subset_200',
"CNN_Manakov_subset_2k": 'Manakov_subset_2k',
"CNN_Manakov_subset_7720": 'Manakov_subset_7720',
"CNN_Manakov_subset_20k": 'Manakov_subset_20k',
"CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
"random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
"Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
Contributor:

Decide whether you want your arguments hyphenated or underscored: you define them with hyphens (e.g. --dataset-dir in line 79), but throughout your script you use underscores (e.g. --dataset_dir in line 47).
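
For reference, argparse converts hyphens to underscores only for attribute access; on the command line, the option string must match how it was defined. A minimal self-contained example:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-dir')                      # defined with a hyphen
args = parser.parse_args(['--dataset-dir', 'predictions/'])
print(args.dataset_dir)                                   # accessed with an underscore
# Passing '--dataset_dir' instead would exit with "unrecognized arguments",
# because the option was only defined with a hyphen.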

Contributor:

This applies to all 3 scripts.

fi

download_dir="predictions"
plot_output_dir="output"
Contributor:

I think this plot_output_dir variable is defined and used to make the directory, but then never used again?

Contributor:

Ideally, the directories, the file ID, and the name given to the file would be passed as arguments, not hard-coded.

Comment on lines +73 to +74
plt.savefig(f"output/{args.dataset_name}{title_suffix}.svg", format='svg')
plt.savefig(f"output/{args.dataset_name}{title_suffix}.png", format='png')
Contributor:

The output dir is hard-coded here; you could use your output-dir argument instead.

The same applies to all 3 scripts.
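
For instance (a sketch, assuming the script defines an --output-dir argument as suggested):

import os

# Build the output path from the output-dir argument instead of hard-coding "output/".
out_base = os.path.join(args.output_dir, f"{args.dataset_name}{title_suffix}")
plt.savefig(f"{out_base}.svg", format='svg')
plt.savefig(f"{out_base}.png", format='png')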

precision, recall, thresholds = precision_recall_curve(true_labels, predictions)

for threshold in sensitivity_thresholds:
idx_closest = np.argmin(np.abs(recall - threshold))
Contributor:

I don't think this is how we should be computing this.

thresholds on line 30 refers to decision thresholds, not recall/sensitivity thresholds, i.e. the prediction value that is the cutoff for rounding scores to binary (0 or 1) so that precision and recall can be computed. @evaklimentova @katarinagresova please correct me if I am wrong.

We should be looking for the precision value that corresponds to recall 0.5 (and 0.33, if we are doing that too), for each model.
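
One way this lookup could be done with sklearn's precision_recall_curve, which returns recall in decreasing order (a sketch; precision_at_recall is a hypothetical helper):

import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(true_labels, predictions, target_recall=0.5):
    precision, recall, _ = precision_recall_curve(true_labels, predictions)
    # recall decreases along the curve, so the last index with
    # recall >= target is the operating point closest to the target from above
    idx = np.where(recall >= target_recall)[0][-1]
    return precision[idx]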

Collaborator:

I think this might be a problem with variable names. On line 30, the decision thresholds are stored in the variable thresholds, but on line 32 the iteration is over sensitivity_thresholds, which is a user-defined variable.

In the previous script, the index of the desired sensitivity was obtained as follows:
indices = np.where(tpr >= desired_sensitivity)[0]
Why is it different now?

Contributor:

@katarinagresova wanted to see how this was being computed.

katarinagresova (Collaborator) left a comment:

Can we include the script that was used to create the plots in the current version of the manuscript?

"""Calculates AUC-PR for specified method columns in the predictions DataFrame."""
auc_values = {}
for method in method_names:
if method in predictions.columns and pd.api.types.is_numeric_dtype(predictions[method]):
Collaborator:

I wouldn't exclude columns that do not pass pd.api.types.is_numeric_dtype(predictions[method]).
I would try to convert the column to numeric first. If that fails, raise an exception or at least inform the user that this column cannot be used.
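
A sketch of that, using pd.to_numeric:

import pandas as pd

# Attempt the conversion instead of silently skipping the column; fail loudly.
try:
    scores = pd.to_numeric(predictions[method])
except (ValueError, TypeError) as err:
    raise ValueError(f"Column '{method}' cannot be converted to numeric "
                     f"and cannot be used for AUC-PR") from err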

parser.add_argument('--dataset-name', type=str,
default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
help='Name of the dataset file')
parser.add_argument('--thresholds', type=float, nargs='+',
Collaborator:

I think we should enforce the same sensitivity for each method. Also, the name of the argument is confusing.
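
For example (a sketch; a single --sensitivity argument is a hypothetical rename of --thresholds):

parser.add_argument('--sensitivity', type=float, default=0.5,
                    help='Recall (sensitivity) level at which precision is '
                         'reported; applied identically to every method')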

@@ -0,0 +1,20 @@
#!/bin/bash
Collaborator:

Could you incorporate this script into https://github.com/BioGeMT/miRBench_paper/blob/main/code/plots/process_datasets_for_plots.py for consistency?
Or coordinate directories with #38.

