David/fig5 #42
base: main
Conversation
Most comments do not affect functionality, but it would be nice if they were fixed.
These do affect functionality:
- The argument names are not consistent between where they are defined and where they are used, for all scripts.
- In the precision_per_sensitivity.py script, I don't think the precision is being computed properly.
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
    'CNN_Manakov_full': 'Manakov_2,524,246',
    "CNN_Manakov_subset_200": 'Manakov_subset_200',
    "CNN_Manakov_subset_2k": 'Manakov_subset_2k',
    "CNN_Manakov_subset_7720": 'Manakov_subset_7720',
    "CNN_Manakov_subset_20k": 'Manakov_subset_20k',
    "CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
    "random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
    "Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
I think you could use the values in the manakov-rename-map dictionary instead of passing them again as a separate argument in --method-names? Maybe you could order the dictionary entries in the same order you want your labels?
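Something like this sketch, assuming the dict entries are reordered to match the label order and 'random' stays first:

# hypothetical: build the plot order from the rename map's values (dicts keep
# insertion order in Python 3.7+) instead of repeating them in --method-names
method_names = ["random"] + list(args.manakov_rename_map.values())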
This applies to all 3 scripts.
def main(args):
    dataset_path = args.dataset_dir + args.dataset_name
You could avoid this and just pass the path to the file as an argument, so you have one argument instead of two, no?
This applies to all 3 scripts.
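A sketch with a hypothetical --dataset-path argument replacing the two:

# hypothetical single-argument version
parser.add_argument('--dataset-path', type=str,
                    default='predictions/AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Path to the dataset file')

def main(args):
    dataset_path = args.dataset_path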
def generate_random_predictions(length):
    return [random.uniform(0, 1) for _ in range(length)]
We do not need random predictions here, so I guess this function, and wherever else it is called, is extra, right? Including mentions in the md file for this script.
'Manakov_subset_200': np.log10(200),
'Manakov_subset_2k': np.log10(2000),
'Manakov_subset_7720': np.log10(7720),
'Manakov_subset_20k': np.log10(20000),
'Manakov_subset_200k': np.log10(200000),
'Manakov_2,524,246': np.log10(2524246)
Ideally the subset sizes would not be hard-coded, in case we want to add more sizes.
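For example, a sketch that derives the size from the column name, assuming names always end in the count with an optional k/M suffix like the defaults above:

import re
import numpy as np

# hypothetical helper: 'Manakov_subset_2k' -> log10(2000), 'Manakov_2,524,246' -> log10(2524246)
SUFFIX = {'k': 1_000, 'M': 1_000_000}

def log_subset_size(name):
    count, suffix = re.search(r'([\d,]+)([kM])?$', name).groups()
    return np.log10(int(count.replace(',', '')) * SUFFIX.get(suffix, 1))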
parser.add_argument('--dataset-dir', type=str, default='predictions/',
                    help='Directory containing the dataset')
parser.add_argument('--dataset-name', type=str,
                    default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Name of the dataset file')
parser.add_argument('--manakov-rename-map', type=lambda x: eval(x), default={
    'CNN_Manakov_full': 'Manakov_2,524,246',
    "CNN_Manakov_subset_200": 'Manakov_subset_200',
    "CNN_Manakov_subset_2k": 'Manakov_subset_2k',
    "CNN_Manakov_subset_7720": 'Manakov_subset_7720',
    "CNN_Manakov_subset_20k": 'Manakov_subset_20k',
    "CNN_Manakov_subset_200k": 'Manakov_subset_200k',
}, help='Mapping for renaming Manakov dataset columns')
parser.add_argument('--method-names', type=lambda x: eval(x), default=[
    "random", "Manakov_subset_200", "Manakov_subset_2k", "Manakov_subset_7720",
    "Manakov_subset_20k", "Manakov_subset_200k", "Manakov_2,524,246"
], help='List of method names for plotting')
Decide whether you want your arguments hyphenated or underscored, because you define them with hyphens (e.g. --dataset-dir in line 79) but throughout your script you use underscores (e.g. --dataset_dir in line 47).
This applies to all 3 scripts.
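For what it's worth, argparse itself maps hyphens to underscores for attribute access, so args.dataset_dir is the expected way to read --dataset-dir inside the script; the mismatch only bites if the flag is spelled --dataset_dir on the command line:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-dir', type=str, default='predictions/')
args = parser.parse_args(['--dataset-dir', 'predictions/'])
print(args.dataset_dir)  # hyphenated flag, underscored attribute
# parser.parse_args(['--dataset_dir', 'x'])  # would fail: unrecognized argument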
fi

download_dir="predictions"
plot_output_dir="output"
I think this plot_output_dir variable is defined and used to make the dir, but then never used?
Ideally the dirs, the fileID, and the name given to the file would be passed as arguments, not hard-coded.
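A bash sketch, with hypothetical positional arguments and defaults:

# hypothetical: take the dirs as arguments instead of hard-coding them
download_dir="${1:-predictions}"
plot_output_dir="${2:-output}"
mkdir -p "$download_dir" "$plot_output_dir"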
plt.savefig(f"output/{args.dataset_name}{title_suffix}.svg", format='svg') | ||
plt.savefig(f"output/{args.dataset_name}{title_suffix}.png", format='png') |
The output dir is hardcoded here; you could use your output dir argument here instead.
The same applies to all 3 scripts.
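Assuming a hypothetical --output-dir argument, something like:

import os

# hypothetical: route both figures through the output-dir argument
out_base = os.path.join(args.output_dir, f"{args.dataset_name}{title_suffix}")
plt.savefig(f"{out_base}.svg", format='svg')
plt.savefig(f"{out_base}.png", format='png')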
precision, recall, thresholds = precision_recall_curve(true_labels, predictions)

for threshold in sensitivity_thresholds:
    idx_closest = np.argmin(np.abs(recall - threshold))
I don't think this is how we should be computing this.
thresholds in line 30 refers to decision thresholds, not recall/sensitivity thresholds, i.e. the prediction value that is the cutoff for rounding scores to binary (0 or 1) so that precision and recall can be computed. @evaklimentova @katarinagresova please correct me if I am wrong.
We should be looking for the precision value that corresponds to recall 0.5 (and 0.33 if we are doing that too), for each model.
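A sketch of what I have in mind, assuming desired_sensitivity is the target recall (e.g. 0.5):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, _ = precision_recall_curve(true_labels, predictions)
# recall decreases along the curve scikit-learn returns, so the last index still
# meeting the target is the operating point whose recall just reaches it
indices = np.where(recall >= desired_sensitivity)[0]
precision_at_sensitivity = precision[indices[-1]]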
I think this might be a problem with variable names. On line 30, decision thresholds are stored in the variable thresholds. But on line 32, the iteration is over sensitivity_thresholds, which is a user-defined variable.
In the previous script, to get the index of the desired sensitivity, the code is as follows:
indices = np.where(tpr >= desired_sensitivity)[0]
Why is it different now?
@katarinagresova wanted to see how this was being computed.
Can we include the script that was used to create the plots in the current version of the manuscript?
"""Calculates AUC-PR for specified method columns in the predictions DataFrame.""" | ||
auc_values = {} | ||
for method in method_names: | ||
if method in predictions.columns and pd.api.types.is_numeric_dtype(predictions[method]): |
I wouldn't exclude columns that do not pass pd.api.types.is_numeric_dtype(predictions[method]).
I would try to convert the column to numeric first. If that fails, raise an exception or at least inform the user that this column cannot be used.
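For example (a sketch):

import pandas as pd

# hypothetical: convert first, fail loudly if the column is unusable
try:
    predictions[method] = pd.to_numeric(predictions[method])
except (ValueError, TypeError) as err:
    raise ValueError(f"Column '{method}' cannot be used for AUC-PR") from err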
parser.add_argument('--dataset-name', type=str,
                    default='AGO2_eCLIP_Manakov2022_100_CNN_predictions',
                    help='Name of the dataset file')
parser.add_argument('--thresholds', type=float, nargs='+',
I think we should enforce the same sensitivity for each method. Also, the name of the argument is confusing.
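E.g. a single shared value with a more explicit (hypothetical) name:

# hypothetical rename: one recall level applied to every method
parser.add_argument('--desired-sensitivity', type=float, default=0.5,
                    help='Recall (sensitivity) at which precision is reported for all methods')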
#!/bin/bash
Could you incorporate this script into https://github.com/BioGeMT/miRBench_paper/blob/main/code/plots/process_datasets_for_plots.py for consistency?
Or coordinate dirs with #38.
Scripts to compute evaluation of predictions from miRNA_CNN_Hejret2023 models retrained on Manakov2022 train set subsets and tested on the Manakov2022 1:100 test set, and to produce the Fig5 plots.