grouping test cases and categorisation #112
Hey @pr4deepr , great idea! This categorization is obviously a subjective task. We could automate this and make it more objective using an LLM; a text-classification LLM. Do you by chance have experience with this? Cheers,
No, I do not. I copied the readme in the main repo containing the descriptions of the current test cases into ChatGPT (GPT-4o) and asked: "I have some python functions and each of them performs a specific operation in bioimage analysis. Classify them into categories based on their function and where they will fit in the image analysis pipeline." It answered:

1. Image Preprocessing: rgb_to_grey_image_transform
2. Image Enhancement: detect_edges
3. Segmentation: apply_otsu_threshold_and_count_postiive_pixels
4. Morphological Operations: binary_closing
5. Quantification and Measurement: convex_hull_measure_area
6. Feature Extraction: fit_circle
7. File I/O: list_image_files_in_folder
8. Statistical Analysis: bland_altman
9. Pipeline/Workflow Automation: workflow_batch_process_folder_count_labels
10. Miscellaneous: return_hello_world
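If we wanted to automate this instead of pasting into the chat interface, something along these lines might work; a minimal sketch assuming the `openai` Python client, where the prompt wording, model name, and helper function are my own placeholders rather than anything from the repo:

```python
# Hypothetical sketch: classify a test-case function with GPT-4o.
# Assumes the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

CATEGORIES = [
    "Image Preprocessing", "Image Enhancement", "Segmentation",
    "Morphological Operations", "Quantification and Measurement",
    "Feature Extraction", "File I/O", "Statistical Analysis",
    "Pipeline/Workflow Automation", "Miscellaneous",
]

def classify_function(name: str, description: str) -> str:
    """Ask the model to pick one category for a single test-case function."""
    prompt = (
        "Classify the following bioimage analysis function into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n"
        f"Function: {name}\nDescription: {description}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. classify_function("binary_closing", "Applies morphological closing to a binary image.")
```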
Awesome! I couldn't have done better
I'm creating a branch with the new categories. There are errors and repetitions above, so I need to clean it up as well: https://github.com/pr4deepr/human-eval-bia/tree/function_categorize
Well, I think test-cases can be in multiple categories.
Good point
So, I've done an initial pass. Interesting findings:
The categorisation of course is important and, if not done properly, can misrepresent the results. The function categorisation can be found here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/create_function_category_yaml.ipynb, which saves it as a yaml file. I can create a separate notebook for the data processing and graphing, as it's currently here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/summarize_by_case.ipynb
Happy to create a PR, but wasn't sure if it should be to main.
Yes! I certainly need such a figure for talks, because showing the blue table for all test-cases doesn't fit on a slide. It could also be in the paper... Curious what @tischi says about this figure. What I'm a bit concerned about is the static list of categories in the other notebook. It could be a pain to maintain this mid/long term. Would it be possible to put them in a dataframe, and add some code that warns if a test-case is in no category? Or even better, code that uses gpt4-o to categorize test-cases that are in no category and then adds them to the dataframe?
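Something like this is what I have in mind; a rough sketch, where the dataframe columns and the way test-case names are collected are assumptions:

```python
# Hypothetical sketch: warn about test-cases that are in no category.
# Assumes a DataFrame `categories_df` with columns "test_case" and "category",
# and an iterable `all_test_cases` of test-case names collected from the notebooks.
import warnings
import pandas as pd

def warn_uncategorized(all_test_cases, categories_df: pd.DataFrame) -> list:
    categorized = set(categories_df["test_case"])
    missing = sorted(set(all_test_cases) - categorized)
    for name in missing:
        warnings.warn(f"Test case '{name}' is not assigned to any category.")
    return missing
```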
Regarding categorization, why not require some metadata tag to be present with each submitted test case? Maybe it's too late for this, but we could add it to existing test cases as the numbers seem manageable. Doing it with an LLM would anyway require manual review of the outcome.
Yet, we don't have any infrastructure for handling metadata of test-cases. I was hoping to fully automate this, so that only minimal manual curation is necessary. In an earlier discussion, categorizing code depending on its complexity was also discussed. No matter how we do these things, I'd love to have a semi-automatic solution with minimal code/infrastructure to maintain.
We can use the GPT4-o idea, but is there a way to have a seed or something similar to guarantee relatively similar responses? The categories change every time I ask... Or we just need to be really specific about the question we ask GPT.
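E.g. something like this might help, though I haven't checked how stable it is in practice; a sketch assuming the `openai` Python client, where `temperature=0` and a fixed `seed` reduce, but don't fully eliminate, run-to-run variation:

```python
# Hypothetical sketch: request more reproducible classifications.
from openai import OpenAI

client = OpenAI()

def classify_reproducibly(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # always pick the most likely tokens
        seed=42,        # best-effort determinism across runs
    )
    return response.choices[0].message.content.strip()
```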
The tagging could be left to the author of the test case, given a choice of predefined categories. Then it should be a matter of reading the tags when compiling the results. If classification is automated with an LLM, the outcome is likely to change over time and with the LLM used. I think we would need a deterministic algorithm for this.
@pr4deepr Exactly what I thought likely :)
Ok, I leave the decision about this to you guys. Whatever works :-)
I'm happy with the solution from @jkh1 , i.e., having a few tags and getting the author of new test cases to put those tags in their functions. We can have a few different tags for each category. This could be a requirement when submitting a new test case. For existing functions, perhaps myself and @jkh1 could go through and tag them.
Cheers
Can you give an example of how this could look?
Either in the functions or in each notebook. I need to look at the code first. Will update it here.
Upon looking at the code again, I think we'll want to minimize modifications to existing test functions and avoid creating yaml files for individual cases at this point. I propose we have all the categorisation information in a single yaml file with:
The categories can be:
example yaml file:
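Something along these lines; the file name, field layout, and the category assignments below are just placeholders to illustrate the structure, not the final file:

```yaml
# test_case_categories.yaml -- placeholder sketch of the structure
test_cases:
  binary_closing:
    categories:
      - Morphological Operations
  bland_altman:
    categories:
      - Statistical Analysis
  workflow_batch_process_folder_count_labels:
    categories:
      - Pipeline/Workflow Automation
      - Quantification and Measurement
```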
I'm happy to go through existing test cases and create this yaml file. When a test case PR is submitted, the yaml file will have to be modified to add the new function and category. The PR template will need to be modified. If the need arises we can expand the categories, but I feel like this should cover it.
Yes, great idea!
We can also add some python code which tests if all test cases are in this yaml file, e.g. in create_cases.ipynb or as a github workflow.
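A minimal sketch of such a check, assuming the notebooks live in a test_cases/ folder and the yaml has a top-level test_cases mapping like in the placeholder above:

```python
# Hypothetical sketch: fail if any test-case notebook is missing from the category yaml.
# The folder name, yaml file name, and yaml layout are assumptions.
from pathlib import Path
import yaml

def check_all_cases_categorized(cases_dir: str = "test_cases",
                                yaml_path: str = "test_case_categories.yaml") -> None:
    with open(yaml_path) as f:
        categorized = set(yaml.safe_load(f)["test_cases"].keys())
    notebooks = {p.stem for p in Path(cases_dir).glob("*.ipynb")}
    missing = sorted(notebooks - categorized)
    assert not missing, f"Test cases missing from {yaml_path}: {missing}"
```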
Sounds good to me. My initial idea was to use notebook tags, but I realized this may be more complicated to get at.
I've made the necessary changes with commit history here: https://github.com/pr4deepr/human-eval-bia/commits/function_categorize/
If you are happy with this, I can open a PR. Let me know which branch you'd prefer.
Awesome @pr4deepr , thanks for working on this! Yes, please send a PR!
Hi @haesleinhuepf
I was going through the preprint and one thought I had was grouping the test cases under categories. We get an overall view of how well LLMs perform, but lose the granularity of whether LLMs perform better or worse on certain tasks than on others.
For example, with our test cases, perhaps the grouping could be something like:
etc.
It may give an idea of where we need more or fewer test cases as well.
I remember you had a preprint on ontologies and standards for bioimage analysis. Perhaps that can be used as a reference.
Cheers
Pradeep