grouping test cases and categorisation #112

Open
pr4deepr opened this issue Sep 4, 2024 · 23 comments · Fixed by #129

Comments
@pr4deepr
Contributor

pr4deepr commented Sep 4, 2024

Hi @haesleinhuepf
I was going through the preprint, and one thought I had was grouping the test cases under categories. We get an overall view of how well LLMs perform, but lose the granularity of whether LLMs perform better on certain tasks than on others.

For example, with our test cases perhaps grouping can be like:

  • Quantification/measurements: test cases that measure or count
  • Statistical analysis: t-test, pairwise correlation
  • Morphological operations: binary close, skeleton, expand labels

etc..

It may give an idea of where we need more or fewer test cases as well.
I remember you had a preprint on ontologies and standards for bioimage analysis. Perhaps that can be used as a reference.

Cheers
Pradeep

@haesleinhuepf
Owner

Hey @pr4deepr ,

great idea! This categorization is obviously a subjective task. We could automate it and make it more objective using an LLM, i.e. a text-classification LLM. Do you by chance have experience with this?

Cheers,
Robert

@pr4deepr
Contributor Author

pr4deepr commented Sep 5, 2024

No, I do not.
Would you put it through ChatGPT, for example, to get a first pass?

This is what I got when I copied the README in the main repo containing the descriptions of the current test cases and used this prompt in ChatGPT (GPT-4o):

I have some python functions and each of them perform a specific operation in bioimage analysis. Classify them into categories based on their function and where they will fit in the image analysis pipeline.

1. Image Preprocessing
These functions focus on preparing the image by modifying its format, reducing noise, or performing other transformations to enhance analysis.

rgb_to_grey_image_transform
rotate_image_by_90_degrees
crop_quarter_image
transpose_image_axes
remove_noise_edge_preserving
subtract_background_tophat
mask_image
remove_labels_on_edges
remove_small_labels
expand_labels_without_overlap

2. Image Enhancement
Functions that enhance specific features of the image, such as sharpening edges or enhancing contrast.

detect_edges
convolve_images
deconvolve_image
remove_noise_edge_preserving

3. Segmentation
These functions segment images into meaningful regions or objects, usually through intensity-based methods.

apply_otsu_threshold_and_count_postiive_pixels
region_growing_segmentation
workflow_segmentation_counting
workflow_watershed_segmentation_correction_measurement
workflow_segmentation_measurement_summary
workflow_segment_measure_umap

4. Morphological Operations
These functions apply morphological transformations to binary images for cleaning or refining shapes.

binary_closing
binary_skeleton
label_binary_image_and_count_labels
label_sequentially
remove_labels_on_edges
remove_small_labels

5. Quantification and Measurement
Functions that quantify different properties of labeled regions or objects in the image.

convex_hull_measure_area
count_number_of_touching_neighbors
measure_aspect_ratio_of_regions
measure_intensity_of_labels
measure_intensity_over_time
measure_mean_image_intensity
measure_pixel_count_of_labels
measure_properties_of_regions
extract_surface_measure_area
measure_properties_of_regions
count_objects_over_time
count_overlapping_regions

6. Feature Extraction
Functions that extract geometrical or intensity features from the image.

fit_circle
radial_intensity_profile
convex_hull_measure_area
measure_aspect_ratio_of_regions

7. File I/O
Functions that handle loading, saving, or listing files.

list_image_files_in_folder
open_image_read_voxel_size
open_image_return_dimensions
open_nifti_image
open_zarr

8. Statistical Analysis
Functions for performing statistical analysis on datasets related to the images.

bland_altman
t_test
pair_wise_correlation_matrix
mean_std_column
mean_squared_error

9. Pipeline/Workflow Automation
These functions automate a sequence of steps, combining multiple image processing operations.

workflow_batch_process_folder_count_labels
workflow_batch_process_folder_measure_intensity
workflow_segment_measure_umap
workflow_segmentation_measurement_summary
workflow_watershed_segmentation_correction_measurement

10. Miscellaneous
Functions that don't fit directly into other categories but are useful for specific tasks.

return_hello_world

@haesleinhuepf
Owner

haesleinhuepf commented Sep 5, 2024

Awesome! I couldn't have done better ☺️ Now what do we do with this? Shall we preserve it somewhere in the repository?

@pr4deepr
Contributor Author

pr4deepr commented Sep 5, 2024

I'm creating a branch with the new categories.
The plan is to save this as a YAML file.

There are errors and repetitions above, so it needs cleaning up as well.

https://github.com/pr4deepr/human-eval-bia/tree/function_categorize

@haesleinhuepf
Owner

repetitions above

Well, I think test-cases can be in multiple categories.

@pr4deepr
Contributor Author

pr4deepr commented Sep 5, 2024

Good point

@pr4deepr
Contributor Author

pr4deepr commented Sep 6, 2024

So, I've done an initial pass.
[image: benchmark results grouped by category]

Interesting findings:

  • Morphological operations: GPT-4o > Claude 3.5
  • Feature extraction: Claude 3.5 > GPT-4o
  • Statistical analysis: Llama 3 == GPT-3.5-turbo > Claude 3.5 and GPT-4o

The categorisation is of course important and, if not done properly, can misrepresent the results.

The function categorisation can be found here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/create_function_category_yaml.ipynb, which saves it as a YAML file.

I can create a separate notebook for the data processing and graphing, as it's currently here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/summarize_by_case.ipynb

@pr4deepr
Contributor Author

pr4deepr commented Sep 6, 2024

Happy to create a PR, but wasn't sure if it should target main.

@haesleinhuepf
Owner

Yes! I certainly need such a figure for talks, because showing the blue table for all test-cases doesn't fit on a slide. It could also be in the paper... Curious what @tischi says about this figure:

[attached figure]

What I'm a bit concerned about is the static list of categories in the other notebook. It could be a pain to maintain mid/long term. Would it be possible to put them in a dataframe, and add some code that warns if a test-case is in no category? Or, even better, code that uses GPT-4o to categorize test-cases that are in no category and then adds them to the dataframe?
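
A minimal sketch of how that could look, assuming the mapping lives in a pandas DataFrame with hypothetical columns `test_case` and `categories`, and using the OpenAI Python client; the model name, prompt, and category list are placeholders, not anything we've agreed on:

```python
# Sketch only: warn about uncategorized test-cases and ask GPT-4o for a suggestion.
# `categories_df` (columns "test_case", "categories") and `all_test_cases` are
# hypothetical names, as is the exact prompt.
import pandas as pd
from openai import OpenAI

# Placeholder category list, e.g. taken from the ChatGPT pass above.
CATEGORIES = ["Image Preprocessing", "Segmentation", "Morphological Operations",
              "Quantification and Measurement", "Statistical Analysis",
              "Pipeline/Workflow Automation", "Other"]

def categorize_missing(all_test_cases, categories_df, model="gpt-4o"):
    client = OpenAI()
    known = set(categories_df["test_case"])
    rows = []
    for name in all_test_cases:
        if name in known:
            continue
        print(f"Warning: '{name}' has no category yet, asking {model} ...")
        prompt = (f"Classify the bioimage-analysis function '{name}' into one or more "
                  f"of these categories: {', '.join(CATEGORIES)}. "
                  "Answer with a comma-separated list of category names only.")
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        rows.append({"test_case": name,
                     "categories": [c.strip() for c in
                                    response.choices[0].message.content.split(",")]})
    return pd.concat([categories_df, pd.DataFrame(rows)], ignore_index=True)
```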

@jkh1
Collaborator

jkh1 commented Sep 6, 2024

Regarding categorization, why not require some metadata tag to be present with each submitted test case? Maybe it's too late for this but we could add it to existing test cases as the numbers seem manageable. Doing it with an LLM would anyway require manual review of the outcome.

@haesleinhuepf
Owner

haesleinhuepf commented Sep 6, 2024

As yet, we don't have any infrastructure for handling metadata of test-cases. I was hoping to fully automate this, so that only minimal manual curation is necessary. In an earlier discussion, categorizing code by its complexity was also discussed. No matter how we do these things, I'd love to have a semi-automatic solution with minimal code/infrastructure to maintain.

@pr4deepr
Contributor Author

pr4deepr commented Sep 6, 2024

We can use the GPT-4o idea, but is there a way to have a seed or something similar to guarantee relatively similar responses? The categories change every time I ask... Or we just need to be really specific about the question we ask GPT.
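
For what it's worth, the OpenAI chat completions API accepts a `seed` parameter in addition to `temperature`; OpenAI describes seeding as best-effort determinism, so it reduces but doesn't fully eliminate the variation. A minimal sketch, with model name and prompt as placeholders:

```python
# Sketch only: pin temperature and seed to make the categorization more repeatable.
# OpenAI documents `seed` as "best effort", so small variations can still occur.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # remove sampling randomness as far as possible
    seed=42,        # best-effort reproducibility across calls
    messages=[{"role": "user",
               "content": "Classify these bioimage analysis functions into categories: ..."}],
)
print(response.choices[0].message.content)
print(response.system_fingerprint)  # changes when the backend model changes
```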

@jkh1
Collaborator

jkh1 commented Sep 6, 2024

The tagging could be left to the author of the test case, given a choice of predefined categories. Then it should be a matter of reading the tags when compiling the results. If classification is automated with an LLM, the outcome is likely to change over time and with the LLM used. I think we would need a deterministic algorithm for this.

@jkh1
Collaborator

jkh1 commented Sep 6, 2024

@pr4deepr Exactly what I thought likely :)

@haesleinhuepf
Owner

Ok, I'll leave the decision about this to you guys. Whatever works :-)

@pr4deepr
Contributor Author

pr4deepr commented Sep 6, 2024

I'm happy with the solution from @jkh1, i.e. having a few tags and getting the authors of new test cases to put those tags in their functions. We can have a few different tags for each category.

This could be a requirement when submitting a new test case.

For existing functions, perhaps myself and @jkh1 could

  • decide on categories & tags
  • add them to existing test cases and decide where to define them
  • submit a PR
  • in the same PR, once we've agreed on tags, change the PR template to reflect this

Cheers
Pradeep

@haesleinhuepf
Owner

put those tags in their functions

Can you give an example of how this could look?

@pr4deepr
Contributor Author

pr4deepr commented Sep 7, 2024

Either in the functions or in each notebook. I need to look at the code first. Will update here.

@pr4deepr
Contributor Author

pr4deepr commented Sep 9, 2024

Upon looking at the code again, I think we'll want to minimize modifications to existing test functions and avoid creating per-case YAML files at this point.

I propose we keep all the categorisation information in a single YAML file with:

  • each function name as a key
  • the categories as values

The categories can be:

  1. file_i_o
  2. image_preprocessing
  3. morphological_operations
  4. segmentation
  5. feature_extraction
  6. statistical_analysis
  7. workflow_automation
  8. other

example YAML file:

create_umap:
  - feature_extraction

workflow_segment_measure_umap:
  - segmentation
  - workflow_automation
  - feature_extraction

I'm happy to go through the existing test cases and create this YAML file.

When a test case PR is submitted, the YAML file will have to be modified to add the new function and its categories. The PR template will need to be modified accordingly.

If the need arises we can expand the categories, but I feel like this should cover it.
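
To illustrate how the tags could be read when compiling the results (cf. summarize_by_case.ipynb), here is a minimal sketch; the file name `function_categories.yaml` and the results columns `test_case`, `model` and `pass_rate` are hypothetical placeholders:

```python
# Sketch only: group per-test-case results by category via the proposed YAML mapping.
# File name and results columns are placeholders.
import pandas as pd
import yaml

with open("function_categories.yaml") as f:
    categories = yaml.safe_load(f)  # e.g. {"create_umap": ["feature_extraction"], ...}

# One row per (test_case, category); a test case may belong to several categories.
mapping = pd.DataFrame(
    [(case, cat) for case, cats in categories.items() for cat in cats],
    columns=["test_case", "category"],
)

results = pd.read_csv("summarized_results.csv")  # assumed columns: test_case, model, pass_rate
by_category = (results.merge(mapping, on="test_case")
                      .groupby(["category", "model"])["pass_rate"]
                      .mean()
                      .unstack())
print(by_category)
```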

@haesleinhuepf
Owner

haesleinhuepf commented Sep 9, 2024

Yes, great idea!

The PR template will need to be modified.

We can also add some Python code which tests whether all test cases are in this YAML file, e.g. in create_cases.ipynb or as a GitHub workflow.
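
A minimal sketch of such a check, assuming the test-case notebooks live under `test_cases/` and the mapping in `function_categories.yaml` (both names are placeholders to adjust to the repo layout):

```python
# Sketch only: fail if any test-case notebook has no entry in the category YAML.
from pathlib import Path
import yaml

with open("function_categories.yaml") as f:
    categorized = set(yaml.safe_load(f).keys())

notebooks = {p.stem for p in Path("test_cases").glob("*.ipynb")}

missing = sorted(notebooks - categorized)
assert not missing, f"Test cases without a category: {missing}"
```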

@jkh1
Collaborator

jkh1 commented Sep 9, 2024

Sounds good to me. My initial idea was to use notebook tags, but I realized this may be more complicated to get at.

@pr4deepr
Contributor Author

pr4deepr commented Sep 9, 2024

I've made the necessary changes with commit history here:

https://github.com/pr4deepr/human-eval-bia/commits/function_categorize/

If you are happy with this, I can open a PR. Let me know which branch you'd prefer.

@haesleinhuepf
Owner

Awesome @pr4deepr , thanks for working on this! Yes, please send a PR!

@pr4deepr pr4deepr mentioned this issue Sep 10, 2024