grouping test cases and categorisation #112
Hey @pr4deepr , great idea! This categorization is obviously a subjective task. We could automate this and make it more objective using an LLM; a text-classification LLM. Do you by chance have experience with this? Cheers,
No, I do not. I copied the readme in the main repo containing the descriptions of the current test cases into ChatGPT (GPT-4o) and asked: "I have some python functions and each of them performs a specific operation in bioimage analysis. Classify them into categories based on their function and where they will fit in the image analysis pipeline." It answered:

1. Image Preprocessing: rgb_to_grey_image_transform
2. Image Enhancement: detect_edges
3. Segmentation: apply_otsu_threshold_and_count_postiive_pixels
4. Morphological Operations: binary_closing
5. Quantification and Measurement: convex_hull_measure_area
6. Feature Extraction: fit_circle
7. File I/O: list_image_files_in_folder
8. Statistical Analysis: bland_altman
9. Pipeline/Workflow Automation: workflow_batch_process_folder_count_labels
10. Miscellaneous: return_hello_world
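If we wanted to automate this instead of pasting into the chat interface, something along these lines might work; a minimal sketch assuming the `openai` Python client, where the prompt wording, model name, and helper function are my own placeholders rather than anything from the repo:

```python
# Hypothetical sketch: classify a test-case function with GPT-4o.
# Assumes the `openai` package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

CATEGORIES = [
    "Image Preprocessing", "Image Enhancement", "Segmentation",
    "Morphological Operations", "Quantification and Measurement",
    "Feature Extraction", "File I/O", "Statistical Analysis",
    "Pipeline/Workflow Automation", "Miscellaneous",
]

def classify_function(name: str, description: str) -> str:
    """Ask the model to pick one category for a single test-case function."""
    prompt = (
        "Classify the following bioimage analysis function into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n"
        f"Function: {name}\nDescription: {description}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. classify_function("binary_closing", "Applies morphological closing to a binary image.")
```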
Awesome! I couldn't have done better
I'm creating a branch with the new categories. There are errors and repetitions above, so I need to clean it up as well: https://github.com/pr4deepr/human-eval-bia/tree/function_categorize
Well, I think test-cases can be in multiple categories.
Good point
So, I've done an initial pass. Interesting findings:
The categorisation of course is important and, if not done properly, can misrepresent the results. The function categorisation can be found here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/create_function_category_yaml.ipynb, which saves it as a yaml file. I can create a separate notebook for the data processing and graphing, as it's currently here: https://github.com/pr4deepr/human-eval-bia/blob/function_categorize/demo/summarize_by_case.ipynb
Happy to create a PR, but wasn't sure if it should be to main.
Yes! I certainly need such a figure for talks, because showing the blue table for all test-cases doesn't fit on a slide. It could also be in the paper... Curious what @tischi says about this figure. What I'm a bit concerned about is the static list of categories in the other notebook. It could be a pain to maintain this mid/long term. Would it be possible to put them in a dataframe, and add some code that warns if a test-case is in no category? Or even better, code that uses gpt4-o to categorize test-cases that are in no category and then adds them to the dataframe?
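Something like this is what I have in mind; a rough sketch, where the dataframe columns and the way test-case names are collected are assumptions:

```python
# Hypothetical sketch: warn about test-cases that are in no category.
# Assumes a DataFrame `categories_df` with columns "test_case" and "category",
# and an iterable `all_test_cases` of test-case names collected from the notebooks.
import warnings
import pandas as pd

def warn_uncategorized(all_test_cases, categories_df: pd.DataFrame) -> list:
    categorized = set(categories_df["test_case"])
    missing = sorted(set(all_test_cases) - categorized)
    for name in missing:
        warnings.warn(f"Test case '{name}' is not assigned to any category.")
    return missing
```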
Regarding categorization, why not require some metadata tag to be present with each submitted test case? Maybe it's too late for this, but we could add it to existing test cases as the numbers seem manageable. Doing it with an LLM would anyway require manual review of the outcome.
Yet, we don't have any infrastructure for handling metadata of test-cases. I was hoping to fully automate this, so that only minimal manual curation is necessary. In an earlier discussion, categorizing code depending on its complexity was also discussed. No matter how we do these things, I'd love to have a semi-automatic solution with minimal code/infrastructure to maintain.
We can use the GPT4-o idea, but is there a way to have a seed or something similar to guarantee relatively similar responses? The categories change every time I ask... Or we just need to be really specific about the question we ask GPT.
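E.g. something like this might help, though I haven't checked how stable it is in practice; a sketch assuming the `openai` Python client, where `temperature=0` and a fixed `seed` reduce, but don't fully eliminate, run-to-run variation:

```python
# Hypothetical sketch: request more reproducible classifications.
from openai import OpenAI

client = OpenAI()

def classify_reproducibly(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # always pick the most likely tokens
        seed=42,        # best-effort determinism across runs
    )
    return response.choices[0].message.content.strip()
```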
The tagging could be left to the author of the test case, given a choice of predefined categories. Then it should be a matter of reading the tags when compiling the results. If classification is automated with an LLM, the outcome is likely to change over time and with the LLM used. I think we would need a deterministic algorithm for this.
@pr4deepr Exactly what I thought likely :)
Ok, I leave the decision about this to you guys. Whatever works :-)
I'm happy with the solution from @jkh1 , i.e., having a few tags and getting the author of new test cases to put those tags in their functions. We can have a few different tags for each category. This could be a requirement when submitting a new test case. For existing functions, perhaps myself and @jkh1 could go through and tag them.
Cheers
Can you give an example of how this could look?
Either in the functions or in each notebook. I need to look at the code first. Will update it here.
Upon looking at the code again, I think we'll want to minimize modifications to existing test functions and avoid creating yaml files for individual cases at this point. I propose we have all the categorisation information in a single yaml file with:
The categories can be:
example yaml file:
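Something along these lines; the file name, field layout, and the category assignments below are just placeholders to illustrate the structure, not the final file:

```yaml
# test_case_categories.yaml -- placeholder sketch of the structure
test_cases:
  binary_closing:
    categories:
      - Morphological Operations
  bland_altman:
    categories:
      - Statistical Analysis
  workflow_batch_process_folder_count_labels:
    categories:
      - Pipeline/Workflow Automation
      - Quantification and Measurement
```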
I'm happy to go through existing test cases and create this yaml file. When a test case PR is submitted, the yaml file will have to be modified to add the new function and category. The PR template will need to be modified. If the need arises we can expand the categories, but I feel like this should cover it.
Yes, great idea!
We can also add some python code which tests if all test cases are in this yaml file, e.g. in create_cases.ipynb or as a github workflow.
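A minimal sketch of such a check, assuming the notebooks live in a test_cases/ folder and the yaml has a top-level test_cases mapping like in the placeholder above:

```python
# Hypothetical sketch: fail if any test-case notebook is missing from the category yaml.
# The folder name, yaml file name, and yaml layout are assumptions.
from pathlib import Path
import yaml

def check_all_cases_categorized(cases_dir: str = "test_cases",
                                yaml_path: str = "test_case_categories.yaml") -> None:
    with open(yaml_path) as f:
        categorized = set(yaml.safe_load(f)["test_cases"].keys())
    notebooks = {p.stem for p in Path(cases_dir).glob("*.ipynb")}
    missing = sorted(notebooks - categorized)
    assert not missing, f"Test cases missing from {yaml_path}: {missing}"
```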
Sounds good to me. My initial idea was to use notebook tags, but I realized this may be more complicated to get at.
I've made the necessary changes with commit history here: https://github.com/pr4deepr/human-eval-bia/commits/function_categorize/
If you are happy with this, I can open a PR. Let me know which branch you'd prefer.
Awesome @pr4deepr , thanks for working on this! Yes, please send a PR!
Hi @haesleinhuepf
I was going through the preprint and one thought I had was grouping the test cases under categories. We get an overall view of how well LLMs perform, but lose the granularity of whether LLMs perform better or worse on certain tasks than on others.
For example, with our test cases, perhaps the grouping could be something like:
etc.
It may give an idea of where we need more or fewer test cases as well.
I remember you had a preprint on ontologies and standards for bioimage analysis. Perhaps that can be used as a reference.
Cheers
Pradeep