Categorise test cases #129

pr4deepr · 2024-09-10T05:48:32Z

This PR contains:

- a new test-case for the benchmark
  - I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
- new dependencies in requirements.txt
  - The environment.yml file was updated using the command conda env export > environment.yml
- new generator-functions allowing to sample from other LLMs
- new samples (sample_....jsonl files)
- new benchmarking results (..._results.jsonl files)
- documentation update
- bug fixes

Related github issue (if relevant): closes #112

Short description:
Group test cases into categories, making it easier to understand LLM performance on the benchmark.

How do you think this will influence the benchmark results?

Makes it easier to understand and benchmark models. Helps identify areas where we need more test cases.
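To illustrate how grouping might surface weak or thin areas, here is a minimal sketch that summarises results per category. The function name `summarise_by_category`, the category names, the test-case names, and the pass rates are all hypothetical and not taken from the benchmark; the actual plotting code in this PR may work differently.

```python
from statistics import mean


def summarise_by_category(categories, pass_rates):
    """Return {category: (number_of_test_cases, mean_pass_rate)}.

    categories: dict mapping a category name to a list of test-case names.
    pass_rates: dict mapping a test-case name to a pass rate between 0 and 1.
    """
    summary = {}
    for category, test_cases in categories.items():
        # Only average test cases that actually have a result.
        rates = [pass_rates[name] for name in test_cases if name in pass_rates]
        summary[category] = (len(test_cases), mean(rates) if rates else None)
    return summary


# Hypothetical example data, for illustration only.
example_categories = {
    "file_io": ["example_read_file", "example_list_files"],
    "filtering": ["example_denoise"],
    "statistics": [],
}
example_pass_rates = {
    "example_read_file": 0.25,
    "example_list_files": 0.75,
    "example_denoise": 1.0,
}

summary = summarise_by_category(example_categories, example_pass_rates)
for category, (count, rate) in summary.items():
    print(category, count, rate)
```

A category with few test cases or a low mean pass rate would be a candidate for additional test cases.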

Why do you think it makes sense to merge this PR?

As above
- Added a categorise_functions YAML file
- Added a check in create_test_cases to verify that all test-case functions are present in the YAML file
- Updated the code to plot by category
- Updated the PR template; the wording may need changing
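The kind of check described above could look roughly like the following sketch. It assumes the YAML file maps each category name to a list of test-case names; that layout, the function names `find_uncategorised` and `check_all_categorised`, and the example data are assumptions for illustration, not the code in this PR. A plain dict stands in for the loaded YAML so the sketch has no dependencies.

```python
def find_uncategorised(test_case_names, categories):
    """Return the test cases that appear in no category, sorted by name."""
    categorised = {name for names in categories.values() for name in names}
    return sorted(set(test_case_names) - categorised)


def check_all_categorised(test_case_names, categories):
    """Raise ValueError if any test case is missing from the categories."""
    missing = find_uncategorised(test_case_names, categories)
    if missing:
        raise ValueError("Test cases without a category: " + ", ".join(missing))


# Hypothetical stand-in for the parsed YAML file (category -> test-case names).
example_categories = {
    "file_io": ["example_a"],
    "filtering": ["example_b"],
}
# Hypothetical list of test cases found in the benchmark.
example_test_cases = ["example_a", "example_b", "example_c"]

missing = find_uncategorised(example_test_cases, example_categories)
print(missing)
```

Failing loudly when a test case has no category keeps the per-category plots from silently omitting newly added test cases.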

ian-coccimiglio · 2024-09-10T20:27:08Z

Interesting work here! This inspires me to look into my next set of test cases on "why are all these models failing at file I/O"

haesleinhuepf · 2024-11-21T13:09:28Z

Thanks a lot @pr4deepr !

pr4deepr added 7 commits September 5, 2024 16:15

creating new categories

cc70493

ensured all tasks are in categories. Plot results

ae8e242

add categorise_functions yaml

d91385f

Code to check if all test cases in yaml file

8a6b51b

removed old files

48b164f

read new yaml file and plot model performance by category

de3de62

updated pull request template

c2bf2ae

haesleinhuepf added 10 commits October 11, 2024 09:59

WIP: sample llama3.1 405b

d648af4

sampling with streaming API to prevent disconnect because of timeout

e50a9e0

sampled llama 3.1 405B (in two runs)

93c4819

evaluated samples, updated plots

2a82ab6

Merge pull request haesleinhuepf#139 from haesleinhuepf/llama3.1-405b

3bd076a

Llama3.1 405b

sampled claude-3-5-sonnet-20241022

b41fc01

evaluated claude-3-5-sonnet-20241022

488f27a

evaluated claude-3-5-sonnet-20241022

116b165

redraw figures

857d0fd

Merge pull request haesleinhuepf#143 from haesleinhuepf/benchmark_son…

6157140

…net_20241022 Benchmark claude-3-5-sonnet-20241022

haesleinhuepf changed the base branch from main to development-collecting-new-test-cases November 21, 2024 13:02

haesleinhuepf added 2 commits November 21, 2024 14:08

Merge branch 'main' into pr/129

9bee740

reran notebook

db2e3fd

haesleinhuepf merged commit d4ab9f8 into haesleinhuepf:development-collecting-new-test-cases Nov 21, 2024

haesleinhuepf mentioned this pull request Nov 21, 2024

Collection of new use-cases and bug-fixes #93

Open

9 tasks
