Set up the ability to run eval suites #114
Conversation
This is a barely functional wrapper for running "test suites", which are just lists of preconfigured tasks. You can specify the prompt and model. This needs more testing and UI cleanup.
This moves suite config handling code into the library proper instead of the script, and creates a subdir for suite configs.
This is still pretty bare-bones, but it's functional and should be good for automating evals across different models. Basically we can run the same evals we've been running, but with a simpler invocation, and without worrying about copying versions or fewshot parameters the wrong way.
lm_eval/prompts.py
Outdated
```python
# Map prompt template shortnames to version numbers.
PROMPT_CODES = {
    "user": "0.0",
    "jgpt": "0.1",
    "fintan": "0.2",
    "fintan2": "0.2.1",
    "ja-alpaca": "0.3",
    "rinna-sft": "0.4",
    "rinna-bilingual": "0.5",
    "llama2": "0.6",
}
```
Let me know if these names make sense or could be improved.
@polm-stability thank you for this PR!
- Can you add instructions for using suites somewhere? Maybe you can just change the example script in the README to one using suites.
- Can you fix docs/prompt_templates.md?
Good point, docs should be updated now.
Awesome! LGTM!
Generally LGTM 👍, but let me double check one point just in case 🙏
This introduces a style for handling complex prompts, and specifically handles the case of JSLM Beta. The prompt is handled by a function that takes the task name as input, which allows full customization without requiring detailed specification when actually running an eval suite. The style is simple: instead of mapping to a numeric version like 0.2, a prompt shortname can map to a callable that takes the task name, allowing any kind of custom logic. This may not be the simplest or best approach, but it required few changes, keeps everything in one place, and touches nothing else in the code base, so it should be easy to change later if necessary.
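A minimal sketch of what this style could look like; the helper names `jslm_beta_prompt` and `resolve_prompt_code` and the version strings inside them are hypothetical, shown only to illustrate a shortname mapping to a callable rather than a fixed version:

```python
# Hypothetical sketch: a shortname can map to a fixed version string
# or to a callable that picks a version based on the task name.
def jslm_beta_prompt(task_name: str) -> str:
    # Per-task logic; these version choices are illustrative only.
    if task_name.startswith("jsquad"):
        return "0.6"
    return "0.3"

PROMPT_CODES = {
    "user": "0.0",
    "llama2": "0.6",
    # A callable here enables arbitrary custom logic.
    "jslm-beta": jslm_beta_prompt,
}

def resolve_prompt_code(shortname: str, task_name: str) -> str:
    """Resolve a prompt shortname to a concrete version for one task."""
    code = PROMPT_CODES[shortname]
    return code(task_name) if callable(code) else code
```

Because resolution happens per task at lookup time, nothing else in the code base needs to know whether a prompt is a plain version string or a function.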
LGTM, thanks! 👍
This PR includes changes that allow running eval suites with a single command. An example command looks like this:
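The example command itself did not survive extraction. A hypothetical invocation, assuming a `--suite` flag on the existing `main.py` entry point (the real script and flag names may differ), might look like:

```sh
# Hypothetical flags; check the repo's README for the real invocation.
python main.py \
    --model hf-causal \
    --model_args pretrained=stabilityai/japanese-stablelm-base-alpha-7b \
    --prompt llama2 \
    --suite my-suite
```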
The suite is specified as a list of tasks, with versions and fewshot specs, in a config file. Because the spec is in a file, it can be versioned and shared across models, while each model can vary the prompt it uses (as well as args related to loading the model). Prompts are specified using names rather than numbers to make it clear what they refer to and avoid mistakes.
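For illustration, a suite config along these lines (the format and field names are assumptions, not the PR's actual schema) would pin each task to a version and fewshot count:

```yaml
# Hypothetical suite config; schema is illustrative only.
# Versions and fewshot counts live here so the suite can be
# versioned and shared across models.
tasks:
  - name: jsquad
    version: "1.1"
    fewshot: 2
  - name: jcommonsenseqa
    version: "1.1"
    fewshot: 3
```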