Add code eval #587
Lgtm, just left some comments
add three shot to coding tasks
make 0 and 3 shot for gauntlet, add programming_lite
I used the following YAML to run the new coding evals against MPT-7b-8k and reproduced our previous HumanEval (Python) value of 11.6%:
integrations:
- integration_type: git_repo
git_repo: mosaicml/llm-foundry
git_branch: sam/add-coding-eval
pip_install: -e ".[gpu]"
ssh_clone: false
- integration_type: wandb
project: code-lora
tags:
- peft
- eval
entity: mosaic-ml
command: |
cd llm-foundry/scripts
composer eval/eval.py /mnt/config/parameters.yaml || (echo "Command failed - killing python" && pkill python && exit 1)
run_name: humaneval-mpt-7b-8k
gpu_num: 8
gpu_type: a100_40gb
cluster: r7z2
resumable: false
priority: medium
image: mosaicml/examples:llm-latest
env_variables:
- key: DATABRICKS_HOST
value: redacted
- key: DATABRICKS_TOKEN
value: redacted
- key: CODE_EVAL_DEVICE
value: LOCAL
parameters:
dist_timeout: 6000
seed: 1
max_seq_len: 1024
device_eval_batch_size: 4
precision: amp_fp16
loggers:
wandb: {}
mlflow:
experiment_name: redacted
tracking_uri: databricks
model_name_or_path: mosaicml/mpt-7b-8k
models:
-
model_name: ${model_name_or_path}
model:
name: hf_causal_lm
pretrained_model_name_or_path: ${model_name_or_path}
init_device: cpu
pretrained: true
tokenizer:
name: ${model_name_or_path}
kwargs:
model_max_length: ${max_seq_len}
fsdp_config:
sharding_strategy: FULL_SHARD
mixed_precision: PURE
activation_checkpointing: true
activation_checkpointing_reentrant: false
activation_cpu_offload: false
limit_all_gathers: true
verbose: false
icl_tasks: 'eval/yamls/coding_tasks.yaml'

| Category | Benchmark | Subtask | Accuracy | Number few shot | Model |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
| | human_eval | | 0.115854 | 0-shot | mosaicml/mpt-7b-8k |
| | human_eval_cpp | | 0.0621118 | 0-shot | mosaicml/mpt-7b-8k |
| | human_eval_js | | 0.0487805 | 0-shot | mosaicml/mpt-7b-8k |
| | human_eval_c | | 0.222222 | 0-shot | mosaicml/mpt-7b-8k |

Note that this runs the code eval "locally" (on the same machine as eval.py). We need to verify this works in Lambdas as well.
My repro using @samhavens's YAML produces different results:
@samhavens and I suspected the generate/decode params, and @samhavens found that sampling is enabled for code eval, which introduces nondeterminism.
@samhavens @mcarbin I would still expect it to be deterministic if we are seeding everything properly. If you run on the same hardware, batch size, etc., the data order should be the same. Perhaps we should seed closer to the generate call to make it more deterministic.
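To illustrate the suggestion above about seeding close to the generate call, here is a minimal toy sketch in plain Python. It does not use Composer or the actual eval code; `VOCAB`, `WEIGHTS`, `sample_decode`, and `generate_batch` are hypothetical names, and the fixed token distribution stands in for a real model. The idea is that sampled output stays reproducible when the random generator is re-seeded from the base seed and batch index immediately before sampling, rather than relying on global RNG state that earlier code may have advanced.

```python
import random

# Toy stand-in for a model's next-token distribution (hypothetical values).
VOCAB = ["a", "b", "c", "d"]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def greedy_decode(length):
    """Greedy decoding: always pick the most likely token, so no RNG is involved."""
    best = VOCAB[WEIGHTS.index(max(WEIGHTS))]
    return [best] * length


def sample_decode(length, rng):
    """Sampling: draw tokens from the distribution using the supplied generator."""
    return rng.choices(VOCAB, weights=WEIGHTS, k=length)


def generate_batch(batch_index, base_seed, length=8):
    """Re-seed right before sampling so the output depends only on
    (base_seed, batch_index), not on how much randomness was consumed earlier."""
    rng = random.Random(base_seed * 1_000_003 + batch_index)
    return sample_decode(length, rng)


if __name__ == "__main__":
    # Same seed and batch index give the same sampled tokens on every call.
    print(generate_batch(0, base_seed=1) == generate_batch(0, base_seed=1))
    print(greedy_decode(3))
```

In a real run the equivalent would be seeding the framework's generator per batch; how that maps onto the eval trainer is not shown here.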
The non-determinism was due to a subtle bug in seeding the eval trainer. Fixed with a2c877b.
[Figures: results before and after the fix]
The non-determinism also exposed an issue with the code eval pass@k metric. The fix is WIP.
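For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula in plain Python. It is an illustration of the standard definition, not the Composer implementation discussed in this PR, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed, k: budget of attempts.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # With one sample per problem, pass@1 is simply the fraction of problems solved.
    print(mean_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 1)], k=1))
```

With a single sample per problem and k = 1, the estimate reduces to plain accuracy, which is one reason sampling noise shows up directly in the reported numbers.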
Closes #362
Waiting on PR to composer: mosaicml/composer#2550
@bmosaicml could you rereview please? want to make sure the gauntlet changes are ok with you
Also, please bump the |
Will approve once composer release is out, version is bumped, and CI passes
Lgtm, any chance you could include the time to eval for whichever model/num GPUs you tested this with? It'll be a useful reference once we run >30B models.
…al' into sam/add-coding-eval
For MPT30b, 16xA100-80g is the minimum spec. Here are reference results:
This is a re-do of Rishab's PR #441, but now we have a stable version of Composer.
This has been tested somewhat, but there are issues with nondeterminism. We are also still figuring out how to eval using the secure sandbox on AWS.