Add code eval #587

Merged: 29 commits into main from sam/add-coding-eval, Sep 26, 2023

samhavens (Contributor) commented Sep 8, 2023

This is a re-do of Rishab's PR #441, but now we have a stable version of Composer.

This has been tested to some extent, but there are issues with nondeterminism. We are also still figuring out how to run the eval in the secure sandbox on AWS.

bmosaicml (Contributor) left a comment

Lgtm, just left some comments

scripts/train/yamls/finetune/mpt-7b-code.yaml (outdated, resolved)
scripts/eval/yamls/tasks.yaml (resolved)
scripts/eval/yamls/eval_gauntlet.yaml (resolved)
add three shot to coding tasks
make 0 and 3 shot for gauntlet, add programming_lite
samhavens (Contributor, Author) commented Sep 12, 2023

I used the following YAML to run the new coding evals against MPT-7b-8k and reproduced our previous HumanEval (Python) value of 11.6%:

integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: sam/add-coding-eval
  pip_install: -e ".[gpu]"
  ssh_clone: false
- integration_type: wandb
  project: code-lora
  tags:
  - peft
  - eval
  entity: mosaic-ml

command: |
  cd llm-foundry/scripts
  composer eval/eval.py /mnt/config/parameters.yaml  || (echo "Command failed - killing python" && pkill python && exit 1)


run_name: humaneval-mpt-7b-8k
gpu_num: 8
gpu_type: a100_40gb
cluster: r7z2

resumable: false
priority: medium

image: mosaicml/examples:llm-latest

env_variables:
  - key: DATABRICKS_HOST
    value: redacted
  - key: DATABRICKS_TOKEN
    value: redacted
  - key: CODE_EVAL_DEVICE
    value: LOCAL

parameters:
  dist_timeout: 6000
  seed: 1
  max_seq_len: 1024
  device_eval_batch_size: 4
  precision: amp_fp16

  loggers:
    wandb: {}
    mlflow:
      experiment_name: redacted
      tracking_uri: databricks

  model_name_or_path: mosaicml/mpt-7b-8k

  models:
  -
    model_name: ${model_name_or_path}
    model:
      name: hf_causal_lm
      pretrained_model_name_or_path: ${model_name_or_path}
      init_device: cpu
      pretrained: true
    tokenizer:
      name: ${model_name_or_path}
      kwargs:
        model_max_length: ${max_seq_len}

  fsdp_config:
    sharding_strategy: FULL_SHARD
    mixed_precision: PURE
    activation_checkpointing: true
    activation_checkpointing_reentrant: false
    activation_cpu_offload: false
    limit_all_gathers: true
    verbose: false

  icl_tasks: 'eval/yamls/coding_tasks.yaml'
| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.115854  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0621118 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0487805 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0.222222  | 0-shot            | mosaicml/mpt-7b-8k |

Note that this runs the code eval "locally" (on the same machine as eval.py). We need to verify this works in Lambdas as well.
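For a sense of scale, an accuracy in the table can be converted back into a count of solved problems. The sketch below assumes the 164-problem HumanEval Python set and one scored generation per problem; neither is stated in this thread, and `implied_solved` is a hypothetical helper, not part of llm-foundry or Composer.

```python
HUMAN_EVAL_PROBLEMS = 164  # assumption: size of the original HumanEval Python set


def implied_solved(accuracy: float, n_problems: int = HUMAN_EVAL_PROBLEMS) -> int:
    """Return the nearest whole number of solved problems for a reported accuracy."""
    return round(accuracy * n_problems)


solved = implied_solved(0.115854)  # human_eval accuracy from the table above
step = 1 / HUMAN_EVAL_PROBLEMS     # accuracy change caused by a single problem
print(f"{solved} problems solved; one problem is worth {step:.4f} accuracy")
```

Under these assumptions a single problem moves the score by about 0.6 percentage points, so small differences between runs correspond to only a handful of problems.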

mcarbin (Contributor) commented Sep 12, 2023

My repro using @samhavens's YAML produces different results:

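| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.0914634 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0.111111  | 0-shot            | mosaicml/mpt-7b-8k |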

@samhavens and I suspected the generate/decode params, and @samhavens found that sampling is enabled for code eval, which introduces nondeterminism:

https://github.com/mosaicml/composer/blob/71cdfad53c5962fed8bcf340bb2bbfe403472678/composer/datasets/in_context_learning_evaluation.py#L1048

dakinggg (Collaborator) commented

@samhavens @mcarbin I would still expect it to be deterministic if we are seeding everything properly. If you run on the same hardware, batch size, etc., the data order should be the same. Perhaps we should seed closer to the generate call to make it more deterministic.
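A minimal sketch of the "seed closer to the generate call" idea, using only Python's standard `random` module as a stand-in for the model's sampler. The functions `sample_tokens` and `generate` are hypothetical illustrations and are not Composer or llm-foundry APIs; the point is only that re-seeding immediately before sampling makes the draws independent of how much randomness was consumed earlier.

```python
import random


def sample_tokens(rng: random.Random, n: int = 5) -> list[int]:
    # Stand-in for sampling-based decoding: draw n token ids from a toy vocabulary.
    return [rng.randrange(50_000) for _ in range(n)]


def generate(seed: int, warmup_draws: int, reseed: bool) -> list[int]:
    rng = random.Random(0)
    # Simulate a varying amount of RNG use before generation
    # (e.g. different setup paths consuming random numbers).
    for _ in range(warmup_draws):
        rng.random()
    if reseed:
        rng.seed(seed)  # re-seed right before sampling
    return sample_tokens(rng)


# With re-seeding, earlier RNG use no longer affects the sampled tokens.
same = generate(1, 0, reseed=True) == generate(1, 100, reseed=True)
# Without it, the samples depend on how many draws happened beforehand.
drifted = generate(1, 0, reseed=False) != generate(1, 100, reseed=False)
print(same, drifted)
```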

mcarbin (Contributor) commented Sep 14, 2023

The non-determinism was due to a subtle bug in seeding the eval trainer. Fixed with a2c877b.

before

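| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.0914634 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0487805 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |

| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.0731707 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0496894 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0243902 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |

| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.109756  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0248447 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0426829 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |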

after

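| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.109756  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0365854 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |

| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.109756  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0365854 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |

| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model              |
|:-----------|:---------------|:----------|-----------:|:------------------|:-------------------|
|            | human_eval     |           |  0.109756  | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_cpp |           |  0.0372671 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_js  |           |  0.0365854 | 0-shot            | mosaicml/mpt-7b-8k |
|            | human_eval_c   |           |  0         | 0-shot            | mosaicml/mpt-7b-8k |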
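
The before/after comparison can be checked mechanically by looking at the max-minus-min spread of each benchmark across the three runs. This is a small sketch over the numbers reported above; `spread` and `is_deterministic` are hypothetical helper names, not functions from the PR.

```python
# Accuracies copied from the three "before" and three "after" runs above.
before = {
    "human_eval": [0.0914634, 0.0731707, 0.109756],
    "human_eval_cpp": [0.0372671, 0.0496894, 0.0248447],
    "human_eval_js": [0.0487805, 0.0243902, 0.0426829],
    "human_eval_c": [0.0, 0.0, 0.0],
}
after = {
    "human_eval": [0.109756, 0.109756, 0.109756],
    "human_eval_cpp": [0.0372671, 0.0372671, 0.0372671],
    "human_eval_js": [0.0365854, 0.0365854, 0.0365854],
    "human_eval_c": [0.0, 0.0, 0.0],
}


def spread(runs: dict[str, list[float]]) -> dict[str, float]:
    """Max-minus-min accuracy per benchmark across repeated runs."""
    return {task: max(vals) - min(vals) for task, vals in runs.items()}


def is_deterministic(runs: dict[str, list[float]]) -> bool:
    """True when every benchmark reports the same accuracy in every run."""
    return all(s == 0.0 for s in spread(runs).values())


print("before:", is_deterministic(before), spread(before))
print("after: ", is_deterministic(after), spread(after))
```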

mcarbin (Contributor) commented Sep 14, 2023

Non-determinism also exposed an issue with the code eval pass@k metric. A fix is in progress.
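
For readers unfamiliar with the metric: pass@k is commonly computed with the unbiased estimator from the original HumanEval paper, 1 - C(n-c, k) / C(n, k), where n is the number of generations per problem and c the number that pass the tests. The sketch below implements that standard formula only; it is not taken from Composer, and the thread does not say what the issue with the metric was.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: generations sampled, c: generations that passed, k: budget being scored.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k generations failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 the estimate reduces to the plain pass rate c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark-level score is the mean of this per-problem value over all problems.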

dakinggg (Collaborator) commented

Closes #362

mcarbin (Contributor) commented Sep 22, 2023

Waiting on PR to composer: mosaicml/composer#2550

@mcarbin mcarbin marked this pull request as ready for review September 25, 2023 19:35
@mcarbin mcarbin self-assigned this Sep 25, 2023
dakinggg (Collaborator) left a comment

@bmosaicml could you re-review please? I want to make sure the gauntlet changes are OK with you.

llmfoundry/utils/builders.py (resolved)
scripts/eval/eval.py (outdated, resolved)
dakinggg (Collaborator) commented Sep 26, 2023

Also, please bump the mosaicml version pin in setup.py to >=0.16.3. Then, once it's released, CI should pass.

dakinggg (Collaborator) left a comment

Will approve once composer release is out, version is bumped, and CI passes

bmosaicml (Contributor) commented

LGTM. Any chance you could include the time to eval for whichever model and number of GPUs you tested this with? It'll be a useful reference once we run >30B models.

mcarbin (Contributor) commented Sep 26, 2023

For MPT30b, 16xA100-80g is the minimum spec. Here are reference results:

Ran mosaicml/mpt-30b eval in: 15220.648153305054 seconds 
Printing complete results for all models 
| Category   | Benchmark      | Subtask   |   Accuracy | Number few shot   | Model            | 
|:-----------|:---------------|:----------|-----------:|:------------------|:-----------------| 
|            | human_eval     |           |  0.143598  | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_cpp |           |  0.0928571 | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_js  |           |  0.0951219 | 0-shot            | mosaicml/mpt-30b | 
|            | human_eval_c   |           |  0.0888889 | 0-shot            | mosaicml/mpt-30b | 
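
To put the logged runtime in more familiar units, the sketch below converts it to hours and GPU-hours. It assumes the run used exactly the 16 GPUs named as the minimum spec, which the log itself does not confirm.

```python
WALL_CLOCK_SECONDS = 15220.648153305054  # from the "Ran mosaicml/mpt-30b eval in" log line
NUM_GPUS = 16  # assumption: the run used the stated minimum spec of 16 A100-80GB GPUs

hours = WALL_CLOCK_SECONDS / 3600
gpu_hours = hours * NUM_GPUS
print(f"{hours:.2f} h wall clock, {gpu_hours:.1f} GPU-hours")
```

Under that assumption the four coding benchmarks took roughly 4.2 hours of wall-clock time, or about 68 GPU-hours, for MPT-30B.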

@mcarbin mcarbin merged commit fd36398 into main Sep 26, 2023
@dakinggg dakinggg mentioned this pull request Oct 5, 2023
@dakinggg dakinggg deleted the sam/add-coding-eval branch October 11, 2023 21:30