add eval readme (#566)
* add eval readme

* modify readme

* modify readme

* doc

* restore eval yaml

* lint

* Update scripts/train/yamls/pretrain/gpt-neo-125m_eval.yaml

Co-authored-by: Daniel King <[email protected]>

* Update scripts/eval/README.md

* Update README.md

---------

Co-authored-by: Daniel King <[email protected]>
bmosaicml and dakinggg authored Sep 13, 2023
1 parent 0fdf43f commit e75cfc9
Showing 2 changed files with 212 additions and 7 deletions.
92 changes: 85 additions & 7 deletions scripts/eval/README.md
@@ -8,15 +8,17 @@ You can evaluate a model by preparing an evaluation YAML following the format of

## Quickstart

### Offline evaluation

To run offline evaluation, download this repo and run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml
```

This will run `EleutherAI/gpt-neo-125m` through the MosaicML Eval Gauntlet, a diverse evaluation suite consisting of over 30 benchmarks. You can update the configuration directly in `hf_eval.yaml`, or override its values with CLI args, such as:

<!--pytest.mark.skip-->
```bash
composer eval/eval.py eval/yamls/hf_eval.yaml \
model_name_or_path=mosaicml/mpt-7b
```

You can also modify which benchmarks are run, and how they are formatted, by editing `tasks.yaml`, and you can change the composite scores and the set of tasks they consist of by editing `eval_gauntlet.yaml`.


### Evaluation during training
To run evaluation during training, download this repo, follow the instructions in `scripts/train/README.md` to perform single-node pre-training, and run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts/train
composer train.py yamls/pretrain/mpt-125m_eval.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
```
You can also modify which benchmarks are run, and how they are formatted, by editing `tasks.yaml`, and you can change the composite scores and the set of tasks they consist of by editing `eval_gauntlet.yaml`. You can also choose either to run the full evaluation or to run each benchmark on only a subset of batches by setting `icl_subset_num_batches`.

----
## In-depth walkthrough

ICL evaluation can be done offline via `scripts/eval/eval.py`, or during training via `scripts/train/train.py`.

In order to do ICL evaluation you must specify the set of benchmarks you'd like to run via the `icl_tasks` key in your eval/training config. `icl_tasks` can either contain the benchmark configs directly, or it can be a file path pointing to a locally accessible YAML config (see `scripts/eval/yamls/icl_tasks.yaml` for an example).


#### ICL task YAML format
Your YAML must have a config section entitled `icl_tasks` specifying the benchmarks to evaluate against. This can either be a list of dictionaries of the form

```yaml
icl_tasks:
-
  # ...
  example_delimiter: "\n"
  prompt_string: ''
```
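For illustration, a single benchmark entry might look like the following. This is only a sketch: the `label`, `dataset_uri`, and other values shown here are placeholders rather than a reproduction of the configs shipped in `scripts/eval/yamls/`.

```yaml
icl_tasks:
- label: lambada_openai            # name used when reporting results
  dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl  # placeholder path
  num_fewshot: [0]                 # one eval run per listed few-shot count
  icl_task_type: language_modeling # type of ICL benchmark
  continuation_delimiter: ' '      # separates each context from its continuation
  example_delimiter: "\n"          # separates few-shot examples
  prompt_string: ''                # prepended to every sample
```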

#### Eval gauntlet YAML format
Your YAML may optionally have a config section entitled `eval_gauntlet` specifying how to aggregate the results (if absent, only the individual benchmark accuracies will be reported). After the tasks listed in the `icl_tasks` config are evaluated, the eval script will use the `eval_gauntlet` config, if specified, to aggregate the individual benchmarks into composite scores.


An `eval_gauntlet` config must specify the list of categories you'd like to generate composite scores for, as well as the list of benchmarks to be included in each category. For each benchmark you need to list the name and the `num_fewshot`; these two values must exactly match the values specified in the `icl_tasks` config. You must also specify the random baseline accuracy for each benchmark.

There are also three flags indicating how to perform the aggregation:
1. `weighting` can either be `EQUAL` (all tasks are weighted equally), `SAMPLE_SZ` (tasks are weighted proportionally to the size of the dataset), or `LOG_SAMPLE_SZ` (tasks are weighted proportionally to the logarithm of the dataset size).
2. `subtract_random_baseline` can either be `true` or `false`. If `true`, the random baseline accuracy is subtracted from the final accuracy before averaging; otherwise the accuracy is averaged in as is.
3. `rescale_accuracy` can either be `true` or `false`. If `true` (and if `subtract_random_baseline` is also `true`), the baseline-subtracted accuracy is rescaled so that its maximum attainable value is 1 before averaging.

An example config is below:
```yaml
eval_gauntlet:
  weighting: EQUAL
  subtract_random_baseline: true
  rescale_accuracy: true
  categories:
  - name: world_knowledge
    benchmarks:
    - name: jeopardy
      num_fewshot: 10
      random_baseline: 0
    - name: mmlu
      num_fewshot: 10
      random_baseline: 0.25
  - name: language_understanding
    benchmarks:
    - name: lambada_openai
      num_fewshot: 0
      random_baseline: 0.0
    - name: hellaswag
      num_fewshot: 10
      random_baseline: 0.25
```

You can either specify your `eval_gauntlet` config directly in your eval/train YAML or via a local path pointing to a YAML containing an `eval_gauntlet` config.


### Offline evaluation

You can run the evaluation script on a model checkpoint via `composer eval/eval.py YOUR_YAML` from the `scripts` directory, or launch it on the MosaicML platform using an MCLI YAML following the format of [`llm-foundry/mcli/mcli-1b-eval.yaml`](https://github.com/mosaicml/llm-foundry/blob/main/mcli/mcli-1b-eval.yaml).

You can use the default `icl_tasks` and `eval_gauntlet` configs or specify your own following the instructions above.
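As a rough sketch, an offline eval YAML loosely modeled on `eval/yamls/hf_eval.yaml` has roughly the following shape; treat the specific keys and values below as illustrative assumptions rather than an exact copy of the shipped file.

```yaml
max_seq_len: 1024
seed: 1
precision: amp_fp16

models:                     # you can list several models to evaluate back to back
- model_name: mosaicml/mpt-7b
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: mosaicml/mpt-7b
    pretrained: true
  tokenizer:
    name: mosaicml/mpt-7b
    kwargs:
      model_max_length: ${max_seq_len}

device_eval_batch_size: 4
icl_tasks: eval/yamls/tasks.yaml
eval_gauntlet: eval/yamls/eval_gauntlet.yaml
```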

### Evaluation during training

You can use ICL evaluation during training by taking an ordinary training YAML and adding `icl_tasks` and `eval_gauntlet` configs. You should also specify `icl_seq_len` in your training YAML, and you can optionally run a truncated version of eval on a random subset of each benchmark by specifying a value for `icl_subset_num_batches`.

An example is given below:
```yaml
icl_tasks: eval/yamls/tasks.yaml # or use tasks_light.yaml
icl_subset_num_batches: 100 # -1, or omit this key entirely, to evaluate on all batches
eval_gauntlet: 'eval/yamls/eval_gauntlet.yaml'
icl_seq_len: 1024
```

For training, we recommend you do not run the full eval gauntlet. Instead, either use `tasks_light.yaml`, which is a subset of the full gauntlet benchmarks, or set `icl_subset_num_batches` to a small number (on the order of 100), which will run each benchmark on only a random sample of `icl_subset_num_batches` batches.

You can use the default `icl_tasks` and `eval_gauntlet` configs or specify your own following the instructions above.

----

@@ -369,3 +445,5 @@ def prep_dataset(out_file):
When formatting samples, `prompt_string` is prepended to the beginning, then `num_fewshot` examples from the dataset are concatenated. Each few-shot example is formatted with its context and continuation separated by the `continuation_delimiter`, and the examples are separated from each other by the `example_delimiter`. Finally, we append the context/query/question/context options of the sample being evaluated, followed by the `continuation_delimiter`.

Thus the structure of each question's preamble is `prompt | few shot examples | context | continuation delimiter`. The continuation (aka choices for MC) is then tokenized separately and the tokens of the preamble and tokens of the continuation are concatenated. It is important to note that if the continuation delimiter has a trailing space, it is stripped and instead prepended to the continuation. Furthermore, if the continuation does not have a leading space, one will be prepended.
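For instance, a minimal sketch with hypothetical delimiter values (not taken from any shipped config) showing how a one-shot sample would be assembled:

```yaml
# Hypothetical settings for illustration only
prompt_string: ''
example_delimiter: "\n"
continuation_delimiter: ' Answer: '  # trailing space is stripped and moved onto the continuation

# Resulting preamble (prompt | few-shot example | eval context | continuation delimiter):
#   "<fewshot context> Answer: <fewshot continuation>\n<eval context> Answer:"
# The continuation is tokenized separately as " <eval continuation>" and its tokens
# are concatenated onto the preamble tokens.
```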

----
127 changes: 127 additions & 0 deletions scripts/train/yamls/pretrain/gpt-neo-125m_eval.yaml
@@ -0,0 +1,127 @@
# Pretrain a gpt-neo-125m style model
# this is NOT a finetuning run

data_local: ./my-copy-c4
data_remote: # If blank, files must be present in data_local
tokenizer_name: EleutherAI/gpt-neo-125M
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: EleutherAI/gpt-neo-125M
  config_overrides:
    # WARNING: if setting `pretrained: true`, `max_position_embeddings` must match the
    # `max_position_embeddings` used during pre-training
    max_position_embeddings: ${max_seq_len}
  pretrained: false  # false: only use the architecture; true: initialize with pretrained weights

# Tokenizer
tokenizer:
  name: ${tokenizer_name}
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val
    shuffle: false
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 6.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 4800ba # ~ 2.5B tokens
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256

# System
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 4
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

icl_tasks: eval/yamls/tasks.yaml # or use tasks_light.yaml
icl_subset_num_batches: 2 # -1, or omit this key entirely, to evaluate on all batches
eval_gauntlet: 'eval/yamls/eval_gauntlet.yaml'
icl_seq_len: 1024

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
# wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 500ba
# save_num_checkpoints_to_keep: 1 # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-125m/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt
