add eval readme (#566)
* add eval readme

* modify readme

* modify readme

* doc

* restore eval yaml

* lint

* Update scripts/train/yamls/pretrain/gpt-neo-125m_eval.yaml

Co-authored-by: Daniel King <[email protected]>

* Update scripts/eval/README.md

* Update README.md

---------

Co-authored-by: Daniel King <[email protected]>
bmosaicml and dakinggg authored Sep 13, 2023
1 parent 0fdf43f commit e75cfc9
Showing 2 changed files with 212 additions and 7 deletions.
92 changes: 85 additions & 7 deletions scripts/eval/README.md
@@ -8,15 +8,17 @@ You can evaluate a model by preparing an evaluation YAML following the format of

## Quickstart

### Offline evaluation

To run offline evaluation, download this repo and run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts
composer eval/eval.py eval/yamls/hf_eval.yaml
```

This will run `EleutherAI/gpt-neo-125m` through the MosaicML Eval Gauntlet, a diverse evaluation suite consisting of over 30 benchmarks. You can update the configuration directly in `hf_eval.yaml`, or override its values with CLI args, such as:

<!--pytest.mark.skip-->
```bash
composer eval/eval.py eval/yamls/hf_eval.yaml \
model_name_or_path=mosaicml/mpt-7b
```

You can also modify which benchmarks are run, and how they are formatted, by editing `tasks.yaml`, and you can change the composite scores and the set of tasks they consist of by editing `eval_gauntlet.yaml`.


### Evaluation during training
To run evaluation during training, download this repo, follow the instructions in `scripts/train/README.md` to perform single-node pre-training, and run the following commands:

<!--pytest.mark.skip-->
```bash
cd llm-foundry/scripts/train
composer train.py yamls/pretrain/mpt-125m_eval.yaml train_loader.dataset.split=train_small eval_loader.dataset.split=val_small
```
You can also modify which benchmarks are run, and how they are formatted, by editing `tasks.yaml`, and you can change the composite scores and the set of tasks they consist of by editing `eval_gauntlet.yaml`. You can also choose either to run the full evaluation or to run each benchmark on only a subset of batches by setting `icl_subset_num_batches`.

----
## In-depth walkthrough

ICL evaluation can be done offline via `scripts/eval/eval.py`, or during training via `scripts/train/train.py`.

In order to do ICL evaluation you must specify the set of benchmarks you'd like to run via the `icl_tasks` key in your eval/training config. `icl_tasks` can either contain the benchmark configs directly, or it can be a file path pointing to a locally accessible YAML config (see `scripts/eval/yamls/icl_tasks.yaml` for an example).


#### ICL task YAML format
Your YAML must have a config section entitled `icl_tasks` specifying the benchmarks to evaluate against. This can either be a list of dictionaries of the form

```yaml
icl_tasks:
-
  # ...
  example_delimiter: "\n"
  prompt_string: ''
```
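For illustration, a single benchmark entry might look like the following. This is only a sketch: the `label`, `dataset_uri`, and other values shown here are placeholders rather than a reproduction of the configs shipped in `scripts/eval/yamls/`.

```yaml
icl_tasks:
- label: lambada_openai            # name used when reporting results
  dataset_uri: eval/local_data/language_understanding/lambada_openai.jsonl  # placeholder path
  num_fewshot: [0]                 # one eval run per listed few-shot count
  icl_task_type: language_modeling # type of ICL benchmark
  continuation_delimiter: ' '      # separates each context from its continuation
  example_delimiter: "\n"          # separates few-shot examples
  prompt_string: ''                # prepended to every sample
```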

#### Eval gauntlet YAML format
Your YAML may optionally have a config section entitled `eval_gauntlet` specifying how to aggregate the results (if absent, only the individual benchmark accuracies will be reported). After the tasks listed in the `icl_tasks` config are evaluated, the eval script will use the `eval_gauntlet` config, if specified, to aggregate the individual benchmarks into composite scores.


An `eval_gauntlet` config must specify the list of categories you'd like to generate composite scores for, as well as the list of benchmarks to be included in each category. For each benchmark you need to list the name and the `num_fewshot`; these two values must exactly match the values specified in the `icl_tasks` config. You must also specify the random baseline accuracy for each benchmark.

There are also three flags indicating how to perform the aggregation:
1. `weighting` can either be `EQUAL` (all tasks are weighted equally), `SAMPLE_SZ` (tasks are weighted proportionally to the size of the dataset), or `LOG_SAMPLE_SZ` (tasks are weighted proportionally to the logarithm of the dataset size).
2. `subtract_random_baseline` can either be `true` or `false`. If `true`, the random baseline accuracy is subtracted from the final accuracy before averaging; otherwise the accuracy is averaged in as is.
3. `rescale_accuracy` can either be `true` or `false`. If `true` (and if `subtract_random_baseline` is also `true`), the baseline-subtracted accuracy is rescaled so that its maximum attainable value is 1 before averaging.

An example config is below:
```yaml
eval_gauntlet:
  weighting: EQUAL
  subtract_random_baseline: true
  rescale_accuracy: true
  categories:
  - name: world_knowledge
    benchmarks:
    - name: jeopardy
      num_fewshot: 10
      random_baseline: 0
    - name: mmlu
      num_fewshot: 10
      random_baseline: 0.25
  - name: language_understanding
    benchmarks:
    - name: lambada_openai
      num_fewshot: 0
      random_baseline: 0.0
    - name: hellaswag
      num_fewshot: 10
      random_baseline: 0.25
```

You can either specify your `eval_gauntlet` config directly in your eval/train YAML or via a local path pointing to a YAML containing an `eval_gauntlet` config.


### Offline evaluation

You can run the evaluation script on a model checkpoint via `composer eval/eval.py YOUR_YAML` from the `scripts` directory, or launch it on the MosaicML platform using an MCLI YAML following the format of [`llm-foundry/mcli/mcli-1b-eval.yaml`](https://github.com/mosaicml/llm-foundry/blob/main/mcli/mcli-1b-eval.yaml).

You can use the default `icl_tasks` and `eval_gauntlet` configs or specify your own following the instructions above.
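As a rough sketch, an offline eval YAML loosely modeled on `eval/yamls/hf_eval.yaml` has roughly the following shape; treat the specific keys and values below as illustrative assumptions rather than an exact copy of the shipped file.

```yaml
max_seq_len: 1024
seed: 1
precision: amp_fp16

models:                     # you can list several models to evaluate back to back
- model_name: mosaicml/mpt-7b
  model:
    name: hf_causal_lm
    pretrained_model_name_or_path: mosaicml/mpt-7b
    pretrained: true
  tokenizer:
    name: mosaicml/mpt-7b
    kwargs:
      model_max_length: ${max_seq_len}

device_eval_batch_size: 4
icl_tasks: eval/yamls/tasks.yaml
eval_gauntlet: eval/yamls/eval_gauntlet.yaml
```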

### Evaluation during training

You can use ICL evaluation during training by taking an ordinary training YAML and adding `icl_tasks` and `eval_gauntlet` configs. You should also specify `icl_seq_len` in your training YAML, and you can optionally run a truncated version of eval on a random subset of each benchmark by specifying a value for `icl_subset_num_batches`.

An example is given below:
```yaml
icl_tasks: eval/yamls/tasks.yaml # or use tasks_light.yaml
icl_subset_num_batches: 100 # -1, or omit this key entirely, to evaluate on all batches
eval_gauntlet: 'eval/yamls/eval_gauntlet.yaml'
icl_seq_len: 1024
```

For training, we recommend you do not run the full eval gauntlet. Instead, either use `tasks_light.yaml`, which is a subset of the full gauntlet benchmarks, or set `icl_subset_num_batches` to a small number (on the order of 100), which will run each benchmark on only a random sample of `icl_subset_num_batches` batches.

You can use the default `icl_tasks` and `eval_gauntlet` configs or specify your own following the instructions above.

----

@@ -369,3 +445,5 @@ def prep_dataset(out_file):
When formatting samples, `prompt_string` is prepended to the beginning, then `num_fewshot` examples from the dataset are concatenated. Each few-shot example is formatted with its context and continuation separated by the `continuation_delimiter`, and the examples are separated from each other by the `example_delimiter`. Finally, we append the context/query/question/context options of the sample being evaluated, followed by the `continuation_delimiter`.

Thus the structure of each question's preamble is `prompt | few shot examples | context | continuation delimiter`. The continuation (aka choices for MC) is then tokenized separately and the tokens of the preamble and tokens of the continuation are concatenated. It is important to note that if the continuation delimiter has a trailing space, it is stripped and instead prepended to the continuation. Furthermore, if the continuation does not have a leading space, one will be prepended.
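For instance, a minimal sketch with hypothetical delimiter values (not taken from any shipped config) showing how a one-shot sample would be assembled:

```yaml
# Hypothetical settings for illustration only
prompt_string: ''
example_delimiter: "\n"
continuation_delimiter: ' Answer: '  # trailing space is stripped and moved onto the continuation

# Resulting preamble (prompt | few-shot example | eval context | continuation delimiter):
#   "<fewshot context> Answer: <fewshot continuation>\n<eval context> Answer:"
# The continuation is tokenized separately as " <eval continuation>" and its tokens
# are concatenated onto the preamble tokens.
```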

----
127 changes: 127 additions & 0 deletions scripts/train/yamls/pretrain/gpt-neo-125m_eval.yaml
@@ -0,0 +1,127 @@
# Pretrain a gpt-neo-125m style model
# this is NOT a finetuning run

data_local: ./my-copy-c4
data_remote: # If blank, files must be present in data_local
tokenizer_name: EleutherAI/gpt-neo-125M
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $RUN_NAME

# Model
model:
  name: hf_causal_lm
  pretrained_model_name_or_path: EleutherAI/gpt-neo-125M
  config_overrides:
    # WARNING: if setting `pretrained: true`, `max_position_embeddings` must match the
    # `max_position_embeddings` used during pre-training
    max_position_embeddings: ${max_seq_len}
  pretrained: false  # false: only use the architecture; true: initialize with pretrained weights

# Tokenizer
tokenizer:
  name: ${tokenizer_name}
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val
    shuffle: false
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 6.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 4800ba # ~ 2.5B tokens
eval_interval: 500ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256

# System
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 4
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

icl_tasks: eval/yamls/tasks.yaml # or use tasks_light.yaml
icl_subset_num_batches: 2 # -1, or omit this key entirely, to evaluate on all batches
eval_gauntlet: 'eval/yamls/eval_gauntlet.yaml'
icl_seq_len: 1024

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
# wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 500ba
# save_num_checkpoints_to_keep: 1 # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-125m/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt
