Guide on running evaluations #3

malteos opened this issue Jun 27, 2022 · 0 comments
Labels: documentation (Improvements or additions to documentation)
Follow this guide to run the evaluation framework with all the German tasks provided by this repo.

Install

# create fresh conda environment
conda create -n lm-evaluation-harness python=3.8
conda activate lm-evaluation-harness

# clone repo
git clone https://github.com/OpenGPTX/lm-evaluation-harness.git
cd lm-evaluation-harness

# change to `german` branch
git checkout german

# install dependencies
pip install -r requirements.txt

# set environment variables (optional)
export DATASETS_DIR="/data/datasets"  # TODO: replace with your own path
export CUDA_VISIBLE_DEVICES=1         # TODO: change to your GPU id

export TRANSFORMERS_CACHE="${DATASETS_DIR}/transformers_cache"
export HF_DATASETS_CACHE="${DATASETS_DIR}/hf_datasets_cache"

export HF_DATASETS_OFFLINE=0
export TRANSFORMERS_OFFLINE=0
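Once a first online run has populated the caches under `${DATASETS_DIR}`, later runs can be forced to use only local data. These are standard Hugging Face environment variables; whether your cluster needs them depends on its network setup:

```shell
# after the caches are populated, switch Hugging Face libraries to offline mode
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```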

Run evaluations

Evaluate all German tasks on gpt2-xl-wechsel-german (HF implementation).

# gpt2 @ wikitext (HF PPL wikitext-2-raw-v1 = 69.18, HF PPL wikitext-2-v1 = 76.25, expected PPL = 29.41; the paper reports 37.50 for wikitext-103)
python main.py  --model gpt2 --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/gpt2',subfolder='' --tasks wikitext --no_tokenizer_check --no_cache

|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|37.3698|   |      |
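The `word_perplexity` value above is the exponentiated negative total log-likelihood, normalized by the number of words rather than tokens. A minimal sketch of that computation (the helper name and the toy numbers are illustrative, not taken from the harness):

```python
import math

def word_perplexity(loglikelihoods, num_words):
    """Exponentiated negative log-likelihood per word."""
    return math.exp(-sum(loglikelihoods) / num_words)

# toy example: total log-likelihood of -35.5 nats over 10 words
print(word_perplexity([-12.4, -8.1, -15.0], 10))  # ~34.81
```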

# gpt2 @ lambada (paper reports: 35.13 PPL, 45.99 ACC)
| Task  |Version|Metric| Value |   |Stderr|
|-------|------:|------|------:|---|-----:|
|lambada|      0|ppl   |40.0554|±  |1.4881|
|       |       |acc   | 0.3256|±  |0.0065|


# gpt2 @ wechsel_de  (HF PPL = 66.72)
|   Task   |Version|    Metric     |  Value   |   |Stderr|
|----------|------:|---------------|---------:|---|------|
|wechsel_de|      1|word_perplexity|58076.8245|   |      |


# facebook/xglm-564M
# xglm @ wikitext  (HF PPL= 57.93)
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/xglm-564M',subfolder='' --tasks wikitext  --no_tokenizer_check
|  Task  |Version|    Metric     | Value |   |Stderr|
|--------|------:|---------------|------:|---|------|
|wikitext|      1|word_perplexity|31.8644|   |      |

# xglm @ wechsel_de  (HF PPL= 34.05)
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/xglm-564M',subfolder='' --tasks wechsel_de  --no_tokenizer_check
|   Task   |Version|    Metric     | Value  |   |Stderr|
|----------|------:|---------------|-------:|---|------|
|wechsel_de|      1|word_perplexity|179.5915|   |      |
|          |       |byte_perplexity|  2.0521|   |      |
|          |       |bits_per_byte  |  1.0371|   |      |
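The two byte-level metrics in this table are directly related: `bits_per_byte` is the base-2 logarithm of `byte_perplexity`, so the reported values can be cross-checked against each other:

```python
import math

byte_perplexity = 2.0521  # value reported for xglm @ wechsel_de above
bits_per_byte = math.log2(byte_perplexity)
print(round(bits_per_byte, 4))  # ~1.0371, matching the table
```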

# gpt2-xl-wechsel-german (HF trainer = 14.5 PPL)
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/gpt2-xl-wechsel-german',subfolder='' --tasks wechsel_de  --no_tokenizer_check
|   Task   |Version|    Metric     | Value  |   |Stderr|
|----------|------:|---------------|-------:|---|------|
|wechsel_de|      1|word_perplexity|157.9543|   |      |


python main.py \
	--model gpt2 \
	--model_args pretrained=malteos/gpt2-xl-wechsel-german \
	--device 0 --no_tokenizer_check \
	--language de

# with HF fork API
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg',subfolder='' --tasks wechsel_de --limit 10 --no_tokenizer_check

# gpt2-wechsel-german-ds-meg @ wechsel_de (HF PPL = 23.26)
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg',subfolder='' --tasks wechsel_de  --no_tokenizer_check
|   Task   |Version|    Metric     | Value  |   |Stderr|
|----------|------:|---------------|-------:|---|------|
|wechsel_de|      1|word_perplexity|594.3954|   |      |

# gpt2-wechsel-german-ds-meg @ wikitext
python main.py  --model hf-causal --model_args pretrained='/data/datasets/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg',subfolder='' --tasks wikitext --no_tokenizer_check --no_cache
|  Task  |Version|    Metric     |  Value  |   |Stderr|
|--------|------:|---------------|--------:|---|------|
|wikitext|      1|word_perplexity|1620.2171|   |      |

# gpt2-wechsel-german-ds-meg
python main.py \
	--model gpt2 \
	--model_args pretrained=/data/datasets/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg \
	--device 0 --no_tokenizer_check \
	--tasks wechsel_de 

|   Task   |Version|    Metric     | Value  |   |Stderr|
|----------|------:|---------------|-------:|---|------|
|wechsel_de|      1|word_perplexity|594.3951|   |      |
|          |       |byte_perplexity|  2.4220|   |      |
|          |       |bits_per_byte  |  1.2762|   |      |


# bloom-350m @ wikitext

# bloom-350m @ wechsel_de
python main.py \
	--model hf-causal \
	--model_args pretrained=/data/datasets/huggingface_transformers/pytorch/bloom-350m \
	--device 0 --no_tokenizer_check \
	--tasks wechsel_de 
|   Task   |Version|    Metric     |  Value   |   |Stderr|
|----------|------:|---------------|---------:|---|------|
|wechsel_de|      1|word_perplexity|11384.6973|   |      |

python main.py \
	--model gpt2 \
	--model_args pretrained=/data/datasets/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg \
	--device 0 --no_tokenizer_check \
	--tasks wechsel_de --batch_size 512 --limit 200

Arguments:

  • The flag --no_tokenizer_check is required if you want to run the evaluation with a custom / non-English tokenizer.
  • For debugging, add --limit <int> to run the script on only a subset of the test samples.
  • Use --batch_size <int> to set the batch size per GPU.

The output should look as follows:

TODO

To run only a specific set of tasks, execute this command:

python main.py \
	--model gpt2 \
	--model_args pretrained=malteos/gpt2-xl-wechsel-german \
	--device 0 --no_tokenizer_check \
	--tasks wechsel_de,germanquad

You can also evaluate model checkpoints from disk:

# germanquad QA evaluation
python main.py \
	--model gpt2 \
	--model_args pretrained=${DATASETS_DIR}/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg \
	--device 0 --no_cache --no_tokenizer_check --limit 10  \
	--tasks germanquad

# wechsel PPL evaluation
python main.py \
	--model gpt2 \
	--model_args pretrained=${DATASETS_DIR}/huggingface_transformers/pytorch/gpt2-wechsel-german-ds-meg \
	--device 0 --no_cache --no_tokenizer_check \
	--tasks wechsel_de --limit 2000 

Run English GPT2:

# should yield 37.50 PPL
python main.py \
	--model gpt2 \
	--model_args pretrained=${DATASETS_DIR}/huggingface_transformers/pytorch/gpt2 \
	--device 0 --no_cache  \
	--tasks wikitext
@malteos added the documentation label on Jun 27, 2022
sasaadi pushed a commit referencing this issue on Aug 25, 2022: Add `ROUGE` metric to `PromptSourceTask`