
Language Modelling Example #30446

Closed
4 tasks
ajherman opened this issue Apr 24, 2024 · 8 comments
Labels
Examples Which is related to examples in general

Comments

@ajherman

System Info

I'm running the first example in examples/pytorch/language-modelling/README.md and have installed the requirements from requirements.txt. I consistently get the following error (sorry for the screen grab; I have issues with copy/paste in this environment):

[Screenshot, 2024-04-23: KeyError traceback from run_clm.py]

I don't understand why I am getting this KeyError from openai-community/gpt2. Incidentally, when I try to run a script in the same environment that simply has the line model = from_pretrained('openai-community/gpt2'), I do not get any errors. So, it seems like it must be an issue in the example code, i.e. in run_clm.py.
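For reference, a minimal sketch of that standalone check (the bare from_pretrained above is shorthand; this assumes the transformers AutoModel API):

    from transformers import AutoModel

    # Loading the checkpoint directly works in the same environment,
    # which suggests the problem is in run_clm.py's argument handling
    # rather than in the checkpoint itself.
    model = AutoModel.from_pretrained("openai-community/gpt2")
    print(type(model).__name__)  # GPT2Model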

Who can help?

ajherman

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

Expected behavior

Should run the example script...

@amyeroberts amyeroberts added the Examples Which is related to examples in general label Apr 24, 2024
@amyeroberts
Collaborator

Hi @ajherman, can you share your running environment? Run transformers-cli env in the terminal and copy-paste the output.

I'm able to run the example from the README without issue on main:

python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

cc @ArthurZucker @younesbelkada

@ajherman
Author

ajherman commented Apr 24, 2024

[Screenshot, 2024-04-24: output of transformers-cli env]

For the last two points:

  • Yes, I'm using a GPU.
  • I think what I'm using would be considered parallel computing? I'm running on a remote server and, honestly, I'm not completely sure about the details of the setup.

I should also mention that I want to run this script with SLURM, but I get the same error whether I run it through an SBATCH script or run the command directly in the terminal.

@amyeroberts
Collaborator

@ajherman Are you able to run the example I shared, i.e. the one with a public dataset?

@ajherman
Author

Ah, yes, the example script you provided works! Or at the very least, it begins the training loop (it hasn't gotten past 0%, but at least I'm not getting the KeyError anymore).

What does this mean...?

@amyeroberts
Collaborator

Are you by any chance passing --model_type in as an argument? From the error, it looks like the model type is being passed in as openai-community/gpt2 when it should be gpt2.

There are a few places where this is wrong in our docs. I've opened #30480 to fix them.
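For context, a simplified sketch (paraphrased, not the verbatim run_clm.py source) of why a checkpoint name in --model_type produces a KeyError:

    from transformers import CONFIG_MAPPING

    # When training from scratch (no --config_name or --model_name_or_path),
    # run_clm.py instantiates a fresh config from the model type, roughly:
    config = CONFIG_MAPPING["gpt2"]()  # OK: "gpt2" is an architecture key

    # A checkpoint name is not an architecture key, so this raises KeyError:
    config = CONFIG_MAPPING["openai-community/gpt2"]()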

@ajherman
Author

Thank you! Yes, this appears to have solved the problem. As a note, the confusion came from the end of the examples/pytorch/language-modelling/README.md file. In the example command, it passes openai-community/gpt2 as the --model_type argument. I changed it to gpt2, and it now seems to work.

The same command passes the same string to --model_tokenizer, but I did not have to modify that one. Could you give me some insight into how the naming is intended to work? What is the implied meaning of including or not including something like "openai-community" before the model name? Thanks for your help!

@amyeroberts
Collaborator

GPT2 is a bit special. I'll use another model first to explain the general case, and then explain why GPT2 is different.

When loading a model from a checkpoint, e.g. AutoModel.from_pretrained(checkpoint), the checkpoint name has two parts: organization/model-name. The first part is the organization or user the model sits under on the Hub, which can have many models, datasets, and spaces under its domain; the second part points to the specific model checkpoint. For example, in meta-llama/Meta-Llama-3-8B, meta-llama is the organization and Meta-Llama-3-8B is the specific checkpoint.

Another thing to note is the difference between a checkpoint and a model type. In the script, --model_name_or_path refers to a specific checkpoint, which will have this organization/model-name structure (or be a local path). --model_type refers to a general architecture: e.g. all of the Llama 2 models have the model type llama, as recorded in their config, i.e. they can all be loaded into the transformers llama architecture.
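For example, the distinction is visible on any loaded config (a small sketch):

    from transformers import AutoConfig

    # --model_name_or_path takes a checkpoint: organization/model-name or a local path
    config = AutoConfig.from_pretrained("openai-community/gpt2")

    # --model_type takes the architecture identifier stored in that config
    print(config.model_type)  # "gpt2"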

Many years ago, when some models were first added to Hugging Face, we didn't have this standard, and just the model name was used, e.g. bert-base-uncased. The same is true for gpt2. Rather confusingly, gpt2 was not only the checkpoint name but also the model type, unlike bert-base-uncased, whose model type is bert. A few months ago, we updated all the checkpoints on the Hub for these old models to the canonical organization/model-name format, and the examples were updated too. Some of the references for gpt2 were mistakenly changed because of this model name <-> model type clash: model_type was updated to openai-community/gpt2 when it should have been left as gpt2 (the model architecture didn't change, just the checkpoint name on the Hub).

Now, to avoid breaking existing scripts and code, we maintain backwards compatibility: the old checkpoint names (bert-base-uncased, gpt2, ...) still map to the same checkpoints, i.e. I can do AutoModel.from_pretrained('gpt2') even though I should do AutoModel.from_pretrained('openai-community/gpt2').
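A quick sketch of that backwards compatibility (both calls resolve to the same weights):

    from transformers import AutoModel

    legacy = AutoModel.from_pretrained("gpt2")                      # old name, still works
    canonical = AutoModel.from_pretrained("openai-community/gpt2")  # preferred form
    assert type(legacy) is type(canonical)  # both load as GPT2Model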

So, you can't change --model_type to openai-community/gpt2, as no model architecture with that name exists (it's a checkpoint name), but you can still do --model_type gpt2 --model_name_or_path gpt2 because gpt2 is an old model name and this backwards-compatible behaviour is maintained.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2024