
Language Modelling Example #30446

Closed
4 tasks
ajherman opened this issue Apr 24, 2024 · 8 comments
Labels
Examples Which is related to examples in general

Comments

@ajherman

System Info

I'm running the first example in examples/pytorch/language-modelling/README.md and have installed the requirements from requirements.txt. I consistently get the following error (sorry for the screen grab; I have issues with copy/paste in this environment):

[Screenshot, 2024-04-23: KeyError traceback from run_clm.py]

I don't understand why I am getting this KeyError from openai-community/gpt2. Incidentally, when I try to run a script in the same environment that simply has the line model = from_pretrained('openai-community/gpt2'), I do not get any errors. So, it seems like it must be an issue in the example code, i.e. in run_clm.py.
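For reference, a minimal sketch of that standalone check (the bare from_pretrained above is shorthand; this assumes the transformers AutoModel API):

    from transformers import AutoModel

    # Loading the checkpoint directly works in the same environment,
    # which suggests the problem is in run_clm.py's argument handling
    # rather than in the checkpoint itself.
    model = AutoModel.from_pretrained("openai-community/gpt2")
    print(type(model).__name__)  # GPT2Model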

Who can help?

ajherman

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --train_file path_to_train_file \
    --validation_file path_to_validation_file \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

Expected behavior

Should run the example script...

@amyeroberts amyeroberts added the Examples Which is related to examples in general label Apr 24, 2024
@amyeroberts
Collaborator

Hi @ajherman, can you share your running environment? Run transformers-cli env in the terminal and copy-paste the output.

I'm able to run the example from the README without issue on main:

python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm

cc @ArthurZucker @younesbelkada

@ajherman
Author

ajherman commented Apr 24, 2024

[Screenshot, 2024-04-24: output of transformers-cli env]

For the last two points:

  • Yes, I'm using a GPU.
  • I think what I'm using would be considered parallel computing? I'm running on a remote server and, honestly, I'm not completely sure about the details of the setup.

I should also mention that I want to run this script with SLURM, but I get the same error whether I run it through an SBATCH script or run the command directly in the terminal.

@amyeroberts
Collaborator

@ajherman Are you able to run the example I shared, i.e. the one with a public dataset?

@ajherman
Author

Ah, yes, the example script you provided works! Or at the very least, it begins the training loop (it hasn't gotten past 0%, but at least I'm not getting the KeyError anymore).

What does this mean...?

@amyeroberts
Collaborator

Are you by any chance passing --model_type in as an argument? From the error, it looks like the model type is being passed in as openai-community/gpt2 when it should be gpt2.

There are a few places where this is wrong in our docs. I've opened #30480 to fix them.
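For context, a simplified sketch (paraphrased, not the verbatim run_clm.py source) of why a checkpoint name in --model_type produces a KeyError:

    from transformers import CONFIG_MAPPING

    # When training from scratch (no --config_name or --model_name_or_path),
    # run_clm.py instantiates a fresh config from the model type, roughly:
    config = CONFIG_MAPPING["gpt2"]()  # OK: "gpt2" is an architecture key

    # A checkpoint name is not an architecture key, so this raises KeyError:
    config = CONFIG_MAPPING["openai-community/gpt2"]()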

@ajherman
Author

Thank you! Yes, this appears to have solved the problem. As a note, the confusion came from the end of the examples/pytorch/language-modelling/README.md file. In the example command, it passes openai-community/gpt2 as the --model_type argument. I changed it to gpt2, and it now seems to work.

The same command passes the same string to --model_tokenizer, but I did not have to modify that one. Could you give me some insight into how the naming is intended to work? What is the implied meaning of including or not including something like "openai-community" before the model name? Thanks for your help!

@amyeroberts
Collaborator

GPT2 is a bit special. I'll use another model first to explain the general case, and then explain why GPT2 is different.

When loading a model from a checkpoint, e.g. AutoModel.from_pretrained(checkpoint), the checkpoint name has two parts: organization/model-name. The first part is the organization or user the model sits under on the Hub, which can have many models, datasets, and spaces under its domain; the second part points to the specific model checkpoint. For example, in meta-llama/Meta-Llama-3-8B, meta-llama is the organization and Meta-Llama-3-8B is the specific checkpoint.

Another thing to note is the difference between a checkpoint and a model type. In the script, --model_name_or_path refers to a specific checkpoint, which will have this organization/model-name structure (or be a local path). --model_type refers to a general architecture: e.g. all of the Llama 2 models have the model type llama, as recorded in their config, i.e. they can all be loaded into the transformers llama architecture.
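For example, the distinction is visible on any loaded config (a small sketch):

    from transformers import AutoConfig

    # --model_name_or_path takes a checkpoint: organization/model-name or a local path
    config = AutoConfig.from_pretrained("openai-community/gpt2")

    # --model_type takes the architecture identifier stored in that config
    print(config.model_type)  # "gpt2"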

Many years ago, when some models were first added to Hugging Face, we didn't have this standard, and just the model name was used, e.g. bert-base-uncased. The same is true for gpt2. Rather confusingly, gpt2 was not only the checkpoint name but also the model type, unlike bert-base-uncased, whose model type is bert. A few months ago, we updated all the checkpoints on the Hub for these old models to the canonical organization/model-name format, and the examples were updated too. Some of the references for gpt2 were mistakenly changed because of this model name <-> model type clash: model_type was updated to openai-community/gpt2 when it should have been left as gpt2 (the model architecture didn't change, just the checkpoint name on the Hub).

Now, to avoid breaking existing scripts and code, we maintain backwards compatibility: the old checkpoint names (bert-base-uncased, gpt2, ...) still map to the same checkpoints, i.e. I can do AutoModel.from_pretrained('gpt2') even though I should do AutoModel.from_pretrained('openai-community/gpt2').
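A quick sketch of that backwards compatibility (both calls resolve to the same weights):

    from transformers import AutoModel

    legacy = AutoModel.from_pretrained("gpt2")                      # old name, still works
    canonical = AutoModel.from_pretrained("openai-community/gpt2")  # preferred form
    assert type(legacy) is type(canonical)  # both load as GPT2Model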

So, you can't change --model_type to openai-community/gpt2, as no model architecture with that name exists (it's a checkpoint name), but you can still do --model_type gpt2 --model_name_or_path gpt2 because gpt2 is an old model name and this backwards-compatible behaviour is maintained.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Jun 2, 2024