Pushing checkpoint to hub looks for tmp directory that is immediately renamed #29399

Closed
2 of 4 tasks
sidnarayanan opened this issue Mar 1, 2024 · 1 comment

sidnarayanan commented Mar 1, 2024

System Info

  • transformers version: 4.38.2
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, but issue should not be GPU-specific
  • Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here's a script that reproduces this problem:

#!/usr/bin/env python3

from datasets import load_dataset, DatasetDict

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

from transformers import TrainingArguments, Trainer


dataset = load_dataset("yelp_review_full")
dataset = DatasetDict({k: v.select(range(100)) for k, v in dataset.items()})

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5
)
small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["test"]


training_args = TrainingArguments(
    output_dir="/tmp/test_trainer",
    push_to_hub=True,
    hub_model_id="test_trainer",
    hub_strategy="checkpoint",
    num_train_epochs=1,
    save_strategy="epoch",
    hub_private_repo=True,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
trainer.train()
breakpoint()

Expected behavior

Hi, I'm running into an issue pushing checkpoints to Hub during training. I think the issue arises in these lines.

Specifically, the staging directory is set in trainer.py, and then passed to self.save_model():

        if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
            logger.warning(
                f"Checkpoint destination directory {output_dir} already exists and is non-empty. "
                "Saving will proceed but saved results may be invalid."
            )
            staging_output_dir = output_dir
        else:
            staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
        self.save_model(staging_output_dir, _internal_call=True)

If the output directory does not exist yet (e.g. at the start of a new training run), a tmp- prefix is prepended to the checkpoint folder name.

With push_to_hub=True, save_model launches a model push job in a new thread. That job looks for the tmp-checkpoint-... directory.

Before that job can finish executing (sometimes even before it starts), L2538 is run:

os.rename(staging_output_dir, output_dir)

This removes the tmp- prefix.
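The race can be reproduced in isolation. Below is a standalone sketch (not Trainer code: push_job is a hypothetical stand-in for the background hub-push job) that forces the unlucky interleaving deterministically with an event — the rename happens before the "push" thread ever looks at the staging path:

```python
import os
import tempfile
import threading

# Hypothetical stand-in for the Trainer's background hub-push job: it
# receives the staging path and checks it, as the upload code would.
def push_job(path, renamed, errors):
    renamed.wait()  # force the unlucky interleaving deterministically
    if not os.path.isdir(path):
        errors.append(ValueError(f"Provided path: '{path}' is not a directory"))

base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-13")
final = os.path.join(base, "checkpoint-13")
os.makedirs(staging)

renamed = threading.Event()
errors = []
t = threading.Thread(target=push_job, args=(staging, renamed, errors))
t.start()

os.rename(staging, final)  # the Trainer strips the tmp- prefix here
renamed.set()              # only now does the "push" thread proceed
t.join()
# errors now holds the same ValueError seen in the real run
```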

If I inspect the trainer's push jobs afterwards, I can see the exception:

ipdb> trainer.push_in_progress.jobs
[<Future at 0x7f04be9a3b50 state=finished returned CommitInfo>, <Future at 0x7f04be9a27a0 state=finished raised ValueError>]
ipdb> trainer.push_in_progress.jobs[-1].result()
*** ValueError: Provided path: '/tmp/test_trainer/tmp-checkpoint-13' is not a directory
sidnarayanan commented Mar 1, 2024

This goes away with the branch in #29370, so this issue can be closed if that PR is merged.
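I haven't checked exactly how #29370 resolves this, but as a general sketch, one way to avoid this class of race is to block on any job that references the staging path before renaming it (push_job is again a hypothetical stand-in for the hub upload):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the hub upload; it requires the staging
# directory to still exist when it runs.
def push_job(path):
    if not os.path.isdir(path):
        raise ValueError(f"Provided path: '{path}' is not a directory")
    return "pushed"

base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-13")
final = os.path.join(base, "checkpoint-13")
os.makedirs(staging)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(push_job, staging)
    future.result()  # wait for the job referencing `staging` to finish
    os.rename(staging, final)  # safe: nothing still points at tmp-
```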
