Pushing checkpoint to hub looks for tmp directory that is immediately renamed #29399

Closed
2 of 4 tasks
sidnarayanan opened this issue Mar 1, 2024 · 1 comment

sidnarayanan commented Mar 1, 2024

System Info

  • transformers version: 4.38.2
  • Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes, but issue should not be GPU-specific
  • Using distributed or parallel set-up in script?: No

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Here's a script that reproduces this problem:

#!/usr/bin/env python3

from datasets import load_dataset, DatasetDict

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

from transformers import TrainingArguments, Trainer


dataset = load_dataset("yelp_review_full")
dataset = DatasetDict({k: v.select(range(100)) for k, v in dataset.items()})

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased", num_labels=5
)
small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["test"]


training_args = TrainingArguments(
    output_dir="/tmp/test_trainer",
    push_to_hub=True,
    hub_model_id="test_trainer",
    hub_strategy="checkpoint",
    num_train_epochs=1,
    save_strategy="epoch",
    hub_private_repo=True,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
trainer.train()
breakpoint()

Expected behavior

Hi, I'm running into an issue pushing checkpoints to Hub during training. I think the issue arises in these lines.

Specifically, the staging directory is set in trainer.py, and then passed to self.save_model():

        if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
            logger.warning(
                f"Checkpoint destination directory {output_dir} already exists and is non-empty. "
                "Saving will proceed but saved results may be invalid."
            )
            staging_output_dir = output_dir
        else:
            staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
        self.save_model(staging_output_dir, _internal_call=True)

If the output directory does not exist yet (e.g. at the start of a new training run), a tmp- prefix is prepended to the checkpoint folder name.

With push_to_hub=True, save_model launches a model push job in a new thread. That job looks for the tmp-checkpoint-... directory.

Before that job can finish executing (sometimes even before it starts), L2538 is run:

os.rename(staging_output_dir, output_dir)

This removes the tmp- prefix.
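The race can be reproduced in isolation. Below is a standalone sketch (not Trainer code: push_job is a hypothetical stand-in for the background hub-push job) that forces the unlucky interleaving deterministically with an event — the rename happens before the "push" thread ever looks at the staging path:

```python
import os
import tempfile
import threading

# Hypothetical stand-in for the Trainer's background hub-push job: it
# receives the staging path and checks it, as the upload code would.
def push_job(path, renamed, errors):
    renamed.wait()  # force the unlucky interleaving deterministically
    if not os.path.isdir(path):
        errors.append(ValueError(f"Provided path: '{path}' is not a directory"))

base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-13")
final = os.path.join(base, "checkpoint-13")
os.makedirs(staging)

renamed = threading.Event()
errors = []
t = threading.Thread(target=push_job, args=(staging, renamed, errors))
t.start()

os.rename(staging, final)  # the Trainer strips the tmp- prefix here
renamed.set()              # only now does the "push" thread proceed
t.join()
# errors now holds the same ValueError seen in the real run
```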

If I inspect the trainer's push jobs afterwards, I can see the exception:

ipdb> trainer.push_in_progress.jobs
[<Future at 0x7f04be9a3b50 state=finished returned CommitInfo>, <Future at 0x7f04be9a27a0 state=finished raised ValueError>]
ipdb> trainer.push_in_progress.jobs[-1].result()
*** ValueError: Provided path: '/tmp/test_trainer/tmp-checkpoint-13' is not a directory
sidnarayanan commented Mar 1, 2024

This goes away with the branch in #29370, so this issue can be closed if that PR is merged.
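I haven't checked exactly how #29370 resolves this, but as a general sketch, one way to avoid this class of race is to block on any job that references the staging path before renaming it (push_job is again a hypothetical stand-in for the hub upload):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the hub upload; it requires the staging
# directory to still exist when it runs.
def push_job(path):
    if not os.path.isdir(path):
        raise ValueError(f"Provided path: '{path}' is not a directory")
    return "pushed"

base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-13")
final = os.path.join(base, "checkpoint-13")
os.makedirs(staging)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(push_job, staging)
    future.result()  # wait for the job referencing `staging` to finish
    os.rename(staging, final)  # safe: nothing still points at tmp-
```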
