
[Train] Transformer's RayTrainReportCallback failed to copy checkpoints #42926

Closed
woshiyyya opened this issue Feb 1, 2024 · 2 comments · Fixed by #42953
Assignees: woshiyyya
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

woshiyyya (Member) commented Feb 1, 2024

What happened + What you expected to happen

From an OSS user:

Hello community, I’ve recently encountered some issues while using Ray for distributed training with Huggingface Trainer on two servers. I use s3 to save checkpoints, which works fine on a single machine. However, when I switch to multiple machines, I’m facing a problem. My output_dir is set to ‘outputs’. On node1, the checkpoint is saved correctly in outputs/checkpoint-1, but on node2, it gets saved in outputs/tmp-checkpoint-1. This causes a problem in the Callback class, where node2 can’t find the checkpoint folder.
Here’s the code snippet of the Callback:
https://github.com/ray-project/ray/blob/fce7a361807580953364e2da964f9498f3123bf9/p[…]ython/ray/train/huggingface/transformers/_transformers_utils.py

The issue is that source_ckpt_path becomes None, because transformers.trainer.get_last_checkpoint only checks for paths starting with ‘checkpoint’, and ‘tmp-checkpoint’ doesn’t match this. This leads to an error in shutil.copytree, as it can’t copy from a None directory. I’m not sure if this is a version issue.

When I modify the code like this:

    def on_save(self, args, state, control, **kwargs):
        """Event called after a checkpoint save."""
        with TemporaryDirectory() as tmpdir:
            # Aggregate all the logged metrics
            metrics = {}
            for log in state.log_history:
                metrics.update(log)

            # Copy ckpt files and construct a Ray Train Checkpoint
            source_ckpt_path = transformers.trainer.get_last_checkpoint(args.output_dir)
            if source_ckpt_path is not None:
                target_ckpt_path = os.path.join(tmpdir, self.CHECKPOINT_NAME)
                shutil.copytree(source_ckpt_path, target_ckpt_path)
                checkpoint = Checkpoint.from_directory(tmpdir)
            else:
                checkpoint = None

            # Report latest metrics and checkpoint to Ray Train
            ray.train.report(metrics=metrics, checkpoint=checkpoint)

It runs normally, but I’m not sure if this will cause other issues, or if I need to report this checkpoint differently.

Versions / Dependencies

master

Reproduction script

Issue Severity

Medium: It is a significant difficulty but I can work around it.

woshiyyya added the bug and triage labels on Feb 1, 2024
woshiyyya self-assigned this on Feb 1, 2024
woshiyyya added the train and P1 labels and removed the triage label on Feb 1, 2024
woshiyyya changed the title from "[Train] TransformerTrainer RayTrainReportCallback" to "[Train] TransformerTrainer RayTrainReportCallback failed to copy checkpoints" on Feb 1, 2024
woshiyyya changed the title from "[Train] TransformerTrainer RayTrainReportCallback failed to copy checkpoints" to "[Train] Transformer's RayTrainReportCallback failed to copy checkpoints" on Feb 1, 2024
woshiyyya (Member, Author) commented:
Here is the root cause: huggingface/transformers#28364

Before that PR, the tmp-checkpoint-* directory was renamed to checkpoint-* on all workers. The PR changed this so that the rename only happens on the global-rank-0 worker (if save_on_all_nodes=False). Therefore, on every node except node 0, the tmp-checkpoint-* directory remains, and transformers.trainer.get_last_checkpoint returns None.
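
A minimal standalone sketch of that behavior (the directory names below are made up for illustration; get_last_checkpoint is imported here from transformers.trainer_utils, the same function referenced above):

    # Sketch: get_last_checkpoint only matches directories named "checkpoint-<step>",
    # so a non-rank-0 node whose output dir still contains only "tmp-checkpoint-*"
    # gets None back, which then breaks shutil.copytree in the callback.
    import os
    import tempfile

    from transformers.trainer_utils import get_last_checkpoint

    with tempfile.TemporaryDirectory() as output_dir:
        # Simulated state of a non-rank-0 node after a save: the rename never happened.
        os.makedirs(os.path.join(output_dir, "tmp-checkpoint-1"))
        print(get_last_checkpoint(output_dir))  # None

        # Simulated state of the rank-0 node, where the dir was renamed.
        os.rename(
            os.path.join(output_dir, "tmp-checkpoint-1"),
            os.path.join(output_dir, "checkpoint-1"),
        )
        print(get_last_checkpoint(output_dir))  # .../checkpoint-1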
