[Train] Transformer's RayTrainReportCallback failed to copy checkpoints
What happened + What you expected to happen
From an OSS user:
Hello community, I’ve recently encountered some issues while using Ray for distributed training with the Hugging Face Trainer on two servers. I use S3 to save checkpoints, which works fine on a single machine. However, when I switch to multiple machines, I’m facing a problem. My output_dir is set to ‘outputs’. On node1, the checkpoint is saved correctly in outputs/checkpoint-1, but on node2 it gets saved in outputs/tmp-checkpoint-1. This causes a problem in the Callback class, where node2 can’t find the checkpoint folder.
Here’s the code snippet of the Callback: https://github.com/ray-project/ray/blob/fce7a361807580953364e2da964f9498f3123bf9/p[…]ython/ray/train/huggingface/transformers/_transformers_utils.py
The issue is that source_ckpt_path becomes None, because transformers.trainer.get_last_checkpoint only checks for paths starting with ‘checkpoint’, and ‘tmp-checkpoint’ doesn’t match this. This leads to an error in shutil.copytree, as it can’t copy from a None directory. I’m not sure if this is a version issue.
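That lookup behavior is easy to confirm in isolation. A minimal sketch (assuming transformers’ get_last_checkpoint from transformers.trainer_utils, which only matches directories named checkpoint-<step>):

import os
from tempfile import TemporaryDirectory

from transformers.trainer_utils import get_last_checkpoint

with TemporaryDirectory() as output_dir:
    # Simulate what node2 sees: only the un-renamed staging dir exists.
    os.makedirs(os.path.join(output_dir, "tmp-checkpoint-1"))
    print(get_last_checkpoint(output_dir))  # None: "tmp-checkpoint-1" does not match

    # After the rename that happens on node1, the checkpoint is found.
    os.rename(os.path.join(output_dir, "tmp-checkpoint-1"),
              os.path.join(output_dir, "checkpoint-1"))
    print(get_last_checkpoint(output_dir))  # .../checkpoint-1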
When I modify the code like this:
def on_save(self, args, state, control, **kwargs):
    """Event called after a checkpoint save."""
    with TemporaryDirectory() as tmpdir:
        # Aggregate all the logged metrics
        metrics = {}
        for log in state.log_history:
            metrics.update(log)

        # Copy ckpt files and construct a Ray Train Checkpoint
        source_ckpt_path = transformers.trainer.get_last_checkpoint(args.output_dir)
        if source_ckpt_path is not None:
            target_ckpt_path = os.path.join(tmpdir, self.CHECKPOINT_NAME)
            shutil.copytree(source_ckpt_path, target_ckpt_path)
            checkpoint = Checkpoint.from_directory(tmpdir)
        else:
            checkpoint = None

        # Report latest metrics and checkpoint to Ray Train
        ray.train.report(metrics=metrics, checkpoint=checkpoint)
It runs normally, but I’m not sure if this will cause other issues, or if I need to report this checkpoint differently.
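For reference, ray.train.report does accept checkpoint=None, so another option is to guard on the worker rank rather than on get_last_checkpoint’s return value. A sketch of that pattern (report_from_workers is an illustrative helper, not a Ray API; assumes Ray 2.x’s ray.train.get_context):

import ray.train
from ray.train import Checkpoint

def report_from_workers(metrics: dict, tmpdir: str) -> None:
    # Every worker must call ray.train.report, but only the global-rank-0
    # worker attaches a Checkpoint; the rest report metrics with None.
    checkpoint = None
    if ray.train.get_context().get_world_rank() == 0:
        checkpoint = Checkpoint.from_directory(tmpdir)
    ray.train.report(metrics=metrics, checkpoint=checkpoint)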
Versions / Dependencies
master
Reproduction script
Issue Severity
Medium: It is a significant difficulty but I can work around it.
Before this PR, the Trainer renamed the tmp-checkpoint-* dir to checkpoint-* on all workers. That PR changed it to rename only on the global-rank-0 worker (when save_on_all_nodes=False). As a result, every node except node 0 is left with a tmp-checkpoint-* dir, and transformers.trainer.get_last_checkpoint returns None there.
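Schematically, the rename logic described above looks something like this (finalize_checkpoint and its arguments are hypothetical, for illustration only, not the actual Trainer internals):

import os

def finalize_checkpoint(staging_dir: str, final_dir: str,
                        global_rank: int, save_on_all_nodes: bool) -> None:
    # With save_on_all_nodes=False, only the global-rank-0 worker promotes
    # tmp-checkpoint-* to checkpoint-*; other nodes keep the tmp- prefix,
    # so get_last_checkpoint finds no "checkpoint-*" directory there.
    if save_on_all_nodes or global_rank == 0:
        os.rename(staging_dir, final_dir)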