fix retry crash loop #1696

Open · wants to merge 3 commits into main

Changes from all commits
32 changes: 31 additions & 1 deletion llmfoundry/command_utils/train.py
@@ -640,7 +640,37 @@ def train(cfg: DictConfig) -> Trainer:
     trainer.eval()

     log.info('Starting training...')
-    trainer.fit()
+    try:
+        trainer.fit()
+    except ValueError as e:
+        msg = str(e)
+        if 'The max_duration' in msg and 'is less than or equal to the elapsed training duration' in msg and train_cfg.run_is_retry:
+            log.info(
+                'Training is already complete and detected retry. Skipping training and saving checkpoint.',
+            )
+            trainer.save_checkpoint_to_save_folder()
Collaborator:
I believe you don't want to call the Composer save, just the HF save. If you get this error, it means the last Composer checkpoint was already saved successfully.
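A minimal sketch of what this suggestion implies, using the names from the diff above (not a confirmed patch):

```python
# Sketch: on a detected retry, skip the Composer save entirely, since the
# matched error implies the last Composer checkpoint was already written,
# and fall straight through to the HuggingFaceCheckpointer handling below.
log.info(
    'Training is already complete and detected retry. '
    'Skipping training; writing HF checkpoint only.',
)
# (no trainer.save_checkpoint_to_save_folder() call here)
```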


+            hf_checkpointer_callbacks = [
+                c for c in callbacks if isinstance(c, HuggingFaceCheckpointer)
+            ]
+            if len(hf_checkpointer_callbacks) == 0:
+                log.info(
+                    'No HuggingFaceCheckpointer callback found. Skipping HF checkpoint.',
+                )
+                return trainer
+            if len(hf_checkpointer_callbacks) > 1:
+                raise ValueError(
+                    'Multiple HuggingFaceCheckpointer callbacks found, but only_hf_checkpoint was set to True. Please remove all but one HuggingFaceCheckpointer.',
+                ) from e
+
+            hf_checkpointer_callback = hf_checkpointer_callbacks[0]
+            hf_checkpointer_callback._save_checkpoint(
+                trainer.state,
+                trainer.logger,
+                upload_to_save_folder=True,
+                register_to_mlflow=True,
+            )
+            return trainer

     log.info('Done.')
     return trainer
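For context, a hedged usage sketch of the retry path this PR adds; the config file name and the orchestration layer that sets the flag are assumptions, not part of this PR:

```python
from omegaconf import OmegaConf

from llmfoundry.command_utils.train import train

# Hypothetical retry: a scheduler re-launches a run that already reached
# max_duration. With run_is_retry set, train() catches Composer's
# "max_duration ... elapsed training duration" ValueError, writes the
# HF checkpoint, and returns instead of crash-looping.
cfg = OmegaConf.load('train.yaml')  # hypothetical config path
cfg.run_is_retry = True  # would be set by the retry layer (assumption)
trainer = train(cfg)
```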
1 change: 1 addition & 0 deletions llmfoundry/utils/config_utils.py
@@ -188,6 +188,7 @@ class TrainConfig:

     # Resumption
     autoresume: bool = False
+    run_is_retry: bool = False

     # Profiling
     profiler: Optional[dict[str, Any]] = None
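Since train configs are parsed from YAML into this TrainConfig dataclass, the new flag would sit alongside autoresume in the resumption block; a minimal sketch (all other required config fields elided):

```python
from omegaconf import OmegaConf

# Field names come from the diff above; the rest of the config is elided.
cfg = OmegaConf.create("""
autoresume: true
run_is_retry: true
""")
assert cfg.run_is_retry is True
```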