
[integration] Update Ray Tune integration for Ray 2.7 #26499

Merged: 14 commits into huggingface:main, Dec 9, 2023

Conversation

@justinvyu (Contributor) commented Sep 29, 2023

What does this PR do?

Ray 2.7 introduced some backwards-incompatible API changes, which broke the Hugging Face transformers integration with Ray Tune via trainer.hyperparameter_search(backend="ray"). This PR fixes the integration to use the new APIs. Note that this means the next transformers version will no longer support ray<2.7.
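For context, a minimal sketch of the entry point this PR repairs; the model name, datasets, and search space below are illustrative placeholders, not taken from this PR:

```python
from ray import tune
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # Ray Tune instantiates a fresh model for every trial.
    return AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="./results", evaluation_strategy="epoch"),
    train_dataset=train_dataset,  # placeholder: a tokenized training dataset
    eval_dataset=eval_dataset,    # placeholder: a tokenized eval dataset
)

# The entry point whose Ray backend broke under Ray 2.7.
best_run = trainer.hyperparameter_search(
    backend="ray",
    n_trials=4,
    hp_space=lambda trial: {"learning_rate": tune.loguniform(1e-5, 1e-3)},
)
```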

I have manually tested this with the example in the Ray repo here: ray-project/ray#40125
This will be regularly tested in CI once this PR is in.

Related issue: ray-project/ray#39763

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@richardliaw

Comment on lines -299 to -307 (removed code):

```python
if "keep_checkpoints_num" in kwargs and kwargs["keep_checkpoints_num"] > 0:
    # `keep_checkpoints_num=0` would disable checkpointing
    trainer.use_tune_checkpoints = True
    if kwargs["keep_checkpoints_num"] > 1:
        logger.warning(
            f"Currently keeping {kwargs['keep_checkpoints_num']} checkpoints for each trial. "
            "Checkpoints are usually huge, "
            "consider setting `keep_checkpoints_num=1`."
        )
```
@justinvyu (Author):
Previously, you needed to set a number of checkpoints to keep in order to enable checkpointing at all; now this is just an optional argument.
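For illustration, a hedged before/after sketch of that change, using the pre-2.7 tune.run argument and the Ray 2.7 CheckpointConfig; the exact wiring inside the integration may differ:

```python
from ray import tune
from ray.train import CheckpointConfig, RunConfig

# Before (ray < 2.7): checkpointing only happened when a keep count was set:
#   tune.run(trainable, keep_checkpoints_num=1)

# After (ray >= 2.7): the keep count is an optional knob on CheckpointConfig;
# leaving num_to_keep unset keeps every checkpoint the trial reports.
tuner = tune.Tuner(
    trainable,  # placeholder: the Tune-wrapped HF training function
    run_config=RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1)),
)
results = tuner.fit()
```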

@krfricke (Contributor) left a comment:

Can we confirm that restoration works correctly? Are there tests for this?

(Two review threads on src/transformers/hyperparameter_search.py, now outdated and resolved.)
Comment on lines 254 to 257:

```python
with checkpoint.as_directory() as checkpoint_dir:
    for subdir in os.listdir(checkpoint_dir):
        if subdir.startswith(PREFIX_CHECKPOINT_DIR):
            checkpoint_path = os.path.join(checkpoint_dir, subdir)
```
@krfricke (Contributor):
Doesn't the checkpoint_dir get removed after going out of scope?

I think we need checkpoint.to_directory() here instead.
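For illustration, a hedged sketch of the difference between the two Checkpoint APIs (the checkpoint path below is a placeholder):

```python
from ray.train import Checkpoint

checkpoint = Checkpoint.from_directory("/tmp/my_checkpoint")  # placeholder

# as_directory() is a context manager: the yielded path may be a temporary
# materialization that gets cleaned up when the block exits, so any path
# captured inside it can dangle afterwards.
with checkpoint.as_directory() as checkpoint_dir:
    ...  # only read files under checkpoint_dir inside this block

# to_directory() writes the checkpoint to a directory that persists after
# the call returns, so the returned path stays valid for later restoration.
checkpoint_path = checkpoint.to_directory()
```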

@justinvyu (Author):

Good catch; fixed, and tested restoration with an updated test (in the Ray repo, which I'll link).

(One review thread on src/transformers/integrations/integration_utils.py, now outdated and resolved.)
Signed-off-by: Justin Yu <[email protected]> (4 commits)
@justinvyu (Author):

Tagging @muellerzr @pacman100 -- who can help get this PR in? Thanks!

@ArthurZucker (Collaborator) left a comment:

Thanks for the fix 🤗 ! I guess TrainerHyperParameterRayIntegrationTest is the test that was broken in transformers. Looks good to me, pinging @pacman100 and @muellerzr to have a second look!

Comment on lines +1181 to +1182 (added code):

```python
metrics = metrics.copy()
self.objective = self.compute_objective(metrics)
```
@ArthurZucker (Collaborator):
Not sure I understand this change.

@justinvyu (Author):

Oh, this is because the metrics get modified later with some extra keys, so I want to keep the original one intact: https://github.com/huggingface/transformers/pull/26499/files/9fbd0ec915455293eb4a1730e6a473d5c11b1151#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR1199
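An illustrative sketch of the rationale (not the actual Trainer code; the dict contents are made up):

```python
# Copying first keeps the caller's dict intact when extra keys are added later.
original = {"eval_loss": 0.25}

metrics = original.copy()    # work on a private copy
metrics["objective"] = 0.25  # extra keys added later for reporting

assert "objective" not in original  # the caller's dict is unchanged
```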

@muellerzr (Contributor) left a comment:

Overall LG2M; I agree, though, that it should perhaps stay as a one-liner instead of being split into two :)

@pacman100 (Contributor) left a comment:

Thank you @justinvyu for updating ray tune integration! 🚀✨

@muellerzr (Contributor):

@justinvyu FYI, it seems like there are still some errors/issues with this: #27598

@justinvyu (Author):

Ah, thanks for confirming the fix with the user; I'll take a look at that.

@justinvyu (Author):

@muellerzr I took a look at the user issue, and it's actually something in the example that should be changed. The limitation of the integration is that non-serializable objects (like metrics) cannot be supplied to the HF trainer. This PR is good to merge, though.
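As an illustration of that limitation, a hedged sketch of the usual workaround; the evaluate-based metric here is an assumption, not from this thread. Build non-serializable objects inside the functions Ray executes rather than capturing them from the outer scope:

```python
import numpy as np
import evaluate  # assumption: the `evaluate` library is used for metrics

# Problematic: capturing a non-picklable object (e.g. a loaded metric) from
# the enclosing scope forces Ray to serialize it when shipping each trial.
# metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # Safer: construct the object inside the function Ray executes, so the
    # trial closure stays serializable.
    metric = evaluate.load("accuracy")
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```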

@muellerzr (Contributor):

@justinvyu a quick `make style; make quality` should fix the failures; then we should be set and good :) Thanks for your patience!

Signed-off-by: Justin Yu <[email protected]>
@muellerzr (Contributor) left a comment:

Thanks! Failures look to be unrelated? cc @ArthurZucker to make sure


@ArthurZucker (Collaborator) left a comment:

The failures are indeed unrelated; feel free to rebase and I'll merge then!

@ArthurZucker (Collaborator):

The test is also failing on main, merging 😉

@ArthurZucker merged commit 5fa66df into huggingface:main on Dec 9, 2023 (20 of 22 checks passed).
@justinvyu deleted the update_ray_integration branch on December 14, 2023.

Successfully merging this pull request may close these issues.

  • Hyperparameter search error with Ray tune
  • Ray Release 2.7.0 breaks Trainer.hyperparameter_search
6 participants