Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for https://github.com/instructlab/training/issues/254 #273

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ashna000
Copy link

Added checks for resolving the issue reported in #254

Signed-off-by: ashna000 <[email protected]>
@ashna000
Copy link
Author

@RobotSail Can you please review this ?

@mergify mergify bot added the ci-failure label Oct 15, 2024
@mergify mergify bot removed the ci-failure label Oct 16, 2024
@@ -76,7 +76,7 @@ def get_effective_samples_per_minibatch(num_tokens_per_gpu):
padding=True,
)
batches = sampler.generate_batches()
return len(dataset) / len(batches)
return len(dataset) / len(batches) if len(batches) > 0 else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're hitting a scenario where Multipack generated no batches, then we need to either error out and have the calling function handle it (by falling back to DistributedDataSampler, or simply by preventing us from using multipack in the first place.

My recommendation here is to do this:

  • Raise an appropriate exception here (either RuntimeError or DivisionByZeroError)
  • Have the calling function check for this additional exception and fall back to the naive sampler

@RobotSail
Copy link
Member

Thanks for this PR @ashna000, this will be a really important fix ! I've left a comment on how we should handle this problem.

Copy link
Member

@nathan-weinberg nathan-weinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashna000 can you please squash your commits?

@nathan-weinberg
Copy link
Member

@ashna000 these still need to be squashed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Division by zero error when calling find_packing_max_batch_len_and_grad_accum
3 participants