find_packing_max_batch_len_and_grad_accum
It seems like there's a division by zero error that can still occur with the right conditions.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/oleg/Programming/training/src/instructlab/training/main_ds.py", line 934, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/oleg/Programming/training/src/instructlab/training/main_ds.py", line 546, in main
[rank0]:     packing_max_batch_len, grad_accum = find_packing_max_batch_len_and_grad_accum(
[rank0]:                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oleg/Programming/training/src/instructlab/training/multipack_sampler.py", line 160, in find_packing_max_batch_len_and_grad_accum
[rank0]:     packing_max_batch_len = find_max_pack_len_with_padding(
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oleg/Programming/training/src/instructlab/training/multipack_sampler.py", line 86, in find_max_pack_len_with_padding
[rank0]:     avg_bs_per_minibatch = get_effective_samples_per_minibatch(
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/oleg/Programming/training/src/instructlab/training/multipack_sampler.py", line 79, in get_effective_samples_per_minibatch
[rank0]:     return len(dataset) / len(batches)
[rank0]:            ~~~~~~~~~~~~~^~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: division by zero
To reproduce, use TrainingArgs like so:
train_args = TrainingArgs(
    data_path="path/to/data",
    is_padding_free=False,
    ckpt_output_dir='checkpoints',
    data_output_dir='/dev/shm',
    max_seq_len=70,
    max_batch_len=70,
    effective_batch_size=3840,
    save_samples=20_000,
)
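
The last frame of the traceback divides `len(dataset)` by `len(batches)`, so the error means the sampler returned no batches for this configuration. As a rough illustration only, a guard along these lines would turn the bare ZeroDivisionError into a more actionable message. The helper name `safe_samples_per_minibatch` and its signature are hypothetical and are not taken from instructlab; this is a sketch of the idea, not the project's actual fix.

```python
from typing import Sequence


def safe_samples_per_minibatch(num_samples: int, batches: Sequence) -> float:
    """Average number of samples per minibatch.

    Hypothetical helper (not part of instructlab): it mirrors the
    `len(dataset) / len(batches)` expression from the traceback, but
    raises a descriptive error when the sampler produced no batches.
    """
    if len(batches) == 0:
        # Assumption: an empty batch list means the packing settings
        # cannot produce a single batch for this dataset.
        raise RuntimeError(
            "Sampler produced no batches; check max_batch_len, max_seq_len "
            "and effective_batch_size against the dataset."
        )
    return num_samples / len(batches)
```

With two batches over ten samples this returns 5.0; with an empty batch list it raises RuntimeError instead of ZeroDivisionError.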
Fix for instructlab#254
c84359e
Signed-off-by: ashna000 <[email protected]>