Fix stable_train_samples #67

Rohan138 · 2024-11-18T12:51:09Z

Fix how stable_train_samples is calculated. This is a ROCm/transformers-specific change that adds warmup before collecting perf numbers, but it currently does not work as expected. Specifically:

https://github.com/ROCm/transformers/blob/main/src/transformers/trainer.py#L2347 skips the first 10 steps of training
https://github.com/ROCm/transformers/blob/main/src/transformers/trainer.py#L2503 is supposed to skip the samples from those first 10 steps, but it relies on args.warmup_steps, which is actually intended for learning rate warmup and defaults to 0, so no samples are skipped.

For example, if batch_size is 10, total_steps is 150, the first 10 steps take 2 seconds each, and the next 140 steps take 1 second each, then:

train_samples_per_second = (10 * 150) / (10 * 2 + 140 * 1) = 9.375
stable_train_samples_per_second (expected) = (10 * 150 - 10 * 10) / (140 * 1) = 10.000
stable_train_samples_per_second (current) = (10 * 150) / (140 * 1) = 10.714
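The three numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the example, not the code in trainer.py; all variable names here are illustrative.

```python
# Values taken from the example above.
batch_size = 10
total_steps = 150
warmup_steps = 10            # steps excluded from the "stable" measurement
warmup_step_seconds = 2.0    # each of the first 10 steps
stable_step_seconds = 1.0    # each of the remaining 140 steps

total_samples = batch_size * total_steps
warmup_samples = batch_size * warmup_steps
stable_steps = total_steps - warmup_steps

total_seconds = warmup_steps * warmup_step_seconds + stable_steps * stable_step_seconds
stable_seconds = stable_steps * stable_step_seconds

# Throughput over the whole run, warmup included.
overall = total_samples / total_seconds
# Expected stable throughput: warmup samples AND warmup time are both excluded.
expected = (total_samples - warmup_samples) / stable_seconds
# Current behavior: warmup time is excluded but warmup samples are still counted.
current = total_samples / stable_seconds

print(f"{overall:.3f} {expected:.3f} {current:.3f}")
```

The current value is inflated because the numerator still counts the samples from the skipped steps while the denominator does not count their time.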

Instead, I added a stable_train_warmup_steps argument (default=10) so that the samples from the skipped steps are excluded as intended.
With this change, pyt_huggingface_gpt2 perf changes from 559.092 to 529.131 stable_train_samples_per_second (about 5.4% lower).
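A minimal sketch of the idea behind the fix: keep the perf warmup count in its own argument instead of reusing the learning rate warmup setting. PerfArgs and stable_samples_per_second are hypothetical stand-ins for illustration, not the actual TrainingArguments class or the trainer.py code; only the argument names warmup_steps and stable_train_warmup_steps come from this PR.

```python
from dataclasses import dataclass


@dataclass
class PerfArgs:
    """Hypothetical stand-in for the training arguments."""
    warmup_steps: int = 0                # learning rate warmup; defaults to 0
    stable_train_warmup_steps: int = 10  # steps excluded from stable perf numbers


def stable_samples_per_second(args: PerfArgs, batch_size: int,
                              total_steps: int, stable_seconds: float) -> float:
    """Throughput over the steps that follow the perf warmup.

    stable_seconds is assumed to already exclude the time spent in the
    first args.stable_train_warmup_steps steps.
    """
    skipped_samples = args.stable_train_warmup_steps * batch_size
    return (total_steps * batch_size - skipped_samples) / stable_seconds


# Same numbers as the example: 140 stable steps at 1 second each.
result = stable_samples_per_second(PerfArgs(), batch_size=10,
                                   total_steps=150, stable_seconds=140.0)
print(result)
```

Using a dedicated argument means the metric no longer depends on whether a run happens to configure learning rate warmup.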

NOTE: This will affect HF perf for QA, execdb, etc.

Rohan138 marked this pull request as ready for review November 18, 2024 13:07

fix stable_train_samples

3309265

Rohan138 force-pushed the fix_stable_train_samples branch from f49dd07 to 3309265 Compare November 18, 2024 13:09
