Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add throughput timer configuration #5363

Merged
merged 26 commits into from
May 22, 2024

Conversation

deepcharm
Copy link
Contributor

The new "timers" section describes configuration for different timers.

Specifically, in the "throughput" section, it is possible to disable the throughput timer (enabled by default). This allows to avoid the performance degradation whenever the throughput measurement is not needed, for example in production environment.

No device synchronize() is invoked when "synchronized" is set to False (default is True). This allows to produce approximate throughput measurements with minimal performance penalty.

The new 'timers' section describes configuration for
different timers.

Specifically, in the "throughput" section, it is possible
to disable the throughput timer (enabled by default).
This allows to avoid the performance degradation whenever
the throughput measurekent is not needed, for example
in production environment.

No device synchronize() is invoked when "synchronized" is set
to False (default is True). This allows to produce approximate
throughput measurements with minimal performance penalty.
@deepcharm
Copy link
Contributor Author

We've discovered the following issues in the current implementation of the Throughput timer:

  • The timer invokes synchronize() twice on each step at start and stop.

  • Calling synchronize() ensures that all workloads running on the device complete.
    This makes sure that the measured time includes both the elapsed CPU and the device times.

  • However, being an expensive operation, synchronize() causes the performance drop of up to 20-25%
    on some accelerators (e.g. HPU). On the other hand, that avoiding the call to synchronize()
    still produces a meaningful performance estimate without performance drop.

  • Since there is no configuration flag to disable the timer, today a user pays consistent performance penalty,
    even in cases when the throughput measurement may be unnecessary (e.g. production).

The purpose of this PR is to add config options to control the above behavior.

@tjruwase tjruwase requested review from jomayeri and removed request for duli2012, awan-10 and mrwyattii April 5, 2024 22:34
deepspeed/utils/config.py Outdated Show resolved Hide resolved
@deepcharm
Copy link
Contributor Author

Hi @loadams, all the requested changes are done. If you can please review and trigger the Ci. Thanks

@nelyahu
Copy link
Contributor

nelyahu commented May 7, 2024

@loadams @mrwyattii Can you help taking this PR further? @deepcharm handled all review comments.

@deepcharm deepcharm requested a review from mrwyattii May 9, 2024 12:26
@nelyahu
Copy link
Contributor

nelyahu commented May 12, 2024

@loadams @mrwyattii @lekurile can this change be merged?

@loadams
Copy link
Contributor

loadams commented May 13, 2024

@loadams @mrwyattii @lekurile can this change be merged?

Sorry for the delay @nelyahu - I will make sure this is merged shortly.

@loadams loadams added this pull request to the merge queue May 22, 2024
Merged via the queue into microsoft:master with commit 995ba11 May 22, 2024
14 checks passed
@deepcharm deepcharm deleted the add-timers-configuration branch May 23, 2024 07:17
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
The new "timers" section describes configuration for different timers.

Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.

No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants