-
Notifications
You must be signed in to change notification settings - Fork 580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: API to synchronize devices #434
base: master
Are you sure you want to change the base?
Conversation
cc: @Jianhui-Li, @jzhoulon, @yiqianglee, @kulinseth, @wchao1115, and @PatriceVignola for PluggableDevices. |
rfcs/20221025-sync-devices.md
Outdated
|
||
## Objective | ||
|
||
This document proposes a simple API to synchronize TensorFlow devices: `tf.sync_devices()`. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put into this namespace. tf.config.experimental.sync_devices()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sun51 It is not a config call. It could be call at every iterations:
for step in range(NUM_STEPS):
data = ds_iter.next()
start_t = time.time()
rslt = model(data)
tf.sync_devices()
print(f"Time: {time.time() - start_t}")
I don't think config
is the appropriate namespace for such an API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually changing the configuration of runtime. Meanwhile, I am still not quite clear about the semantics of this API. In the above case, model() will include many ops, it is called before the tf.sync_device() is even called. How does this work? Once you sync device, can you set to normal config(async) later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you there is a misunderstanding on how the API works.
It is actually changing the configuration of runtime.
No, the API introduces a blocking call that awaits for the OPs scheduled on every devices to be cleared. There is no "configuration" being updated or modified. It's just an API call that awaits for the GPUs to be done. Hence this has nothing to do in the config
namespace IMHO.
How does this work?
As said above, it awaits for the pipeline of scheduled OPs to clear. In short, "everyone finishes what they are doing, let me know when you're done and we move on when everybody is done."
Feel free to reach to @reedwm inside Google, he designed the original solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chatted with reed, put into tf.test.XXX namespace is something we both think reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I edited the RFC to put under tf.test.sync_devices.
print(f'Time taken: {time.time() - start}') | ||
``` | ||
|
||
This can be fixed by calling `y.numpy()` which forces the Python thread to wait until the matmul finishes, but this also adds a device-to-host transfer. The benchmark only wants to measure the matmul time, not the device transfer time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to measure exactly? because when we call tf.matmul(), even with sync, it also includes all the Python overhead, C++ launch overhead, and then GPU execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to measure step time (including all the potential overheads) from start to finish.
Having the most accurate picture from start to finish of a single "step" or "compute unit". Without having to force a .numpy()
that would memcpyDtoH.
LGTM, thanks |
@reedwm can you please link the commit that adds the feature or the PR/CL here ? Thanks |
There was an RFC for this API: tensorflow/community#434 PiperOrigin-RevId: 504062646
Objective
This document proposes a simple API to synchronize TensorFlow devices:
tf.test.sync_devices()
. This is important in accurately measuring execution time in TensorFlow GPU benchmarks, especially in microbenchmarks./CC @DEKHTIARJonathan @rohan100jain