Update flash_attention_fwd_benchmark.py
#2265
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
Force-pushed from 80a8f26 to 92998b2
This reverts commit de9335c.
Signed-off-by: Anatoly Myachev <[email protected]>
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=sm_scale).to(torch.float32)
atol = 1e-1 if N_CTX == 16384 else 1e-2
benchmark_suit.assert_close(triton_fn(), torch_fn(), atol=atol, rtol=1e-3, err_msg='triton to torch')
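For reference, the check above can be reproduced as a small standalone script. This is only a sketch: the shapes, the fp32 attention reference standing in for the Triton kernel, and the use of torch.testing.assert_close instead of the repo's benchmark_suit.assert_close helper are assumptions made for illustration.

```python
import torch

# Hypothetical shapes; the real benchmark sweeps Z, H, N_CTX, D_HEAD.
Z, H, N_CTX, D_HEAD = 1, 16, 1024, 64
sm_scale = 1.0 / D_HEAD**0.5
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
dtype = torch.float16 if device != "cpu" else torch.float32

q, k, v = (torch.randn(Z, H, N_CTX, D_HEAD, dtype=dtype, device=device) for _ in range(3))

def ref_fn():
    # Plain fp32 attention, standing in for the Triton flash-attention kernel.
    qf, kf, vf = q.float(), k.float(), v.float()
    p = torch.softmax(qf @ kf.transpose(-2, -1) * sm_scale, dim=-1)
    return p @ vf

# Same pattern as in the benchmark: SDPA output is upcast to fp32 before comparing.
torch_fn = lambda: torch.nn.functional.scaled_dot_product_attention(
    q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, scale=sm_scale).to(torch.float32)

# Longer sequences accumulate more low-precision rounding error, hence the looser atol.
atol = 1e-1 if N_CTX == 16384 else 1e-2
torch.testing.assert_close(torch_fn(), ref_fn(), atol=atol, rtol=1e-3)
```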
Using ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE, the available memory is doubled and there is no longer an out-of-memory error for upstream PyTorch (however, this affects performance).
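A quick way to see this effect is to compare the device memory PyTorch reports under the two hierarchy modes. This is a hedged sketch: it assumes an XPU build of PyTorch where torch.xpu.get_device_properties exposes a total_memory field (mirroring the CUDA counterpart), and the variable has to be exported before the process starts, since the Level Zero runtime reads it at initialization.

```python
import os
import torch

# Run twice, e.g.:
#   ZE_FLAT_DEVICE_HIERARCHY=FLAT      python check_xpu_memory.py
#   ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE python check_xpu_memory.py
# (the script name is arbitrary; setting the variable inside the process is too late)
mode = os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT")

if hasattr(torch, "xpu") and torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    # total_memory is assumed to be reported in bytes, as for CUDA devices.
    print(f"{mode}: device 0 reports {props.total_memory / 2**30:.2f} GiB, "
          f"{torch.xpu.device_count()} visible device(s)")
else:
    print("No XPU device available")
```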
@@ -64,7 +64,10 @@ def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fas
    # We maintain a buffer of 256 MB that we clear
    # before each kernel call to make sure that the L2
    # doesn't contain any input data before the run
    cache_size = 256 * 1024 * 1024
    factor = 1
    if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":
By increasing the cache-clearing buffer size accordingly, performance becomes roughly the same in both cases.
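The hunk above is cut off after the hierarchy check; the sketch below fills in the rest of the idea described in this comment. The factor of 2 and the way the flush buffer is allocated are assumptions based on the comment, not the exact code from the PR, and running it requires an XPU device.

```python
import os
import torch

# We maintain a buffer of 256 MB that we clear before each kernel call to make
# sure that the L2 doesn't contain any input data before the run.
cache_size = 256 * 1024 * 1024
factor = 1
if os.getenv("ZE_FLAT_DEVICE_HIERARCHY", "FLAT") == "COMPOSITE":
    # In COMPOSITE mode the tiles are exposed as one larger device, so a
    # proportionally larger buffer is needed to actually flush the caches
    # (assumed factor of 2, matching the doubled memory).
    factor = 2

# int32 elements, hence the division by 4.
cache = torch.empty(cache_size * factor // 4, dtype=torch.int, device="xpu")

def flush_cache():
    # Overwrite the buffer so the L2 holds no stale benchmark data.
    cache.zero_()
```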
CI:
Error:
torch.OutOfMemoryError: XPU out of memory. Tried to allocate 32.00 GiB. GPU 0 has a total capacity of 64.00 GiB. Of the allocated memory 32.81 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. Please use `empty_cache` to release all unoccupied cached memory.
It's strange that the total capacity is 64.00 GiB; I need to understand why (the expected capacity should be larger, in my understanding).
UPD: maybe it's related to ZE_FLAT_DEVICE_HIERARCHY (https://spec.oneapi.io/level-zero/latest/core/PROG.html#environment-variables). ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE can help with this, but for now it has been decided to leave it as is.