-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A pair of runtime::ttmetal CQExecutor buffer map improvements (#408) #529
Conversation
kmabeeTT
commented
Aug 29, 2024
- Prevent duplicate Buffers from being created inside CreateBufferCommand handler by checking for existence in buffers umap.
- Change to use buffers.erase() in DeallocateBufferCommand to actually remove the entry from buffers umap. Buffer will still be destroyed because it goes out of scope.
- Neither of these help with the original segfault in this ticket but these were found through visual observation.
8192fa8
to
03ca6f0
Compare
Heads up - @nsmithtt, I pushed commit just now to include a quick-and-dirty hack in ttmetal.h's |
auto &buffer_map = tt::tt_metal::detail::BUFFER_MAP; | ||
auto map_copy = buffer_map.value(); | ||
auto old_key = std::make_tuple(device->id(), buffer->address()); | ||
if (auto it = map_copy.find(old_key); it != map_copy.end()) { | ||
auto new_key = std::make_tuple(device->id(), tensorRef->address()); | ||
buffer_map.insert(new_key, it->second); | ||
buffer_map.erase(old_key); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do all of this only if (tensorRef->address() != buffer->address())
?
Also, is there a reason to have a map copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rpavlovicTT , it's pretty much guaranteed that tensorRef->address() != buffer->address()
. This is a hack that reaches into tt-metal data structure to fix things up and avoid a segfault. We have a proper fix in metal in flight and ideally lands in a week, but depending on metal reviewers could take longer. We just wanted something that could unblock our silicon CI. We'll remove this whole block once the fix in metal lands and we do another metal uplift.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, temporary hack, and yes it's guaranteed tensorRef->address() != buffer->address()
today, the buffer->address() picked by metal will be completely unrelated, which is why it's later (after this hack) set to be tensorRef->address()
.
For the question about map copy, due to how tt::tt_metal::detail::BUFFER_MAP
is implemented, it only returns a copy of it's map and so if I did something like this (without storing a single copy of the map) :
if (auto it = buffer_map.value().find(old_key); it != buffer_map.value().end())
it doesn't compile:
/localdev/kmabee/mlir2/runtime/include/tt/runtime/detail/ttmetal.h:197:19: error: object backing the pointer will be destroyed at the end of the full-expression [-Werror,-Wdangling-gsl]
197 | if (auto it = buffer_map.value().find(old_key); it != buffer_map.value().end()) {
| ^~~~~~~~~~~~~~~~~~
CI failing with:
|
Yeah... From offline chat with @jnie-TT and @tapspatel looks like ttrt-run is mixing ttnn and ttmetal binaries together but it's not being handled properly, I will have to keep test excluded for now I think, and open issue for ttrt change/fix with details. Going to remove test change from this PR now if that helps to approve it. |
03ca6f0
to
e2aff3c
Compare
Taking a look at the mixing runtimes issue: #563 |
fix merged: 19c5407 |
Since Taps fixed mixing test issue , I'll add passing test back and try again with above fix and merge if good. @nsmithtt any other concerns here before you can approve? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm!
33d3ba6
to
ab4c153
Compare
- Prevent duplicate Buffers from being created inside CreateBufferCommand handler by checking for existence in buffers umap. - Change to use buffers.erase() in DeallocateBufferCommand to actually remove the entry from buffers umap. Buffer will still be destroyed because it goes out of scope. - Neither of these help with the original segfault in this ticket but these were found through visual observation.
- A hack, in createBufferFromTensorRef(), remove when proper bug fix is made in tt-metal and propagates here. - Cannot remove UNSUPPORTED on TTMetal/to_layout.mlir yet because CI mixes ttnn/ttmetal binaries fine, but closing device fails, need some kind of ttrt fix
… that CI mixing tests issue resolved
- Update test/ttmlir/Silicon/TTMetal/tiled_reblock.mlir with CHECK to avoid errors and add missing flag --ttmetal-serialize-to-binary to run ttm binary in CI
ab4c153
to
d86260a
Compare