Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A pair of runtime::ttmetal CQExecutor buffer map improvements (#408) #529

Merged
merged 5 commits into from
Aug 31, 2024

Conversation

kmabeeTT
Copy link
Contributor

  • Prevent duplicate Buffers from being created inside CreateBufferCommand handler by checking for existence in buffers umap.
  • Change to use buffers.erase() in DeallocateBufferCommand to actually remove the entry from buffers umap. Buffer will still be destroyed because it goes out of scope.
  • Neither of these help with the original segfault in this ticket but these were found through visual observation.

@kmabeeTT kmabeeTT requested a review from nsmithtt August 29, 2024 00:45
@kmabeeTT kmabeeTT force-pushed the kmabee/issue_408_improvements branch from 8192fa8 to 03ca6f0 Compare August 29, 2024 05:32
@kmabeeTT
Copy link
Contributor Author

Heads up - @nsmithtt, I pushed commit just now to include a quick-and-dirty hack in ttmetal.h's createBufferFromTensorRef() (the OG source of headaches for this ticket) that avoids the segfault by "correcting" the address stroed in tt-metal's BUFFER_MAP, that I hope can be merged to hold us over until proper fix gets made in metal and intregrated here (which I worry might take a week or more). I made it in such a way that it would be harmless if the metal fix + uplift happens to show up unexpectedly and it's not removed, although I will definitely keep my finger on the pulse and not forget about this. Original failing test added back to CI too.

Comment on lines +166 to +175
auto &buffer_map = tt::tt_metal::detail::BUFFER_MAP;
auto map_copy = buffer_map.value();
auto old_key = std::make_tuple(device->id(), buffer->address());
if (auto it = map_copy.find(old_key); it != map_copy.end()) {
auto new_key = std::make_tuple(device->id(), tensorRef->address());
buffer_map.insert(new_key, it->second);
buffer_map.erase(old_key);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do all of this only if (tensorRef->address() != buffer->address())?
Also, is there a reason to have a map copy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rpavlovicTT , it's pretty much guaranteed that tensorRef->address() != buffer->address(). This is a hack that reaches into tt-metal data structure to fix things up and avoid a segfault. We have a proper fix in metal in flight and ideally lands in a week, but depending on metal reviewers could take longer. We just wanted something that could unblock our silicon CI. We'll remove this whole block once the fix in metal lands and we do another metal uplift.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, temporary hack, and yes it's guaranteed tensorRef->address() != buffer->address() today, the buffer->address() picked by metal will be completely unrelated, which is why it's later (after this hack) set to be tensorRef->address().

For the question about map copy, due to how tt::tt_metal::detail::BUFFER_MAP is implemented, it only returns a copy of it's map and so if I did something like this (without storing a single copy of the map) :

if (auto it = buffer_map.value().find(old_key); it != buffer_map.value().end())

it doesn't compile:

/localdev/kmabee/mlir2/runtime/include/tt/runtime/detail/ttmetal.h:197:19: error: object backing the pointer will be destroyed at the end of the full-expression [-Werror,-Wdangling-gsl]
  197 |     if (auto it = buffer_map.value().find(old_key); it != buffer_map.value().end()) {
      |                   ^~~~~~~~~~~~~~~~~~

@nsmithtt
Copy link
Contributor

CI failing with:

python: /__w/tt-mlir/tt-mlir/runtime/include/tt/runtime/types.h:51: T &tt::runtime::detail::RuntimeCheckedObjectImpl::as(DeviceRuntime) [T = std::vector<tt::tt_metal::Device *>]: Assertion `associatedRuntime == expectedRuntime && "Associated runtime does not match expected runtime of cast"' failed.

@kmabeeTT
Copy link
Contributor Author

CI failing with:

python: /__w/tt-mlir/tt-mlir/runtime/include/tt/runtime/types.h:51: T &tt::runtime::detail::RuntimeCheckedObjectImpl::as(DeviceRuntime) [T = std::vector<tt::tt_metal::Device *>]: Assertion `associatedRuntime == expectedRuntime && "Associated runtime does not match expected runtime of cast"' failed.

Yeah... From offline chat with @jnie-TT and @tapspatel looks like ttrt-run is mixing ttnn and ttmetal binaries together but it's not being handled properly, I will have to keep test excluded for now I think, and open issue for ttrt change/fix with details. Going to remove test change from this PR now if that helps to approve it.

@kmabeeTT kmabeeTT force-pushed the kmabee/issue_408_improvements branch from 03ca6f0 to e2aff3c Compare August 29, 2024 22:51
@tapspatel
Copy link
Contributor

Taking a look at the mixing runtimes issue: #563

@tapspatel
Copy link
Contributor

fix merged: 19c5407

@kmabeeTT
Copy link
Contributor Author

Since Taps fixed mixing test issue , I'll add passing test back and try again with above fix and merge if good. @nsmithtt any other concerns here before you can approve?

Copy link
Contributor

@nsmithtt nsmithtt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@kmabeeTT kmabeeTT force-pushed the kmabee/issue_408_improvements branch 2 times, most recently from 33d3ba6 to ab4c153 Compare August 31, 2024 16:55
 - Prevent duplicate Buffers from being created inside
   CreateBufferCommand handler by checking for existence
   in buffers umap.
 - Change to use buffers.erase() in DeallocateBufferCommand to
   actually remove the entry from buffers umap. Buffer will
   still be destroyed because it goes out of scope.
 - Neither of these help with the original segfault in this ticket
   but these were found through visual observation.
 - A hack, in createBufferFromTensorRef(), remove when proper
   bug fix is made in tt-metal and propagates here.

 - Cannot remove UNSUPPORTED on TTMetal/to_layout.mlir yet because
   CI mixes ttnn/ttmetal binaries fine, but closing device fails,
   need some kind of ttrt fix
 - Update test/ttmlir/Silicon/TTMetal/tiled_reblock.mlir with CHECK to
   avoid errors and add missing flag --ttmetal-serialize-to-binary to run
   ttm binary in CI
@kmabeeTT kmabeeTT force-pushed the kmabee/issue_408_improvements branch from ab4c153 to d86260a Compare August 31, 2024 18:15
@kmabeeTT kmabeeTT merged commit fd466fc into main Aug 31, 2024
19 checks passed
@kmabeeTT kmabeeTT deleted the kmabee/issue_408_improvements branch December 8, 2024 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants