-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
realloc(): invalid pointer
when parsing when Triton is built with clang
#2398
Comments
(@ThomasRaoux feel free to assign this to me.) |
This really looks like a bug in the MLIR parser to me. There isn't any Triton code in here. I wonder if this is fixed at LLVM HEAD and we're just on a bad LLVM revision? |
Oh, I take it back, it even says it: |
realloc(): invalid pointer
when parsing when Triton is built with clang
The parse is auto-generated from a declarative description in TableGen. That's "not allowed to crash" in theory ;) so still a MLIR bug maybe (unless there is some memory corruption before getting there, did you try running with sanitizers?) |
Aha, yup, when I build llvm with asan, it indicates its displeasure.
|
Interesting, the wrong memory access still happens in the tablegen generated parse function: that's where the crash was already. This would point to a bug in MLIR although this seems to be a commonly used case so it would be surprising.
|
are you running the debug version of MLIR? (can be done by setting |
Hm, I'm not really sure what's going on. Starting at frame `#4 0x559e43003fc4 in mlir::triton::ReduceReturnOp::parse(mlir::OpAsmParser&, mlir::OperationState&) /usr/local/google/home/jlebar/code/triton/python/build/cmake.linux-x86_64-cpython-3.11/include/triton/Dialect/Triton/IR/Ops.cpp.inc:9362:14 The relevant function is
We are inside parseTypeList() at
OK, presumably the problem is in this lambda. I changed it to this
and now it seems to work? But...surely |
Yes, I built LLVM myself with assertions enabled. |
Also confirmed that this code in Also also if there really is a bug in LLVM, how do we explain the fact that this crashes only when we build Triton with clang, irrespective of how we built LLVM itself? |
That's a good point. Are you building LLVM with the same compiler and standard lib as triton? |
actually even though this code is in LLVM since it is in a header it gets built along with triton and not with LLVM right? |
Ah, indeed. It should get compiled into whatever is the translation unit that includes the header, which is our (generated) code,
Same standard lib, which is important, but I was compiling LLVM with GCC. Maybe this is not okay somehow? I...don't really see how this is not okay, but I don't understand how exactly we are linking together MLIR+Triton. I'll try rebuilding LLVM with clang. Something really weird is going on; I just rebuilt LLVM exactly how I had done before (asan but without my
|
There is a known incompatibility between gcc and clang actually related to lambda... |
Oh, well that's almost definitely it then. Looks basically identical to https://discourse.llvm.org/t/gcc-abi-compatibility-lambda-captures/70850 ? |
Here is the bug I filed against GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109963 ; it has a reference to the clang discussion. In this case, I suspect we should just move this method out-of-line to avoid inlining and cross-compiler calls. |
Is that even worth doing? Like, this ABI incompatibility means that I can never safely pass a lambda across "compiler boundaries". Surely there are lots of places where this can happen in LLVM and MLIR, not just this one function? Almost feels like we should somehow check for it and warn/crash, but I'm also not sure how to do that. I was definitely not expecting to hit an ABI compatibility bug today. |
Does that mean that we should never be linking LLVM built with gcc with triton built with clang? |
It is a bit more tricky I believe: a lambda is never expressed in the ABI in general (because... how? Its type is anonymous right?). What's weird here is that the function is inlined and so the code that builds the lambda is done by clang (if you build triton with clang) but the actual symbol called (the lambda constructor?) is in the DSO built with gcc. |
This instance is fixed with llvm/llvm-project@76ce4736721a |
Maybe I misunderstand, but is that actually what's happening? The TU is The TU includes Eventually the lambda is wrapped in So wouldn't we have exactly the same problem in the much simpler situation of:
That is, the problem is caused anytime a lambda is passed across compiler boundaries (via function_ref or std::function or whatever)? Or am I missing something? |
I don't think so, in your example gcc compiles But let me try to write a minimal repro! |
Oh, I think I see what you're saying, Mehdi. I could write an arbitrary struct and pass it into the function_ref, and that should still work. So the fact that GCC and LLVM do something different inside the lambda shouldn't make a difference. |
Hm, I think I understand. itanium-cxx-abi/cxx-abi#141 (comment) seems to describe it pretty well. AIUI it's essentially like an ODR violation. We have two different definitions of the lambda body, one generated by clang and one generated by GCC. The linker may decide to use either of these definitions. In our case, when the function_ref type-erases the So, yeah, I think this "only" happens if:
In this case the lambda body can be compiled twice. |
This should be fixed with the last merge from upstream in #2403 |
This is an old issue that people ran into a few months ago on Slack. I'm now hitting it and wanted a place to record the debugging I'm doing.
Steps to reproduce:
This outputs the following (flakily -- sometimes it exits successfully). I think I'm getting a stacktrace now because I installed llvm-symbolize. Previously I just got
realloc(): invalid pointer
.I got the test MLIR by running one of the Triton Python tests with a patch to output the temp file that's created with Triton MLIR.
The text was updated successfully, but these errors were encountered: