Fix Compiled Parallel Aggregations #1211
base: master
Conversation
Some preliminary results from the benchmarks supported by OLTPBench. About to run through the TPC-H benchmarks with the setup suggested by @thepinetree. Benchmarks were run on Ubuntu 20.04 server instances (c5.9xlarge) on AWS.
Results for the benchmarks are averaged over 5 runs.

[Tables: average bytecode count; benchmark results (interpreted); benchmark results (compiled)]
Results from running the TPC-H benchmarks, comparing the current state of the branch. Benchmarks were run on Ubuntu 20.04 server instances (c5.9xlarge) on AWS with the following specs:
[Interpreted and compiled TPC-H benchmark results]
In both interpreted and compiled execution modes, we see a very slight decrease in the overall runtime of the benchmark. Obviously it makes no sense that adding additional bytecode to be interpreted / compiled makes the system run faster; I think the takeaway from these results is that the overhead of introducing the additional bytecode (rather, of not eliding it) is lost in measurement noise. Interested to hear what others' takes may be. I also think it may be worthwhile at this point to mark this "ready for CI" to determine whether perfguard uncovers anything different, although I suppose it has also been showing spurious changes in benchmark performance, so it is difficult to arrive at a definitive answer.
The other thing to consider here is the fact that, currently, we don't have any automatic testing mechanism to verify this functionality. For instance, we can now run queries like
where we couldn't before, but this behavior is not verified by either existing SQL tests or unit tests, because neither of these forms of tests is run in anything other than interpreted execution mode. We have issues open for both of these: #1223 for the former and #1165 for the latter. From what I can tell, looking at the test infrastructure and the corresponding relevant implementation details (in the DBMS itself), support for both of these will require some minor refactors:
Nice analysis! Just a few points:
Thanks for the feedback @pmenon!
Bytecode counts for individual TPC-H queries. Need to qualify these results slightly because they are computed based on the query plans that @malin1993ml manually constructed (see this branch for reference).
The percentage increase in the number of bytecode instructions is relatively consistent across the individual queries (or at least, more consistent than I anticipated). Furthermore, the increases are also larger in magnitude than I expected, at least based on the previous runtime benchmark results - a ~10% increase in the number of bytecode instructions seems less trivial than what I have been making it out to be. This is likely the type of result that you may have been expecting to see @pmenon, or at least the kind that you'd want to be aware of before anything gets merged.

I am not super familiar with the internals of the interpreter, so I can't say with any authority how worrying a 10% increase in bytecode instructions is in that context. However, I can definitely say that such an increase is worrying in the context of machine code: each of these additional bytecodes corresponds to additional machine code in the compiled query.

Is eliding this optimization prohibitively expensive? Should I revisit the solution? The alternative that comes to mind is somehow manually "collapsing" the inconsistencies at the LLVM type-system level during bytecode generation, but I haven't actually pursued this yet to see how feasible it would be.
Proposing a new approach to solving this problem. While the above does constitute a fix, it does not actually address the root cause. Namely, it is not the fact that we are eliding the bytecode that is the problem in itself. Per usual, the setup queries:
And the query of interest:
For this query, we generate the following bytecode for `MergeAggregates`:
So far, so good. As we have seen before, it is the final `Lea` that causes the problem. Within the LLVM engine, the relevant translation logic looks like:

```cpp
llvm::Type *pointee_type = args[1]->getType()->getPointerElementType();
int64_t offset = llvm::cast<llvm::ConstantInt>(args[2])->getSExtValue();
if (auto struct_type = llvm::dyn_cast<llvm::StructType>(pointee_type)) {
  const uint32_t elem_index = dl.getStructLayout(struct_type)->getElementContainingOffset(offset);
  llvm::Value *addr = ir_builder->CreateStructGEP(args[1], elem_index);
  ir_builder->CreateStore(addr, args[0]);
}
```

So now apply this to the final `Lea`:
The value stored to `tmp4` is therefore the wrong address. So, I argue, we actually have a logic bug here in the translation of the `Lea` bytecode: the offset-to-element computation only descends one level into the structure. Accordingly, the fix should occur here, at the level of bytecode -> IR translation, rather than at the level of bytecode generation (as I was previously pursuing). I wrote a proof-of-concept fix that rewrites the logic here to account for nested structures. One complication that arises, however, is the fact that some of the types that are non-structural at the level of bytecode are nonetheless represented as structural types in the LLVM IR (e.g. SQL value types), so the recursion must know where to stop descending. Any way it is implemented, this new approach has a number of benefits over the old one:
I disagree with this. Two things:
Came back to this PR today in light of the fact that it is blocking a number of other PRs, namely #1470 and #1402. I sat down and fired up the usual query to break the DBMS and dump the bytecode so I could look at it:
Imagine my surprise then when the final query returned the correct results...
So someone beat me to the punch (not that it was that difficult, this PR has been open since September 2020...); the questions were: who? and how? Further investigation reveals that it was none other than @lmwnshn and his PR #1482. Now, hats off to Wan for his PR and the fact that it gets these queries to run successfully, but unfortunately it really circumvents the issue, rather than addressing it directly. Prior to Wan's PR, we generate the following bytecode for `MergeAggregates`:
And after Wan's PR, this bytecode becomes:
The salient point here is the update to the way we compute the address of the second aggregate. Upside: we can now execute compiled parallel aggregations successfully, but there is likely still a logic bug (or oversight) in code generation that we don't hit under any of the current tests.
at times like this, github needs a 💩 react |
Per the discussion in standup, although the updates by @lmwnshn fixed the bug as we currently experience it, I am still going to look at addressing the root cause. High-level plan now is:
Closes #1099.
Parallel static aggregations fail in compiled mode. A simple procedure to reproduce follows:
Executing the final query produces the following error messages:
The majority of this error message is produced by the `Verifier::visitStoreInst()` function in `Verifier.cpp` within LLVM. As the message suggests, we generate IR with a type mismatch between the stored value and the location (address) at which it is stored. The (simplified) syntax for an LLVM IR `store` looks like:

```
store <ty> <value>, <ty>* <pointer>
```

and the LLVM type system (enforced in part by `Verifier.cpp`) requires that "the type of the `<pointer>` operand must be a pointer to the first class type of the `<value>` operand".

The issue arises in the `MergeAggregates` function. We produce the following bytecode for `MergeAggregates`:

The error occurs at the final `Lea` bytecode: `Lea &tmp4, pipelineState, 16`. Within the LLVM engine, the `Lea` TPL instruction is implemented in terms of a `store` in LLVM IR.

We have that the `pipelineState` type looks like:

While the type of the local `tmp4` is simply `*IntegerSumAggregate`.

To construct the `store` instruction from the bytecode, we perform the following procedure:

1. Read the offset from the `Lea` instruction (16)
2. Compute the index of the element of `pipelineState` that contains this offset; `pipelineState` is a structure with only a single member (itself a structure), so the result of this computation is the sole member of `pipelineState`, with type `struct {IntegerSumAggregate, IntegerSumAggregate}`
3. Compute the address of this element (a GEP into `pipelineState`); this calculation results in 0 as the index, and the computed address has type `*struct {IntegerSumAggregate, IntegerSumAggregate}`
4. Construct a `store` instruction that stores this computed pointer to `&tmp4`

The resulting `store` IR therefore looks like (in pseudo-IR):

`Verifier::visitStoreInst()` contains an assertion that verifies (essentially) the following type constraint:

We fail this assertion because we attempt to store a pointer to the structure `struct {IntegerSumAggregate, IntegerSumAggregate}` to a location that expects a pointer to a lone `IntegerSumAggregate`.