-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] support DataSketches Theta Sketche #54568
base: main
Are you sure you want to change the base?
Conversation
… Sketches. Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
… Sketches. Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
…cks into support_dataSketches
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
Signed-off-by: chenminghua8 <[email protected]>
} | ||
}; | ||
|
||
} // namespace starrocks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
Potential undefined behavior due to uninitialized memory access when serializing the data.
You can modify the code like this:
void serialize_to_column([[maybe_unused]] FunctionContext* ctx, ConstAggDataPtr __restrict state,
Column* to) const override {
DCHECK(to->is_binary());
auto* column = down_cast<BinaryColumn*>(to);
if (UNLIKELY(!this->data(state).is_inited())) {
column->append_default();
} else {
size_t serialized_size = this->data(state).serialize_size();
std::vector<uint8_t> result(serialized_size); // Use vector to manage memory safely
size_t actual_size = this->data(state).serialize(result.data());
column->append(Slice(result.data(), actual_size));
}
}
This change uses a std::vector<uint8_t>
instead of a stack-allocated array with variable length, ensuring memory is properly initialized and managed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
fields.add(new StructField("upper_bound", Type.BIGINT)); | ||
return new ArrayType(new StructType(fields, true)); | ||
}; | ||
|
||
public List<Function> getBuiltinFunctions() { | ||
List<Function> builtinFunctions = Lists.newArrayList(); | ||
for (Map.Entry<String, List<Function>> entry : vectorizedFunctions.entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
The function DS_HLL_COUNT_DISTINCT
appears to be registered twice, which could lead to incorrect behavior if each registration has different characteristics or versions.
You can modify the code like this:
// In registerBuiltinDsFunction(), remove redundant registrations for DS_HLL_COUNT_DISTINCT
private void registerBuiltinDsFunction() {
for (Type t : Type.getSupportedTypes()) {
if (t.isFunctionType()) {
continue;
}
if (t.isNull()) {
continue; // NULL is handled through type promotion.
}
if (t.isChar()) {
continue; // promoted to STRING
}
if (t.isPseudoType()) {
continue; // promoted to pseudo
}
// Ensure no duplicate registrations for the same function.
// ds_hll_count_distinct(col)
addBuiltin(AggregateFunction.createBuiltin(
DS_HLL_COUNT_DISTINCT, Lists.newArrayList(t), Type.BIGINT, Type.VARBINARY,
true, false, true));
// ds_theta(col)
addBuiltin(AggregateFunction.createBuiltin(
DS_THETA, Lists.newArrayList(t), Type.BIGINT, Type.VARBINARY,
true, false, true));
}
}
This change ensures that the DS_HLL_COUNT_DISTINCT
function gets registered correctly without adding it multiple times unintentionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function parameters are different and need to be registered multiple times.
} | ||
}; | ||
|
||
} // namespace starrocks No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most risky bug in this code is:
Dereferencing a potentially null pointer ds_sketch_wrapper
in the update
, update_batch_single_state_with_frame
, and serialize
functions without checking if it's initialized.
You can modify the code like this:
void update(const Column* data_column, size_t row_num) const {
if (!is_inited()) return; // Check for initialization
uint64_t value = 0;
const ColumnType* column = down_cast<const ColumnType*>(data_column);
if constexpr (lt_is_string<LT>) {
Slice s = column->get_slice(row_num);
value = HashUtil::murmur_hash64A(s.data, s.size, HashUtil::MURMUR_SEED);
} else {
const auto& v = column->get_data();
value = HashUtil::murmur_hash64A(&v[row_num], sizeof(v[row_num]), HashUtil::MURMUR_SEED);
}
ds_sketch_wrapper->update(value);
}
void update_batch_single_state_with_frame(const Column* data_column, int64_t frame_start, int64_t frame_end) const {
if (!is_inited()) return; // Check for initialization
const ColumnType* column = down_cast<const ColumnType*>(data_column);
if constexpr (lt_is_string<LT>) {
uint64_t value = 0;
for (size_t i = frame_start; i < frame_end; ++i) {
Slice s = column->get_slice(i);
value = HashUtil::murmur_hash64A(s.data, s.size, HashUtil::MURMUR_SEED);
if (value != 0) {
ds_sketch_wrapper->update(value);
}
}
} else {
uint64_t value = 0;
const auto& v = column->get_data();
for (size_t i = frame_start; i < frame_end; ++i) {
value = HashUtil::murmur_hash64A(&v[i], sizeof(v[i]), HashUtil::MURMUR_SEED);
if (value != 0) {
ds_sketch_wrapper->update(value);
}
}
}
}
size_t serialize(uint8_t* dst) const {
if (!is_inited()) return 0; // Check for initialization
return ds_sketch_wrapper->serialize(dst);
}
This modification ensures that ds_sketch_wrapper
is initialized before calling its methods, preventing potential null pointer dereferences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need to check whether it has been initialized, because the method calling it has already done the corresponding processing.
Signed-off-by: chenminghua8 <[email protected]>
[FE Incremental Coverage Report]✅ pass : 29 / 36 (80.56%) file detail
|
Signed-off-by: chenminghua8 <[email protected]>
[BE Incremental Coverage Report]✅ pass : 2 / 2 (100.00%) file detail
|
Signed-off-by: chenminghua8 <[email protected]>
Quality Gate passedIssues Measures |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
Why I'm doing:
What I'm doing:
Implement the aggregation function that supports DataSketches Theta:
Fixes #50671
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: