-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement kurtosis_pop
UDAF
#12273
Implement kurtosis_pop
UDAF
#12273
Conversation
query R | ||
SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col); | ||
---- | ||
0.194323231917 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this function with the CLI
DataFusion CLI v41.0.0
> SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
+-----------------------+
| kurtosis_pop(tab.col) |
+-----------------------+
| 0.19432323191699075 |
+-----------------------+
I'm not sure but I guess the sqllogicttest may do some rounds for the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sqllogictest will do the rounding according to https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
floating point values are rounded to the scale of "12",
# The result is -1.153061224489787 actually | ||
query R | ||
SELECT kurtosis_pop(col) FROM VALUES (1), (2), (3), (2), (1) as tab(col); | ||
---- | ||
-1.15306122449 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This result is different from DuckDB but I'm not sure why.
D SELECT kurtosis_pop(col) FROM VALUES (1), (2), (3), (2), (1) as tab(col);
┌────────────────────┐
│ kurtosis_pop(col) │
│ double │
├────────────────────┤
│ -1.153061224489769 │
└────────────────────┘
let count_64 = 1_f64 / self.count as f64; | ||
let m4 = count_64 | ||
* (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64 | ||
+ 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2) | ||
- 3.0 * self.sum.powi(4) * count_64.powi(3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed the DuckDB way to get the divisor here.
The result will same as DuckDB but it's different from Clickhouse.
I did some test to compare the behavior between DuckDB and Clickhouse:
DuckDB
D SELECT kurtosis_pop(col) FROM VALUES (1), (10), (100), (10), (1) as tab(col);
┌─────────────────────┐
│ kurtosis_pop(col) │
│ double │
├─────────────────────┤
│ 0.19432323191699075 │
└─────────────────────┘
Clickhouse
:) SELECT kurtPop(value) FROM (SELECT arrayJoin([1, 10, 100, 10, 1]) AS value);
SELECT kurtPop(value)
FROM
(
SELECT arrayJoin([1, 10, 100, 10, 1]) AS value
)
Query id: abdea377-40b1-4437-a87a-4814f11cc866
┌─────kurtPop(value)─┐
1. │ 3.1943232319169903 │
└────────────────────┘
1 row in set. Elapsed: 0.002 sec.
Because DuckDB's kurtosis_pop calculates the population kurtosis using Fisher's definition, which results in the excess kurtosis, i.e., the value minus 3, ClickHouse directly provides the population kurtosis value without subtracting 3.
However, if we change the code like
let count_64 = 1_f64 / self.count as f64; | |
let m4 = count_64 | |
* (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64 | |
+ 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2) | |
- 3.0 * self.sum.powi(4) * count_64.powi(3)); | |
let count_64 = self.count as f64; | |
let m4 = | |
(self.sum_four - 4.0 * self.sum_cub * self.sum / count_64 | |
+ 6.0 * self.sum_sqr * self.sum.powi(2) / count_64.powi(2) | |
- 3.0 * self.sum.powi(4) / count_64.powi(3)) / count_64; |
The result will same as Clikhouse, 3.1943232319169903 - 3 = 0.1943232319169903
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could follow DuckDB in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, the implementation looks good to me.
I think it's a good idea to follow DuckDB's behavior
One thing to do is to update the function doc also https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md
} | ||
} | ||
|
||
impl Accumulator for KurtosisPopAccumulator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to add a link to the algorithm (something like wikipedia or duckdb's implementation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reminding this. I have added the doc for KurtosisPopAccumulator
and updated the function doc.
impl KurtosisPopFunction { | ||
pub fn new() -> Self { | ||
Self { | ||
signature: Signature::numeric(1, Volatility::Immutable), |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think user_defined with Float64 is more suitable here.
fn coerce_types(&self, _arg_types: &[DataType]) -> Result<Vec<DataType>> {
Ok(vec![DataType::Float64])
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@goldmedal If we handle the coercion before the function, they will be coerced to f64, therefore we just need to deal with f64 only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can take #12275 as reference, we can have signature coercible(vec![Float64])
if this function expect any type that is coercible to f64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks great! Thanks
} | ||
|
||
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
if !arg_types[0].is_null() && !arg_types[0].is_numeric() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we don't require additional check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, we have the mechanism to coerce the type implicitly. The case will work after this check is removed.
query R
SELECT kurtosis_pop(col) FROM VALUES ('1'), ('10'), ('100'), ('10'), ('1') as tab(col);
----
0.194323231917
|
||
impl Accumulator for KurtosisPopAccumulator { | ||
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
let values = &cast(&values[0], &DataType::Float64)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let values = &cast(&values[0], &DataType::Float64)?; | |
let array = values[0].as_primitive::<Float64Type>(); | |
for value in array.iter().flatten() { | |
self.count += 1; | |
self.sum += value; | |
self.sum_sqr += value.powi(2); | |
self.sum_cub += value.powi(3); | |
self.sum_four += value.powi(4); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also use as_float64_array
or as_primitive_opt
if you prefer Result than panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good. I prefer to use as_float64_array
. However, I think the &cast
can't be removed. We should cast from another type array to the float64
array first, then downcast to Float64Array
by as_float64_array
.
let count_64 = 1_f64 / self.count as f64; | ||
let m4 = count_64 | ||
* (self.sum_four - 4.0 * self.sum_cub * self.sum * count_64 | ||
+ 6.0 * self.sum_sqr * self.sum.powi(2) * count_64.powi(2) | ||
- 3.0 * self.sum.powi(4) * count_64.powi(3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could follow DuckDB in this case
|
||
impl Accumulator for KurtosisPopAccumulator { | ||
fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
let values = &cast(&values[0], &DataType::Float64)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need the cast here? 🤔
The coercion is handled in Signature::Coercible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing! Thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @goldmedal @2010YOUY01 |
Thanks @jayzhan211 @2010YOUY01 |
I filed #12625 to propose moving |
Which issue does this PR close?
Closes #12251 .
Rationale for this change
I followed the algorithm of the DuckDB implementation to implement this function. The behavior is the same but there are some precision issues for the double value.
I guess that it's also a part of #12250.
What changes are included in this PR?
Are these changes tested?
yes
Are there any user-facing changes?