-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Implement standard deviation #3005
Conversation
- move all test mods into their own separate files - if `blah.rs` had a submodule `tests`, then `blah/mod.rs` would contain the original code and `blah/tests.rs` would contain the tests - removed all enum imports - ran `cargo clippy --fix ...`
CodSpeed Performance ReportMerging #3005 will not alter performanceComparing Summary
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3005 +/- ##
==========================================
+ Coverage 78.43% 78.49% +0.05%
==========================================
Files 603 609 +6
Lines 71504 71693 +189
==========================================
+ Hits 56086 56273 +187
- Misses 15418 15420 +2
|
- factored out some common logic into a util::stats module - refactored mean to use the new module
- some additional refactors to `utils::stats` module
- summing or counting may result in a `None` first element
- forgot to add these bindings
- this tests the non-singular-partition based implementation
- located in translate.rs - impl'd it with an `unimplemented!()` because a user-facing stddev_merge function is not exposed
…Expr::Agg(AggExpr::BinaryOp { .. }))`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @universalmind303, does this work with sql? can i do 'SELECT stddev(value) FROM df GROUP BY group`
also, i may have missed it but could you clarify your top comment about how the implementation differs slightly for local and distributed?
…rtion on re-insertion of id - it is possible to have an existing key already in the map; thus shouldn't panic - keeping the count as a u64 would require casting to f64 in the loop, which leads to poor performance - instead store it as an f64 eagerly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two additional comments:
- I think we didn't update the docs?
- This is not welford's :P
The implementations slightly differ in their computations. In the non-partitioned one, I just perform the straight shot stddev. This is essentially calculating the mean, and then calculating In the multi-partitioned one, doing the above approach requires a 3 stage agg (or some weird cardinalities being passed along). Therefore, I instead leverage the fact that the stddev formula can be expanded into |
@desmondcheongzx Thanks for the docs reminder! Updated docs in latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks!
// where X is the sub_expr. | ||
// | ||
// First stage, we compute `sum(X^2)`, `sum(X)` and `count(X)`. | ||
// Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage. | |
// Second stage, we `global_sum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage. |
// | ||
// First stage, we compute `sum(X^2)`, `sum(X)` and `count(X)`. | ||
// Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage. | ||
// In the final projection, we then compute `sqrt((global_sqsum / global_count) - (global_sum / global_count) ^ 2)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// In the final projection, we then compute `sqrt((global_sqsum / global_count) - (global_sum / global_count) ^ 2)`. | |
// In the final projection, we then compute `sqrt((global_sum / global_count) - (global_sum / global_count) ^ 2)`. |
Overview
AggExpr::count
andAggExpr::Mean
workNotes
Implementations differ slightly for non- vs multi- partitioned based dataframes:
stddev(X) = sqrt(sum((x_i - mean(X))^2) / N)
).stddev(X) = sqrt(E(X^2) - E(X)^2)
.