[FEAT] Implement standard deviation #3005

raunakab · 2024-10-06T18:36:59Z

Overview

Add a standard deviation function
- similar in implementation to how AggExpr::count and AggExpr::Mean work

Notes

The implementation differs slightly between non-partitioned and multi-partitioned dataframes:

- The non-partitioned implementation uses the simple, naive approach derived from the definition of stddev (i.e., stddev(X) = sqrt(sum((x_i - mean(X))^2) / N)).
- The multi-partitioned implementation calculates stddev(X) = sqrt(E(X^2) - E(X)^2).
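
As a quick sanity check of the two formulas above, here is a minimal sketch in plain Python. It is not the Rust code from this PR, and the function names are made up for illustration. It computes the population standard deviation both ways on a small sample and the two results agree.

```python
import math


def stddev_from_definition(xs):
    """Population stddev from the definition: sqrt(sum((x_i - mean)^2) / N)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)


def stddev_from_moments(xs):
    """Population stddev from the expanded form: sqrt(E(X^2) - E(X)^2)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return math.sqrt(mean_of_squares - mean ** 2)


# Illustrative data only: mean is 5, variance is 4.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
by_definition = stddev_from_definition(values)
by_moments = stddev_from_moments(values)
```

The second form needs only running totals (sum of squares, sum, count), which is what makes it convenient to split across partitions.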


codspeed-hq · 2024-10-06T18:47:23Z

CodSpeed Performance Report

Merging #3005 will not alter performance

Comparing feat/stddev (1874f43) with main (3f37a69)

Summary

✅ 17 untouched benchmarks

codecov · 2024-10-06T18:57:54Z

Codecov Report

Attention: Patch coverage is 86.98980% with 102 lines in your changes missing coverage. Please review.

Project coverage is 78.49%. Comparing base (3f37a69) to head (1874f43).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-dsl/src/expr/mod.rs | 79.31% | 30 Missing ⚠️ |
| src/daft-dsl/src/lit.rs | 67.46% | 27 Missing ⚠️ |
| src/daft-plan/src/logical_ops/project.rs | 0.00% | 15 Missing ⚠️ |
| src/daft-dsl/src/resolve_expr/mod.rs | 60.86% | 9 Missing ⚠️ |
| src/daft-schema/src/dtype.rs | 50.00% | 4 Missing ⚠️ |
| src/daft-table/src/lib.rs | 20.00% | 4 Missing ⚠️ |
| src/daft-dsl/src/arithmetic/tests.rs | 70.00% | 3 Missing ⚠️ |
| src/daft-dsl/src/expr/tests.rs | 95.58% | 3 Missing ⚠️ |
| daft/dataframe/dataframe.py | 60.00% | 2 Missing ⚠️ |
| daft/series.py | 33.33% | 2 Missing ⚠️ |
... and 2 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3005      +/-   ##
==========================================
+ Coverage   78.43%   78.49%   +0.05%     
==========================================
  Files         603      609       +6     
  Lines       71504    71693     +189     
==========================================
+ Hits        56086    56273     +187     
- Misses      15418    15420       +2

| Files with missing lines | Coverage Δ | |
|---|---|---|
| daft/expressions/expressions.py | `93.78% <100.00%> (+0.02%)` | ⬆️ |
| src/daft-core/src/array/ops/mean.rs | `100.00% <100.00%> (ø)` | |
| src/daft-core/src/array/ops/stddev.rs | `100.00% <100.00%> (ø)` | |
| src/daft-core/src/datatypes/agg_ops.rs | `100.00% <100.00%> (ø)` | |
| src/daft-core/src/datatypes/mod.rs | `40.00% <ø> (ø)` | |
| src/daft-core/src/series/ops/agg.rs | `75.59% <100.00%> (+3.72%)` | ⬆️ |
| src/daft-dsl/src/arithmetic/mod.rs | `100.00% <ø> (ø)` | |
| src/daft-dsl/src/functions/map/mod.rs | `92.30% <100.00%> (ø)` | |
| src/daft-dsl/src/functions/mod.rs | `84.21% <100.00%> (ø)` | |
| src/daft-dsl/src/functions/partitioning/mod.rs | `100.00% <100.00%> (ø)` | |
... and 23 more

... and 1 file with indirect coverage changes


colin-ho

cc @universalmind303, does this work with sql? can i do `SELECT stddev(value) FROM df GROUP BY group`

also, i may have missed it but could you clarify your top comment about how the implementation differs slightly for local and distributed?

src/daft-plan/src/physical_planner/translate.rs

src/daft-dsl/src/functions/python/mod.rs

src/daft-core/src/datatypes/agg_ops.rs

src/daft-dsl/src/expr/mod.rs

src/daft-core/src/array/ops/mean.rs

src/daft-core/src/utils/stats.rs

src/daft-core/src/array/ops/mean.rs

src/daft-core/src/utils/stats.rs

daft/dataframe/dataframe.py

…rtion on re-insertion of id - it is possible to have an existing key already in the map; thus shouldn't panic - keeping the count as a u64 would require casting to f64 in the loop, which leads to poor performance - instead store it as an f64 eagerly

desmondcheongzx

Two additional comments:

I think we didn't update the docs?
This is not welford's :P


raunakab · 2024-10-08T21:07:06Z

@colin-ho

The implementations differ slightly in how they compute the result.

In the non-partitioned one, I just perform the straight-shot stddev: calculate the mean, then calculate sum((x_i - mean)^2) / N, and finally take the square root of that.

In the multi-partitioned one, the above approach would require a 3-stage agg (or some weird cardinalities being passed along). Instead, I leverage the fact that the stddev formula can be expanded into stddev(X) = sqrt(E(X^2) - E(X)^2). In that situation, the first stage only needs to compute the local square sum, the sum, and the count. The second stage computes the global versions of all of those, and the final stage is a simple projection that calculates the final result from the previous aggs.
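
The staged computation described above can be sketched in plain Python. This is only an illustration of the idea, not the planner code in `translate.rs`: the function names and the list-of-lists stand-in for partitions are assumptions made for the example.

```python
import math


def local_agg(partition):
    """Stage 1: per-partition sum of squares, sum, and count."""
    return (sum(x * x for x in partition), sum(partition), len(partition))


def global_agg(partials):
    """Stage 2: add up each per-partition component to get the global values."""
    global_sqsum = sum(p[0] for p in partials)
    global_sum = sum(p[1] for p in partials)
    global_count = sum(p[2] for p in partials)
    return global_sqsum, global_sum, global_count


def final_projection(global_sqsum, global_sum, global_count):
    """Final stage: sqrt(E(X^2) - E(X)^2) computed from the global aggregates."""
    mean = global_sum / global_count
    return math.sqrt(global_sqsum / global_count - mean ** 2)


# Illustrative data only: three "partitions" of one column.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
partials = [local_agg(p) for p in partitions]
totals = global_agg(partials)
result = final_projection(*totals)
```

Each partition contributes only three numbers, so the second stage never needs the raw values or the global mean up front.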

raunakab · 2024-10-08T21:17:31Z

@desmondcheongzx Thanks for the docs reminder! Updated docs in latest commit.

desmondcheongzx

Looks good to me, thanks!

colin-ho · 2024-10-08T22:04:20Z

src/daft-plan/src/physical_planner/translate.rs

+                // where X is the sub_expr.
+                //
+                // First stage, we compute `sum(X^2)`, `sum(X)` and `count(X)`.
+                // Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage.


Suggested change

- // Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage.
+ // Second stage, we `global_sum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage.

colin-ho · 2024-10-08T22:04:30Z

src/daft-plan/src/physical_planner/translate.rs

+                //
+                // First stage, we compute `sum(X^2)`, `sum(X)` and `count(X)`.
+                // Second stage, we `global_sqsum := sum(sum(X^2))`, `global_sum := sum(sum(X))` and `global_count := sum(count(X))` in order to get the global versions of the first stage.
+                // In the final projection, we then compute `sqrt((global_sqsum / global_count) - (global_sum / global_count) ^ 2)`.


Suggested change

- // In the final projection, we then compute `sqrt((global_sqsum / global_count) - (global_sum / global_count) ^ 2)`.
+ // In the final projection, we then compute `sqrt((global_sum / global_count) - (global_sum / global_count) ^ 2)`.

Clean up code

818c076

- move all test mods into their own separate files
- if `blah.rs` had a submodule `tests`, then `blah/mod.rs` would contain the original code and `blah/tests.rs` would contain the tests
- removed all enum imports
- ran `cargo clippy --fix ...`

github-actions bot added the enhancement label Oct 6, 2024

raunakab requested review from colin-ho and jaychia October 6, 2024 18:37

jaychia reviewed Oct 6, 2024

src/daft-dsl/src/join/mod.rs (resolved)

Raunak Bhagat added 18 commits October 6, 2024 16:35

Add all structure for stddev

a53dfaa

Implement structure for local and distributed stddev

2ae9d08

Implement non-grouped stddev

c7f189c

- factored out some common logic into a `utils::stats` module
- refactored mean to use the new module

Implement grouped standard deviation

797358f

- some additional refactors to `utils::stats` module

Remove unwraps that may have panicked because of invalid first element

93b125b

- summing or counting may result in a `None` first element

Merge branch 'main' into feat/stddev

db48aa7

Add #[pyfunctions] functions to code

a55ed2b

- forgot to add these bindings

Add basic test for stddev

bd509d4

Add partition based testing

1b4d039

- this tests the non-singular-partition based implementation

Add first stage pass to stddev distributed implementation

633f486

Add StddevMerge variant to finish the second stage aggregations

9b9ac18

Implement stddev_merge todo

9669a08

- located in translate.rs
- impl'd it with an `unimplemented!()` because a user-facing stddev_merge function is not exposed

Finish distributed stddev

64bff42

Merge branch 'main' into feat/stddev

81d7f75

Edit data-type of square_sum field in to_field impl

d825a21

Fix errors in multi-partition aggregation planning

9b94626

Add some tests for stddev (single- and multi- partitioned)

70577ab

Finish tests for stddev feature

265a7a7

raunakab marked this pull request as ready for review October 8, 2024 15:33

Explicitly import typing module; fix lints

a76fade

raunakab requested review from andrewgazelka and desmondcheongzx and removed request for colin-ho October 8, 2024 16:14

Remove SquareSum since it can just be implemented as `AggExpr::Sum(…

7a5a36a

…Expr::Agg(AggExpr::BinaryOp { .. }))`

raunakab requested a review from jaychia October 8, 2024 16:32

universalmind303 reviewed Oct 8, 2024

src/daft-core/src/utils/stats.rs (outdated, resolved)

universalmind303 reviewed Oct 8, 2024

daft/daft/__init__.pyi (resolved)

colin-ho reviewed Oct 8, 2024

Raunak Bhagat added 4 commits October 8, 2024 12:22

Add debug_assertions to length checking during stats calculations

c6eba4e

Remove dead function and remove re-calculation of mean

4581104

Change name of data-type function

823e3af

desmondcheongzx reviewed Oct 8, 2024

src/daft-plan/src/physical_planner/translate.rs (resolved)

tests/dataframe/test_stddev.py (outdated, resolved)

Raunak Bhagat added 5 commits October 8, 2024 13:33

Add comment to populate_aggregation_stages; explains what each agg-…

53a0566

…stage is doing

Add docs to dataframe stddev API

e4222f5

Change assert_eq to debug_assert_eq

0c976a4

Update grouped-mean impl to use stats

65e443a

Merge branch 'main' into feat/stddev

04dcb04

raunakab requested review from desmondcheongzx, colin-ho and universalmind303 October 8, 2024 21:14

Add to docs

1874f43

github-actions bot added the documentation label Oct 8, 2024

desmondcheongzx approved these changes Oct 8, 2024

colin-ho reviewed Oct 8, 2024

colin-ho approved these changes Oct 8, 2024

raunakab merged commit 64b8699 into main Oct 8, 2024
46 checks passed

raunakab deleted the feat/stddev branch October 8, 2024 22:13

colin-ho mentioned this pull request Oct 9, 2024

Standard deviation aggregation expression #2997

Closed
