More rigorous treatment of floats in tests #13590
Conversation
Thank you @leoyvens -- this looks epic. I will review this PR but I may not have a chance to do so for a day or two. It looks awesome
Makes sense to me 👍
There was the following test failure on amd64 and win64:

This is surfacing non-determinism across target platforms, which is expected behaviour for Rust std.

So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it; it's what rustc uses when targeting WASM. To gain confidence that we'd not be risking any significant performance regression, I analysed some benchmark results taken from the CI of the metallic-rs project (credit to @jdh8). It benchmarks only f32, not f64. From this data, I made a chart comparing std and libm results: libm and std seem to have similar performance. If we value portability, I'd propose that we switch to libm.
```
3 1956035476 9.590891361237
4 16155718643 9.531112968922
5 6449337880 7.074412226677
1 -438598674 12.15325379371643
```
16 digits is overspecified. Double arithmetic is inherently imprecise, so we should compare within an epsilon. Truncating digits is not enough, given 2 ~= 1.9999999999999999.
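A minimal Rust sketch of the distinction being made here (both helpers, `approx_eq` and `truncate_digits`, are hypothetical illustrations, not code from this PR):

```rust
/// Relative epsilon comparison: tolerance scales with magnitude.
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() <= eps * a.abs().max(b.abs()).max(1.0)
}

/// Truncate (not round) a decimal rendering to `digits` fractional digits.
fn truncate_digits(v: f64, digits: usize) -> String {
    let s = format!("{:.17}", v);
    let dot = s.find('.').unwrap();
    s[..dot + 1 + digits].to_string()
}

fn main() {
    let nearly_two = 2.0_f64 - f64::EPSILON; // 1.9999999999999998
    // Digit truncation calls these two values different:
    assert_eq!(truncate_digits(2.0, 12), "2.000000000000");
    assert_eq!(truncate_digits(nearly_two, 12), "1.999999999999");
    // An epsilon comparison correctly treats them as equal:
    assert!(approx_eq(nearly_two, 2.0, 1e-12));
}
```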
And we need the results of sqllogictest complete mode to be the same across different platforms. So the old scheme of rounding to a certain precision seems to be a good solution.
> And we need the results of sqllogictest complete mode to be the same across different platforms. So the old scheme of rounding to a certain precision seems to be a good solution.

I agree
Why would we want to use it?
I think portability is not necessary, and PostgreSQL doesn't guarantee it either.
There are myths and truths to floating-point reproducibility across platforms. Some facts I've gathered while working on this:

For many real-world DataFusion use cases, floating-point operations are reproducible. In my use case, I care about reproducibility, so it's not just a tests problem; it's a product problem that I'd like the tests to cover. Epsilon comparison should be used for any test that surfaces a concrete reproducibility problem due to point 4, non-associativity. So far, the only such situation surfaced by local and CI testing was the tests for the
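As a sketch of the kind of epsilon comparison meant here (a hypothetical helper for test assertions, not the actual code this PR adds to `conversion.rs`):

```rust
/// Hypothetical relative-epsilon comparison for test assertions; the
/// tolerance of 1e-12 is an illustrative choice, not this PR's constant.
fn floats_match(expected: f64, actual: f64) -> bool {
    if expected.is_nan() || actual.is_nan() {
        // Treat NaN as equal to NaN so tests can still assert on NaN outputs.
        return expected.is_nan() && actual.is_nan();
    }
    // Scale the tolerance by magnitude so large and small values are
    // compared with the same relative precision.
    let scale = expected.abs().max(actual.abs()).max(1.0);
    (expected - actual).abs() <= scale * 1e-12
}

fn main() {
    // Differences at the last couple of significant digits are tolerated:
    assert!(floats_match(12.15325379371643, 12.153253793716432));
    // A genuinely different value is not:
    assert!(!floats_match(9.590891361237, 9.591));
    assert!(floats_match(f64::NAN, f64::NAN));
}
```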
These two points imply that a database's floating-point arithmetic results are not supposed to be reproducible. Let's consider an example:

```sql
CREATE OR REPLACE TABLE doubles(a double);
INSERT INTO doubles VALUES (1e30);
INSERT INTO doubles VALUES (-1e30);
INSERT INTO doubles VALUES (1);
INSERT INTO doubles VALUES (-1);
```

If we now run `SELECT sum(a) FROM doubles;`, the database can return 0, 1 or -1.
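The order dependence can be checked directly in Rust; a minimal sketch using the same values as the SQL example above:

```rust
fn main() {
    // Same values as the `doubles` table in the example.
    let vals = [1e30_f64, -1e30, 1.0, -1.0];

    // Left-to-right order: (1e30 + -1e30) = 0, then + 1 - 1 = 0.
    let left_to_right: f64 = vals.iter().sum();
    assert_eq!(left_to_right, 0.0);

    // A different order: 1e30 + 1.0 rounds back to 1e30 (the 1.0 is
    // absorbed), then -1e30 cancels it, leaving only the final -1.0.
    let reordered = ((1e30_f64 + 1.0) + -1e30) + -1.0;
    assert_eq!(reordered, -1.0);
}
```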
I agree with @jonahgao and @findepi -- ensuring exact floating-point reproducibility is not something most database systems do, I think due to the (performance) cost of doing so.

I think this is the core challenge of why getting reproducible results on a multi-threaded processing system like DataFusion will be hard without major changes. To ensure the same floating-point results requires ensuring the same order of calculation. It implies, for example, that the order of processing intermediate aggregate rows must be the same, even if one core is done before another. So TLDR is
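The point about merge order can be sketched in Rust: each partition produces a partial sum, and the final value depends on the order in which the partials are combined (illustrative values, not DataFusion code):

```rust
fn main() {
    // Partial sums produced by three hypothetical partitions.
    let partials = [1e30_f64, -1e30, 1.0];

    // If partitions 0 and 1 finish first: (1e30 + -1e30) + 1.0 = 1.0.
    let order_a = (partials[0] + partials[1]) + partials[2];
    assert_eq!(order_a, 1.0);

    // If partitions 0 and 2 finish first: (1e30 + 1.0) + -1e30 = 0.0,
    // because the 1.0 is absorbed into 1e30 before the cancellation.
    let order_b = (partials[0] + partials[2]) + partials[1];
    assert_eq!(order_b, 0.0);
}
```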
Thank you very much for this contribution @leoyvens -- it is great to see you tackling and working on improving the tests. 🙏
```diff
@@ -92,7 +92,6 @@
 arrow-ipc = { version = "53.3.0", default-features = false, features = [
 arrow-ord = { version = "53.3.0", default-features = false }
 arrow-schema = { version = "53.3.0", default-features = false }
 async-trait = "0.1.73"
-bigdecimal = "0.4.6"
```
It is great to remove bigdecimal
Which issue does this PR close?
My motivation was to improve DF testing of float outputs, even at the least-significant digits.

The situation in #13569 seemed a bit uncomfortable, and generally the SLTs were formatting floats and decimals in complicated ways. So I took a shot at tackling it "the hard way", which involved changing virtually all SLTs that print out floats.
Rationale for this change

The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see in `datafusion-cli`, when writing float outputs to CSV, or when using `arrow_cast` to convert a float to string. Tests do become more sensitive to minor changes in the handling of floats by DataFusion. But if the user will have to deal with it, then the tests should also have to deal with it.

The interaction with the pg_compat tests is an interesting one. The Postgres `avg` over integers returns a `numeric`, while the DataFusion `avg` over integers returns a `Float64`. The SLTs previously dealt with this by rounding everything to 12 decimal digits. This PR deals with it by testing the value within an epsilon, with the justification of allowing us to entirely remove the dependency on the `bigdecimal` crate.

The `regr_*` family of UDAFs is also interesting. If the input is split across multiple batches, the output is not deterministic, so this PR also uses an epsilon there.

What changes are included in this PR?
This PR makes code changes only to `sqllogictest/src/engines/conversion.rs`, but the fallout to `.slt` files makes the diff large.

- Floats are now formatted with `ryu`, which is what arrow-rs uses, improving consistency with user-visible outputs.
- The `avg` of integers is tested within an epsilon, rather than relying on implicit truncation by the SLT engine.
- `bigdecimal` is removed (this was a test-only dependency).

Are these changes tested?

The changes are to the tests. Who tests the tests? 😄

Are there any user-facing changes?

No.