More rigorous treatment of floats in tests #13590

leoyvens · 2024-11-28T10:08:48Z

Which issue does this PR close?

My motivation was to improve DF testing of float outputs, even at the least-significant digits.

The situation in #13569 seemed a bit uncomfortable, and generally the SLTs were formatting floats and decimals in complicated ways. So I took a shot at tackling it "the hard way", which involved changing virtually all SLTs that print out floats.

Rationale for this change

The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see, in datafusion-cli, when writing float outputs to CSV or when using arrow_cast to convert a float to string.

Tests do become more sensitive to minor changes in the handling of floats by DataFusion. But if the user will have to deal with it, then the tests should also have to deal with it.

The interaction with pg_compat tests is an interesting one. The Postgres avg over integers returns a numeric, while the DataFusion avg over integers returns a Float64. The SLTs previously dealt with this by rounding everything to 12 decimal digits. This PR deals with it by testing the value within an epsilon, which also allows us to remove the dependency on the bigdecimal crate entirely.

The regr_* family of UDAFs is also interesting. If the input is split across multiple batches, the output is not deterministic, so this PR also uses an epsilon there.

What changes are included in this PR?

This PR changes code only in sqllogictest/src/engines/conversion.rs, but the fallout to .slt files makes the diff large.

SLTs now format floats using ryu, which is what arrow-rs uses, improving consistency with user-visible outputs.
pg_compat tests now test avg of integers within an epsilon, rather than relying on implicit truncation by the SLT engine.
Dependency on bigdecimal is removed (this was a test-only dependency).

Are these changes tested?

The changes are to the tests. Who tests the tests? 😄

Are there any user-facing changes?

No.

alamb · 2024-11-28T11:06:20Z

Thank you @leoyvens -- this looks epic. I will review this PR but I may not have a chance to do so for a day or two. It looks awesome

jonahgao · 2024-11-29T02:08:25Z

The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see, in datafusion-cli, when writing float outputs to CSV or when using arrow_cast to convert a float to string.

Makes sense to me 👍

leoyvens · 2024-11-29T13:03:30Z

There was the following test failure on amd64 and win64:

External error: query result mismatch:
[SQL] select acos(0), acos(0.5), acos(1);
[Diff] (-expected|+actual)
-   1.5707963267948966 1.0471975511965976 0
+   1.5707963267948966 1.0471975511965979 0
at test_files/scalar.slt:93

This is surfacing non-determinism across target platforms. This is expected behaviour for Rust std. For acos, and many other float math functions, the Rust std docs say:

The precision of this function is non-deterministic.
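To put the mismatch above in perspective, here is a minimal sketch that measures how far apart the two reported acos(0.5) values are. The helper name `ulp_distance` is made up for illustration, and the bit-pattern subtraction is only meaningful for finite inputs of the same sign.

```rust
/// Number of f64 steps separating `a` and `b`.
/// Hypothetical helper; only valid for finite inputs with the same sign.
fn ulp_distance(a: f64, b: f64) -> u64 {
    a.to_bits().abs_diff(b.to_bits())
}

fn main() {
    // The expected and actual values from the failing scalar.slt test.
    let expected = 1.0471975511965976_f64;
    let actual = 1.0471975511965979_f64;

    println!("absolute difference: {:e}", (expected - actual).abs());
    println!("ULP distance: {}", ulp_distance(expected, actual));
}
```

The two values are distinct doubles that sit at most a couple of representable steps apart, which is the kind of difference a platform-specific math library can produce.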

So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it, it's what rustc uses when targeting WASM.

To gain confidence that we'd not be risking any significant performance regression, I analysed some benchmark results taken from the CI of the metallic-rs project (credit to @jdh8). It benchmarks only f32, not f64. From this data, I made a chart comparing std and libm results:

libm and std seem to have similar performance. If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

findepi · 2024-11-29T13:19:32Z

datafusion/sqllogictest/test_files/aggregate_skip_partial.slt

-3 1956035476 9.590891361237
-4 16155718643 9.531112968922
-5 6449337880 7.074412226677
+1 -438598674 12.15325379371643


16 digits is overspecified. Double arithmetic is inherently imprecise, so we should compare with an epsilon.
Truncating digits is not enough, given 2 ~= 1.9999999999999999
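The point about truncation can be sketched as follows. Both helpers (`approx_eq` and `truncate`) are hypothetical and are not the SLT engine's code; the relative tolerance of 1e-12 is an arbitrary choice for the example.

```rust
/// Relative comparison: true when `a` and `b` differ by at most `rel_eps`
/// times the larger magnitude. Hypothetical helper for illustration.
fn approx_eq(a: f64, b: f64, rel_eps: f64) -> bool {
    if a == b {
        return true; // also covers two zeros
    }
    (a - b).abs() <= rel_eps * a.abs().max(b.abs())
}

/// Cut `x` off after `digits` decimal places without rounding.
fn truncate(x: f64, digits: u32) -> f64 {
    // Build the power of ten by repeated multiplication (exact up to 1e22).
    let mut scale = 1.0_f64;
    for _ in 0..digits {
        scale *= 10.0;
    }
    (x * scale).trunc() / scale
}

fn main() {
    let a = 2.0_f64;
    let b = 1.99999999999999_f64; // differs from 2.0 by about 1e-14

    // Truncation keeps the two values apart...
    println!("{} vs {}", truncate(a, 12), truncate(b, 12));
    // ...while an epsilon comparison treats them as equal.
    println!("approx_eq: {}", approx_eq(a, b, 1e-12));
}
```

A value just below a digit boundary truncates to a different string than its neighbour just above it, no matter how many digits are kept, whereas the epsilon check only looks at the distance between the two values.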

And we need the results of sqllogictest complete mode to be the same across different platforms.
So the old scheme of rounding half to a certain precision seems to be a good solution.

findepi · 2024-11-29T13:21:35Z

So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it, it's what rustc uses when targeting WASM.

Why would we want to use it?
Are we changing the implementation just to make the tests easier to write?
Exact double comparison is a test problem, not a product problem, and should be solved in the test layer.

jonahgao · 2024-11-29T15:12:32Z

If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

I think portability is not necessary, and PostgreSQL doesn't guarantee it either.
I prefer to keep using std, as it should be more mature.

leoyvens · 2024-11-29T15:48:14Z

There are myths and truths to floating-point reproducibility across platforms. Some facts I've gathered while working on this:

1. f32 and f64 in Rust follow IEEE-754.
2. The IEEE-754 basic arithmetic operations are reproducible.
3. All modern hardware correctly implements IEEE-754.
4. Floating-point arithmetic is not associative.
5. The floating-point functions fall under "Recommended Operations" in IEEE-754. These seem to be non-compliant across libc and hardware implementations.
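Non-associativity is easy to see with three ordinary decimal values. A minimal sketch (the function names are made up for illustration):

```rust
/// Evaluate (a + b) + c.
fn left_first(a: f64, b: f64, c: f64) -> f64 {
    (a + b) + c
}

/// Evaluate a + (b + c).
fn right_first(a: f64, b: f64, c: f64) -> f64 {
    a + (b + c)
}

fn main() {
    let (a, b, c) = (0.1_f64, 0.2_f64, 0.3_f64);
    let left = left_first(a, b, c);
    let right = right_first(a, b, c);

    println!("(a + b) + c = {left}");
    println!("a + (b + c) = {right}");
    println!("equal: {}", left == right);
}
```

Each individual addition is correctly rounded and reproducible, yet the two groupings round at different points and end up one step apart.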

For many real-world DataFusion use cases, floating-point operations are reproducible. In my use case, I care about reproducibility so it's not just a tests problem, it's a product problem that I'd like the tests to cover.

Epsilon comparison should be used for any test that surfaces a concrete reproducibility problem due to point 4, non-associativity. So far, the only such situation surfaced by local and CI testing was the tests for the regr_* functions.

On the libm question, I'm proposing it because it solves a problem for me. Otherwise I might have to redefine portable versions of those UDFs for my application. Maturity is a valid question. libm is under the rust-lang organization, it is used by rustc for WASM and has the goal of eventually being made part of std::core. It is tested by comparing outputs to musl, so output quality can be considered to have parity with musl. If libm conflicts with other DataFusion use cases, or if general prudence dictates that we stick with std, I will follow the decision made by the DataFusion maintainers.

findepi · 2024-11-29T17:16:47Z

The IEEE-754 basic arithmetic operations are reproducible.
...

Floating-point arithmetic is not associative.

These two points imply that a database's floating-point arithmetic results are not supposed to be reproducible.
Taking sum(a) for example -- a database is free to parallelize the work if it wishes, and thus input values may end up being added on the CPU in a different order or in different groups altogether. This is emphasized by the fact that in SQL, the input tables have no intrinsic ordering.

Let's consider an example

CREATE OR REPLACE TABLE doubles(a double);
INSERT INTO doubles VALUES (1e30);
INSERT INTO doubles VALUES (-1e30);
INSERT INTO doubles VALUES (1);
INSERT INTO doubles VALUES (-1);
SELECT sum(a) FROM doubles;

If we now run

SELECT sum(a) FROM doubles;

the database can return 0, 1 or -1.
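The three possible answers can be reproduced outside a database by adding the same four values in different orders. This is only a sketch of the arithmetic; `sum_in_order` is a hypothetical helper and says nothing about how any particular engine schedules its work.

```rust
/// Sum the values strictly left to right, as one single-threaded pass would.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

fn main() {
    // The same four rows as in the table above, visited in three orders.
    let orders: [[f64; 4]; 3] = [
        [1e30, -1e30, 1.0, -1.0],
        [1e30, 1.0, -1e30, -1.0],
        [1e30, -1.0, -1e30, 1.0],
    ];
    for order in &orders {
        println!("{:?} -> {}", order, sum_in_order(order));
    }
}
```

Whenever 1 or -1 is added to 1e30 before the large values cancel, it is absorbed by rounding, so only the small value added afterwards survives.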

alamb · 2024-11-30T12:34:16Z

If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

I think portability is not necessary, and PostgreSQL doesn't guarantee it either. I prefer to keep using std, as it should be more mature.

I agree with @jonahgao and @findepi -- ensuring exact floating point reproducibility is not something most database systems do, I think due to the (performance) cost of doing so

Floating-point arithmetic is not associative.

I think this is the core challenge: getting reproducible results on a multi-threaded processing system like DataFusion will be hard without major changes. Ensuring the same floating-point results requires ensuring the same order of calculation.

It implies, for example, that the order of processing intermediate aggregate rows must be the same, even if one core finishes before another.

So the TL;DR is:

I think just changing to a different math library is unlikely to be enough
If you need reproducible values, I think you might be able to use ordering to achieve it (e.g. order by the group keys in grouping, etc.)
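The ordering idea can be illustrated in plain Rust: if the inputs are put into a canonical order before they are combined, the result no longer depends on arrival order. This is a sketch of the principle only, not of DataFusion's aggregation code; `sum_sorted` is a hypothetical helper, and sorting by value stands in for ordering by group keys.

```rust
/// Sum after sorting into a canonical order, so the result does not depend
/// on the order in which the inputs arrived. Hypothetical helper.
fn sum_sorted(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    // total_cmp gives a total order over all f64 values, including NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    sorted.iter().fold(0.0, |acc, v| acc + v)
}

fn main() {
    // Two arrival orders of the same values.
    let first = [1e30, 1.0, -1e30, -1.0];
    let second = [-1.0, 1e30, 1.0, -1e30];

    println!("{}", sum_sorted(&first));
    println!("{}", sum_sorted(&second));
}
```

The result is reproducible but not more accurate, and the sort has a cost, which is the trade-off mentioned above.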

alamb

Thank you very much for this contribution @leoyvens -- it is great to see you tackling and working on improving the tests. 🙏

alamb · 2024-11-30T12:34:48Z

Cargo.toml

@@ -92,7 +92,6 @@ arrow-ipc = { version = "53.3.0", default-features = false, features = [
 arrow-ord = { version = "53.3.0", default-features = false }
 arrow-schema = { version = "53.3.0", default-features = false }
 async-trait = "0.1.73"
-bigdecimal = "0.4.6"


It is great to remove bigdecimal

alamb · 2024-11-30T12:35:19Z

datafusion/sqllogictest/test_files/aggregate_skip_partial.slt

-3 1956035476 9.590891361237
-4 16155718643 9.531112968922
-5 6449337880 7.074412226677
+1 -438598674 12.15325379371643


And we need the results of sqllogictest complete mode to be the same across different platforms. So the old scheme of rounding half to a certain precision seems to be a good solution.

I agree

github-actions bot added the core and sqllogictest labels Nov 28, 2024

leoyvens force-pushed the float-full-prec branch from 44fec95 to 9942c60 Compare November 28, 2024 10:44

leoyvens force-pushed the float-full-prec branch from 9942c60 to 70e7e62 Compare November 28, 2024 11:21

github-actions bot added the functions label Nov 29, 2024

leoyvens added 2 commits November 29, 2024 12:52

More rigorous treatment of floats in tests

5d1975e

switch float math udfs from std to libm

0720d0b

leoyvens force-pushed the float-full-prec branch from 4ca9592 to bb9dce5 Compare November 29, 2024 11:52

cargo update datafusion-cli

a456db3

leoyvens force-pushed the float-full-prec branch from bb9dce5 to a456db3 Compare November 29, 2024 13:03

findepi reviewed Nov 29, 2024

fix tpc-h slt for non-truncated floats

0c83406

alamb reviewed Nov 30, 2024
