Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More rigorous treatment of floats in tests #13590

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

leoyvens
Copy link
Contributor

@leoyvens leoyvens commented Nov 28, 2024

Which issue does this PR close?

My motivation was to improve DF testing of float outputs, even at the least-significant digits.

The situation in #13569 seemed a bit uncomfortable, and generally the SLTs were formatting floats and decimal in complicated ways. So I took a shot at tackling it "the hard way", which involved changing virtually all SLTs that print out floats.

Rationale for this change

The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see, in datafusion-cli, when writing float outputs to CSV or when using arrow_cast to convert a float to string.

Tests do become more sensitive to minor changes in the handling of floats by DataFusion. But if the user will have to deal with it, then the tests should also have to deal with it.

The interaction with pg_compat tests is an interesting one. The Postgres avg over integers returns a numeric, while the DataFusion avg over integers returns a Float64. The SLTs previously dealt with this by rounding everything to 12 decimal digits. This PR deals with it by testing the value within an epsilon, with the justification of allowing us to entirely remove the dependency on the bigdecimal crate.

The regr_* family of UDAFs is also interesting. If the input is split across multiple batches, the output is not deterministic, so this PR also uses an epsilon there.

What changes are included in this PR?

This PR does code changes only to sqllogictest/src/engines/conversion.rs. But the fallout to .slt files makes the diff large.

  • SLTs now formats floats using ryu which is what arrow-rs uses, improving consistency with user-visible outputs.
  • pg_compat tests now test avg of integers within an epsilon, rather than relying on implicit truncation by the SLT engine.
  • Dependency on bigdecimal is removed (this was a test-only dependency).

Are these changes tested?

The changes are to the tests. Who tests the tests? 😄

Are there any user-facing changes?

No.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 28, 2024
@alamb
Copy link
Contributor

alamb commented Nov 28, 2024

Thank you @leoyvens -- this looks epic. I will review this PR but I may not have a chance to do so for a day or two. It looks awesome

@jonahgao
Copy link
Member

The rationale is that this makes the SLT test output closer to the output a DataFusion user would typically see, in datafusion-cli, when writing float outputs to CSV or when using arrow_cast to convert a float to string.

Make sense to me👍

@leoyvens
Copy link
Contributor Author

There was the following test failure on amd64 and win64:

External error: query result mismatch:
[SQL] select acos(0), acos(0.5), acos(1);
[Diff] (-expected|+actual)
-   1.5707963267948966 1.0471975511965976 0
+   1.5707963267948966 1.0471975511965979 0
at test_files/scalar.slt:93

This is surfacing non-determinism across target platforms. This is expected behaviour for Rust std. For acos, and many other float math functions, the Rust std docs say:

The precision of this function is non-deterministic.

So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it, it's what rustc uses when targeting WASM.

To gain confidence that we'd not be risking any significant performance regression, I analysed some benchmark results taken from the CI of the metallic-rs project (credit to @jdh8). It benchmarks only f32, not f64. From this data, I made a chart comparing std and libm results:

output

libm and std seem to have similar performance. If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

3 1956035476 9.590891361237
4 16155718643 9.531112968922
5 6449337880 7.074412226677
1 -438598674 12.15325379371643
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 digits is overspecified. double arithmetics is inherently imprecise and so we should compare with epsilon
truncating digits is not enough, given 2 ~= 1.9999999999999999

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we need the results of sqllogicaltest complete mode to be the same across different platforms.
So the old scheme of rounding half to a certain precision seems to be a good solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we need the results of sqllogicaltest complete mode to be the same across different platforms. So the old scheme of rounding half to a certain precision seems to be a good solution.

I agree

@findepi
Copy link
Member

findepi commented Nov 29, 2024

So I went looking to see if there was a performant but portable float math library we could use. libm seems to be it, it's what rustc uses when targeting WASM.

why would we want to use it?
are we changing the implementation just to write better test easier?
exact double comparisons is a test problem, not the product problem, and should be solved in the test layer.

@jonahgao
Copy link
Member

If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

I think portability is not necessary, and PostgreSQL doesn't guarantee it either.
I prefer to keep using std, as it should be more mature.

@leoyvens
Copy link
Contributor Author

leoyvens commented Nov 29, 2024

There are myths and truths to floating-point reproducibility across platforms. Some facts I've gathered while working on this:

  1. f32 and f64 in Rust follow IEEE-754.
  2. The IEEE-754 basic arithmetic operations are reproducible.
  3. All modern hardware correctly implements IEEE-754.
  4. Floating-point arithmetic is not associative.
  5. The floating point functions fall under "Recommended Operations" under IEEE-754. These seem to be non-compliant across libc and hardware implementations.

For many real-world DataFusion use cases, floating-point operations are reproducible. In my use case, I care about reproducibility so it's not just a tests problem, it's a product problem that I'd like the tests to cover.

Epsilon comparison should be used for any test that surfaces a concrete reproducibility problem due to point 4, non-associativity. So far, the only such situation surfaced by local and CI testing was the tests for the regr_* functions.

On the libm question, I'm proposing it because it solves a problem for me. Otherwise I might have to redefine portable versions of those UDFs for my application. Maturity is a valid question. libm is under the rust-lang organization, it is used by rustc for WASM and has the goal of eventually being made part of std::core. It is tested by comparing outputs to musl, so output quality can be considered to have parity with musl. If libm conflicts with other DataFusion use cases, or if general prudence dictates that we stick with std, I will follow the decision made by the DataFusion maintainers.

@findepi
Copy link
Member

findepi commented Nov 29, 2024

  • The IEEE-754 basic arithmetic operations are reproducible.
    ...
  • Floating-point arithmetic is not associative.

These two points imply that a database's Floating-point arithmetic results are not supposed to be reproducible.
Taking sum(a) for example -- a database is free to parallelize the work if it wishes so, and thus input values may end up being added on the CPU in different order or in different groups altogether. This is emphasized by the fact that in SQL, the input tables have no intrinsic ordering.

Let's consider an example

CREATE OR REPLACE TABLE doubles(a double);
INSERT INTO doubles VALUES (1e30);
INSERT INTO doubles VALUES (-1e30);
INSERT INTO doubles VALUES (1);
INSERT INTO doubles VALUES (-1);
SELECT sum(a) FROM doubles;

If we now run

SELECT sum(a) FROM doubles;

database can return 0, 1 or -1.

@alamb
Copy link
Contributor

alamb commented Nov 30, 2024

If we value portability, I'd propose that we switch to libm, which is what I've implemented in the second commit.

I think portability is not necessary, and PostgreSQL doesn't guarantee it either. I prefer to keep using std, as it should be more mature.

I agree with @jonahgao and @findepi -- ensuring exact floating point reproducibility is not something most database systems do, I think due to the (performance) cost of doing so

Floating-point arithmetic is not associative.

I think this is the core challenge of why getting reproduceable results on a multi-threaded processing system like DataFusion will be hard without major changes. To ensure the same floating point results requires ensuring the same order of calculation.

It implies, for example, that the order processing intermediate aggregate rows must be the same, even if one core is done before another.

So TLDR is

  1. I think just changing to a different math library is unlikely to be enough
  2. If you need reporoducable values I think you might be able to use ordering to achieve it (e.g. order by the group keys in grouing, etc)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this contribution @leoyvens -- it is great to see you tacking and working on improving the tests. 🙏

@@ -92,7 +92,6 @@ arrow-ipc = { version = "53.3.0", default-features = false, features = [
arrow-ord = { version = "53.3.0", default-features = false }
arrow-schema = { version = "53.3.0", default-features = false }
async-trait = "0.1.73"
bigdecimal = "0.4.6"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is great to remove bigdecimal

3 1956035476 9.590891361237
4 16155718643 9.531112968922
5 6449337880 7.074412226677
1 -438598674 12.15325379371643
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we need the results of sqllogicaltest complete mode to be the same across different platforms. So the old scheme of rounding half to a certain precision seems to be a good solution.

I agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants