Develop More Sophisticated Pass/Fail Criteria for Testing #346

Open
mjs271 opened this issue Sep 17, 2024 · 0 comments
mjs271 commented Sep 17, 2024

I propose a different criterion for determining whether tests pass or fail. As it stands, this applies mostly to validation testing, since I believe most unit tests determine their pass/fail criteria internally with ASSERT() and the like.

My suggestion would be to employ relative error, or another similarly scale-aware calculation, as the pass/fail criterion in our tests. One bonus is that the magnitudes of errors could then be compared more easily between quantities within a test or across tests. Some have mentioned that employing only relative error may not be the correct choice, so a good option may be to calculate both absolute and relative error and throw an error or print a warning/error message when only one of the two fails. This would require two tolerances in many cases, but that wouldn't be a huge issue.
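
As a minimal sketch of what such a combined check might look like (the function name, default tolerances, and warn-only behavior below are assumptions for illustration, not anything that exists in the repo):

```python
import numpy as np

def check_abs_and_rel(reference, computed, abs_tol=1e-10, rel_tol=1e-10):
    """Illustrative combined criterion: compute both absolute and relative
    error and flag the case where only one of the two checks fails."""
    reference = np.asarray(reference, dtype=float)
    computed = np.asarray(computed, dtype=float)

    abs_err = np.abs(reference - computed)
    # Scale by the larger magnitude of the two values, elementwise,
    # guarding against division by zero when both entries are exactly zero.
    scale = np.maximum(np.abs(reference), np.abs(computed))
    rel_err = np.divide(abs_err, scale, out=np.zeros_like(abs_err), where=scale > 0.0)

    abs_ok = bool(np.all(abs_err <= abs_tol))
    rel_ok = bool(np.all(rel_err <= rel_tol))

    if abs_ok != rel_ok:
        # Only one criterion failed -- print a warning/error message so the
        # discrepancy is visible, per the suggestion above.
        print(f"WARNING: absolute check {'passed' if abs_ok else 'FAILED'}, "
              f"relative check {'passed' if rel_ok else 'FAILED'} "
              f"(max abs err = {abs_err.max():.3e}, max rel err = {rel_err.max():.3e})")

    return abs_ok and rel_ok
```

Whether a mixed outcome (one check passing, the other failing) should fail the test outright or only warn is left open here, as in the text above.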

My first crack at employing relative error as the pass/fail criterion can be found in #331 and the associated mam_x_validation PR #72.

Currently, our validation testing employs the compare_mam4xx_mam4.py script, which determines pass/fail according to absolute L-norm errors ($L^2$, $L^1$, $L^\infty$) of the difference, with a tolerance defined for each test in the associated CMakeLists.txt file.
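
For reference, a rough sketch of what that criterion boils down to (this is a simplification for illustration, not the actual contents of compare_mam4xx_mam4.py):

```python
import numpy as np

def abs_norm_errors(reference, computed):
    """Simplified sketch of the current criterion: absolute L2, L1, and
    L-infinity norms of the difference between reference and computed
    output. The real logic lives in compare_mam4xx_mam4.py, with the
    tolerance set per test in CMakeLists.txt."""
    diff = np.atleast_1d(np.asarray(reference, dtype=float) - np.asarray(computed, dtype=float))
    return {
        "L2": np.linalg.norm(diff, ord=2),
        "L1": np.linalg.norm(diff, ord=1),
        "Linf": np.linalg.norm(diff, ord=np.inf),
    }

def current_criterion_passes(reference, computed, tol):
    # A single absolute tolerance is applied to all three norms.
    return all(err <= tol for err in abs_norm_errors(reference, computed).values())
```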

The issue with this method is that a given test may have multiple output quantities whose values differ greatly in magnitude. Two examples follow of possible bugs that could be induced under the current method (a short numerical sketch reproducing the figures appears after the list):

  1. A test with mostly small output values defines the tolerance to be $1\text{e-}10$, and all of the small values have absolute errors below this value. However, one output has a much larger magnitude, e.g., $a_1 = 1.234567831415926\text{e}16$ and $a_2 = 1.234567816180339\text{e}16$.
    • The absolute error is $\mathcal{E}_{\text{abs}} = \vert a_1 - a_2 \vert = 152355870.0$, while the relative difference (scaling by the larger of the two values) is
      $\mathcal{E}_{\text{rel}} = \frac{\mathcal{E}_{\text{abs}}}{\max(a_1, a_2)} \approx 1.2341\text{e-}8$
    • Admittedly, this case is unlikely, but it would lead to a spurious failure: the two values agree to roughly eight significant digits, yet the absolute error dwarfs the $1\text{e-}10$ tolerance.
  2. A test with mostly "small" output values defines the tolerance to be $1\text{e-}10$, and most values have absolute errors below this value. However, one output has a much smaller magnitude, e.g., $a_1 = 1.314159265358979\text{e-}11$ and $a_2 = 1.161803398874989\text{e-}11$.
    • In this case, we have $\mathcal{E}_{\text{abs}} = 1.5235586648399005\text{e-}12$ and $\mathcal{E}_{\text{rel}} = 0.1159340960415267$.
    • This would lead to a (possibly) spuriously passing test, and I would argue this case is more likely to occur: a relative difference of roughly 12% goes unnoticed because the values themselves sit far below the absolute tolerance.
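
Both examples can be reproduced with a few lines of Python (values copied from the list above; the variable names are just for illustration):

```python
# Example 1: large-magnitude values, absolute tolerance 1e-10
a1, a2 = 1.234567831415926e16, 1.234567816180339e16
abs_err_1 = abs(a1 - a2)             # 152355870.0 -- far above the 1e-10 absolute tolerance
rel_err_1 = abs_err_1 / max(a1, a2)  # ~1.234e-8   -- the values agree to ~8 significant digits

# Example 2: small-magnitude values, absolute tolerance 1e-10
b1, b2 = 1.314159265358979e-11, 1.161803398874989e-11
abs_err_2 = abs(b1 - b2)             # ~1.5236e-12 -- comfortably below the 1e-10 absolute tolerance
rel_err_2 = abs_err_2 / max(b1, b2)  # ~0.1159     -- an ~11.6% relative difference slips through
```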

There are several cases where I believe tests are incorrectly passing due to the latter scenario above. See the test outputs on the mjs/largeRelError branch for examples of tests with large relative errors that currently pass on main.
