I propose a different criterion for determining whether tests pass or fail. As it stands, this applies mostly to validation testing, since I believe most unit tests determine their criteria internally with `ASSERT()` and the like.
My suggestion would be to employ relative error, or another similarly scale-aware calculation, for the pass/fail criteria in our tests. One bonus would be that the magnitude of errors could be compared more easily between quantities within a test or across tests. Some have mentioned that employing only relative error may not be the correct choice, so a good option may be to calculate both absolute and relative error and throw an error, or print a warning/error message, if only one of the two fails. This would require two tolerances in many cases, but that wouldn't be a huge issue.
My first crack at employing relative error as the pass/fail criterion can be found in #331 and the associated `mam_x_validation` PR #72.
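To make the proposal concrete, here is a minimal sketch of what a combined absolute/relative check could look like. This is not the logic in #331; the function name, the two tolerances, and the warn-versus-fail behavior are all placeholders for discussion:

```python
import numpy as np

def check_field(ref, test, abs_tol, rel_tol):
    """Hypothetical combined pass/fail check using absolute and relative error."""
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)

    # Worst-case absolute difference over the output field.
    abs_err = np.max(np.abs(ref - test))
    # Scale by the largest magnitude appearing in either quantity.
    scale = max(np.max(np.abs(ref)), np.max(np.abs(test)))
    rel_err = abs_err / scale if scale > 0.0 else 0.0

    abs_ok = abs_err <= abs_tol
    rel_ok = rel_err <= rel_tol
    if abs_ok and rel_ok:
        return True
    if abs_ok != rel_ok:
        # Only one of the two criteria failed: warn for now
        # (or fail outright, depending on what we decide).
        print(f"WARNING: abs_err = {abs_err:.3e} (tol {abs_tol:.1e}), "
              f"rel_err = {rel_err:.3e} (tol {rel_tol:.1e})")
        return True
    return False
```

Whether the mixed case (one criterion passing, the other failing) should warn or fail is exactly the open question above.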
Currently, our validation testing employs the `compare_mam4xx_mam4.py` script, which determines pass/fail according to absolute L-norm errors (2, 1, $\infty$), with a tolerance defined for each test in the associated `CMakeLists.txt` file.
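For reference, the general shape of that criterion is roughly the following; this is only a sketch of the idea, not the actual implementation in `compare_mam4xx_mam4.py`:

```python
import numpy as np

# Rough sketch of an absolute L-norm criterion: the raw difference is
# reduced to a few norms and each is compared against a single absolute
# tolerance, with no reference to the magnitude of the data being compared.
def abs_norm_check(ref, test, tol):
    diff = (np.asarray(ref, dtype=float) - np.asarray(test, dtype=float)).ravel()
    norms = {
        "L1": np.linalg.norm(diff, ord=1),
        "L2": np.linalg.norm(diff, ord=2),
        "Linf": np.linalg.norm(diff, ord=np.inf),
    }
    return all(err <= tol for err in norms.values())
```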
The issue with this method is that a given test may have multiple output quantities whose values differ greatly in magnitude. Two examples follow of possible bugs that could arise under the current method (the snippet after the list reproduces the second one):
1. A test with mostly small output values defines the tolerance to be $1\text{e-}10$, and all of the small values have absolute errors below this tolerance. However, one output has a much larger magnitude: $a_1 = 1.234567831415926\text{e}16$ and $a_2 = 1.234567816180339\text{e}16$.
   The absolute error is $\mathcal{E}_{\text{abs}} = \vert a_1 - a_2 \vert = 152350.0$, while the relative error (scaling by the larger of the two quantities) is $\mathcal{E}_{\text{rel}} = \frac{\mathcal{E}_{\text{abs}}}{\max(a_1, a_2)} = 1.2340350101987489\text{e-}11$.
   Admittedly, this case is unlikely, but it would lead to a spurious failure.
2. A test with mostly "small" output values defines the tolerance to be $1\text{e-}10$, and most values have absolute errors below this tolerance. However, one output has a much smaller magnitude: $a_1 = 1.314159265358979\text{e-}11$ and $a_2 = 1.161803398874989\text{e-}11$.
   In this case, we have $\mathcal{E}_{\text{abs}} = 1.5235586648399005\text{e-}12$ and $\mathcal{E}_{\text{rel}} = 0.1159340960415267$.
   This would lead to a (possibly) spuriously passing test, and I would argue this case is more likely to occur.
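As a quick sanity check on the second case, a few lines of Python reproduce the quoted numbers (the $1\text{e-}10$ tolerance is the one assumed in the example above):

```python
# Reproduce the second case above: the comparison clears a 1e-10 absolute
# tolerance even though the two values disagree by more than 11%.
a1 = 1.314159265358979e-11
a2 = 1.161803398874989e-11
tol = 1.0e-10

e_abs = abs(a1 - a2)                    # ~1.52e-12, below the tolerance
e_rel = e_abs / max(abs(a1), abs(a2))   # ~0.116, i.e., >11% disagreement

print(f"abs error: {e_abs:.16e} -> {'pass' if e_abs <= tol else 'FAIL'}")
print(f"rel error: {e_rel:.16f}")
```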
There are several cases where I believe tests are incorrectly passing due to the second case above. See the following test outputs on the `mjs/largeRelError` branch for examples of tests with large relative errors that are passing on `main`.