Fix problem with percentage_central argument in check_outliers() with MCD method #673
Codecov Report
Attention:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
+ Coverage   55.93%   55.95%   +0.02%
==========================================
  Files          84       84
  Lines        5996     5999       +3
==========================================
+ Hits         3354     3357       +3
  Misses       2642     2642
Errors appear related to
@bwiernik I fixed the bug that
The lower the value, the more outliers are detected. Since I think the method flags too many outliers anyway, should we stick to the higher value as the default?
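To illustrate the trade-off, here is a minimal sketch. It does not call check_outliers(). It assumes that the value under discussion is the fraction of central observations handed to the MCD estimator, and it approximates that with MASS::cov.mcd() and its quantile.used argument. The helper count_flagged(), the fractions 0.50, 0.66 and 0.75, the seed, and the 0.999 chi-squared cutoff are all assumptions made for illustration, not the package's implementation.

```r
# Sketch only: a stand-in for the MCD check, not performance's own code.
set.seed(123)  # arbitrary seed, chosen for reproducibility
n <- 200
p <- 5
x <- MASS::mvrnorm(n, rep(0, p), diag(p))

# Hypothetical helper: fit the MCD on a given fraction of central cases,
# then count rows whose robust squared Mahalanobis distance exceeds the
# 0.999 chi-squared quantile with p degrees of freedom.
count_flagged <- function(x, fraction) {
  fit <- MASS::cov.mcd(x, quantile.used = floor(fraction * nrow(x)))
  d2 <- stats::mahalanobis(x, center = fit$center, cov = fit$cov)
  sum(d2 > stats::qchisq(0.999, df = ncol(x)))
}

fractions <- c(0.50, 0.66, 0.75)
flagged <- vapply(fractions, function(f) count_flagged(x, f), numeric(1))
print(data.frame(fraction = fractions, flagged = flagged))
```

On clean multivariate-normal data like this, the counts give a rough sense of how many false positives each fraction produces. The exact numbers depend on the seed and on the resampling inside cov.mcd().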
Note that the message is printed before the results.
# Simulate N rows of p uncorrelated standard-normal variables, run the MCD
# outlier check, and print the data dimensions followed by the result.
n_outliers_MCD <- function(N, p) {
  data <- data.frame(MASS::mvrnorm(N, rep(0, p), diag(p)))
  results <- performance::check_outliers(data, method = "mcd")
  cat(sprintf("\nNumber of rows: %d, Number of columns: %d\n", nrow(data), ncol(data)))
  print(results)
  cat("\n")
}
# Every combination of sample size (N) and number of variables (p)
grid <- expand.grid(
  N = c(50, 100, 150, 200, 300, 500),
  p = c(5, 10, 15, 20, 25)
)
mapply(n_outliers_MCD, N = grid$N, p = grid$p)
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 50, Number of columns: 5
#> 5 outliers detected: cases 3, 4, 5, 11, 37.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#>
#>
#> Number of rows: 100, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#>
#>
#>
#> Number of rows: 150, Number of columns: 5
#> 1 outlier detected: case 22.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#>
#>
#> Number of rows: 200, Number of columns: 5
#> 1 outlier detected: case 189.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#>
#>
#> Number of rows: 300, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#>
#>
#>
#> Number of rows: 500, Number of columns: 5
#> 1 outlier detected: case 465.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 50, Number of columns: 10
#> 5 outliers detected: cases 20, 22, 26, 29, 42.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 100, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#>
#>
#>
#> Number of rows: 150, Number of columns: 10
#> 1 outlier detected: case 82.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#>
#>
#> Number of rows: 200, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#>
#>
#>
#> Number of rows: 300, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#>
#>
#>
#> Number of rows: 500, Number of columns: 10
#> 4 outliers detected: cases 93, 180, 417, 497.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 50, Number of columns: 15
#> 12 outliers detected: cases 1, 6, 11, 14, 15, 18, 21, 24, 25, 30, 33,
#> 35.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 100, Number of columns: 15
#> 4 outliers detected: cases 2, 71, 76, 96.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 150, Number of columns: 15
#> 1 outlier detected: case 29.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#>
#>
#> Number of rows: 200, Number of columns: 15
#> 5 outliers detected: cases 24, 48, 54, 98, 146.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#>
#>
#> Number of rows: 300, Number of columns: 15
#> 5 outliers detected: cases 23, 30, 175, 230, 240.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#>
#>
#> Number of rows: 500, Number of columns: 15
#> 2 outliers detected: cases 53, 314.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 50, Number of columns: 20
#> 12 outliers detected: cases 2, 3, 5, 6, 13, 19, 23, 27, 29, 33, 36, 45.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 100, Number of columns: 20
#> 19 outliers detected: cases 4, 5, 8, 10, 29, 45, 50, 51, 55, 60, 66, 77,
#> 78, 81, 85, 90, 93, 94, 96.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 150, Number of columns: 20
#> 6 outliers detected: cases 27, 37, 82, 119, 127, 130.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 200, Number of columns: 20
#> 3 outliers detected: cases 42, 145, 197.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#>
#>
#> Number of rows: 300, Number of columns: 20
#> 3 outliers detected: cases 4, 146, 261.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#>
#>
#> Number of rows: 500, Number of columns: 20
#> 3 outliers detected: cases 226, 241, 424.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 50, Number of columns: 25
#> 11 outliers detected: cases 1, 2, 7, 20, 24, 31, 33, 38, 41, 45, 47.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 100, Number of columns: 25
#> 21 outliers detected: cases 1, 3, 8, 9, 10, 15, 19, 28, 32, 40, 49, 56,
#> 60, 62, 63, 73, 76, 80, 92, 94, 99.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 150, Number of columns: 25
#> 12 outliers detected: cases 28, 29, 78, 83, 86, 101, 103, 115, 128, 136,
#> 146, 148.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#> data for MCD to be reliable.
#>
#> Number of rows: 200, Number of columns: 25
#> 6 outliers detected: cases 12, 40, 51, 90, 94, 155.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#>
#>
#> Number of rows: 300, Number of columns: 25
#> 3 outliers detected: cases 15, 161, 247.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#>
#>
#> Number of rows: 500, Number of columns: 25
#> 6 outliers detected: cases 85, 171, 259, 313, 351, 390.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#> X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
Created on 2024-02-04 with reprex v2.1.0
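The thresholds reported in the output above look like chi-squared quantiles with one degree of freedom per variable. The base-R sketch below checks that reading. The 0.999 probability is an assumption inferred from the printed values, not something stated in this thread.

```r
# Assumed cutoff: 0.999 quantile of a chi-squared distribution with
# df = number of columns (p). Inferred from the printed thresholds.
p <- c(5, 10, 15, 20, 25)
thresholds <- stats::qchisq(0.999, df = p)
print(data.frame(p = p, threshold = round(thresholds, 3)))
```

For 5 and 25 columns this gives about 20.515 and 52.62, matching the printed thresholds. For 10, 15 and 20 columns the printed values (30, 40, 50) are round numbers, so they may have been rounded for display. That is an inference from the output only.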
@strengejacke I think this should be a proper warning.
@rempsyc You may want to add one or two sentences to the vignette, similar to what we did in our paper. That's not urgent, though, and could go in a different PR.
Thanks! The remaining failing tests should not be related to this PR.
Fixes #672
percentage_central now works, as discussed in #672.
Reprex:
Created on 2024-01-24 with reprex v2.0.2