Fix problem with percentage_central argument in check_outliers() with MCD method #673

Merged
strengejacke merged 18 commits into main from check_outliers_mcd_percentage_central
Feb 4, 2024

Conversation

rempsyc
Member

@rempsyc rempsyc commented Jan 24, 2024

Fixes #672

percentage_central now works, as discussed in #672.


Reprex:

library(performance)
alpha <- 0.1
check_outliers(mtcars, method = "mcd", 
               percentage_central = .50,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 15

check_outliers(mtcars, method = "mcd", 
               percentage_central = .75,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 10

check_outliers(mtcars, method = "mcd", 
               percentage_central = .25,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> Error in MASS::cov.rob(x, quantile.used = percentage_central * nrow(x), : 'quantile' must be at least 12

Created on 2024-01-24 with reprex v2.0.2
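
The error in the last call can be traced with a little arithmetic. The error message shows that `percentage_central * nrow(x)` is passed to `MASS::cov.rob()` as `quantile.used`, and the reported minimum of 12 is consistent with `ncol(mtcars) + 1`. The sketch below assumes that "number of columns plus one" is the general lower bound; that rule is inferred from the message, not taken from this PR.

```r
# mtcars ships with base R (datasets package): 32 rows, 11 columns
n <- nrow(mtcars)
p <- ncol(mtcars)

percentage_central <- c(0.25, 0.50, 0.75)

# Number of rows handed to MASS::cov.rob() as quantile.used
quantile_used <- percentage_central * n

# Assumed lower bound, inferred from "'quantile' must be at least 12"
min_quantile <- p + 1

feasible <- quantile_used >= min_quantile
data.frame(percentage_central, quantile_used, min_quantile, feasible)
```

With 32 rows, `percentage_central = .25` asks for only 8 rows, which is below the bound, while .50 and .75 ask for 16 and 24.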


codecov bot commented Jan 24, 2024

Codecov Report

Attention: 38 lines in your changes are missing coverage. Please review.

Comparison is base (9760393) 55.93% compared to head (388609a) 55.95%.
Report is 1 commits behind head on main.

❗ Current head 388609a differs from pull request most recent head 092a6a3. Consider uploading reports for the commit 092a6a3 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| R/icc.R | 47.05% | 9 Missing ⚠️ |
| R/r2_coxsnell.R | 33.33% | 8 Missing ⚠️ |
| R/test_bf.R | 22.22% | 7 Missing ⚠️ |
| R/r2_loo.R | 76.00% | 6 Missing ⚠️ |
| R/r2.R | 0.00% | 3 Missing ⚠️ |
| R/check_model.R | 60.00% | 2 Missing ⚠️ |
| R/test_likelihoodratio.R | 81.81% | 2 Missing ⚠️ |
| R/test_performance.R | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
+ Coverage   55.93%   55.95%   +0.02%     
==========================================
  Files          84       84              
  Lines        5996     5999       +3     
==========================================
+ Hits         3354     3357       +3     
  Misses       2642     2642              


@rempsyc
Member Author

rempsyc commented Jan 25, 2024

The errors appear to be related to the Matrix package, not to this minimal PR.

@rempsyc rempsyc requested a review from bwiernik January 25, 2024 09:02
@rempsyc
Member Author

rempsyc commented Jan 25, 2024

@bwiernik I fixed the bug where percentage_central was being overwritten for the MCD method in check_outliers(). As a result, instead of the forced .66, it fell back to the default of .50. I changed the default to .75 to be consistent with Leys' 2018 recommendations, but let me know if you think it should really be .66 and not .75. If we differ from the recommendations, we might have to explain this choice in the paper.
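
To make the role of percentage_central concrete, here is a minimal sketch of how the share of "central" rows feeds into an MCD fit and the resulting robust distances. It calls `MASS::cov.rob()` directly with `quantile.used = percentage_central * nrow(x)`, as the error message in the reprex suggests. The helper `mcd_distances()` is hypothetical and is not the package's internal code, so its counts need not match those from check_outliers().

```r
library(MASS)

# Hypothetical helper, for illustration only: robust Mahalanobis distances
# based on the center and covariance returned by an MCD fit that uses
# `percentage_central` of the rows.
mcd_distances <- function(x, percentage_central = 0.75) {
  x <- as.matrix(x)
  fit <- MASS::cov.rob(
    x,
    quantile.used = percentage_central * nrow(x),
    method = "mcd"
  )
  stats::mahalanobis(x, center = fit$center, cov = fit$cov)
}

set.seed(123) # cov.rob() may sample subsets at random
alpha <- 0.1
threshold <- stats::qchisq(p = 1 - alpha, df = ncol(mtcars))

d75 <- mcd_distances(mtcars, percentage_central = 0.75)
n_flagged <- sum(d75 > threshold)
n_flagged
```

Rerunning with a different `percentage_central` is one way to see how sensitive the flagged count is to this choice.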

@rempsyc rempsyc requested a review from mattansb February 2, 2024 07:03
@strengejacke
Member

The lower the value, the more outliers are detected. Since I think the method flags too many outliers anyway, should we stick to the higher value as the default?

@strengejacke
Member

strengejacke commented Feb 4, 2024

Note that the message is printed before the results.

n_outliers_MCD <- function(N, p) {
  data <- data.frame(MASS::mvrnorm(N, rep(0, p), diag(p)))
  results <- performance::check_outliers(data, method = "mcd")
  cat(sprintf("\nNumber of rows: %d, Number of columns: %d\n", nrow(data), ncol(data)))
  print(results)
  cat("\n")
}

grid <- expand.grid(
  N = c(50, 100, 150, 200, 300, 500),
  p = c(5, 10, 15, 20, 25)
)

mapply(n_outliers_MCD, N = grid$N, p = grid$p)
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 5
#> 5 outliers detected: cases 3, 4, 5, 11, 37.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 100, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#> 
#> 
#> 
#> Number of rows: 150, Number of columns: 5
#> 1 outlier detected: case 22.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 200, Number of columns: 5
#> 1 outlier detected: case 189.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 300, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#> 
#> 
#> 
#> Number of rows: 500, Number of columns: 5
#> 1 outlier detected: case 465.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 10
#> 5 outliers detected: cases 20, 22, 26, 29, 42.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 150, Number of columns: 10
#> 1 outlier detected: case 82.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> 
#> 
#> Number of rows: 200, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 300, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 500, Number of columns: 10
#> 4 outliers detected: cases 93, 180, 417, 497.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 15
#> 12 outliers detected: cases 1, 6, 11, 14, 15, 18, 21, 24, 25, 30, 33,
#>   35.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 15
#> 4 outliers detected: cases 2, 71, 76, 96.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 15
#> 1 outlier detected: case 29.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 200, Number of columns: 15
#> 5 outliers detected: cases 24, 48, 54, 98, 146.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 300, Number of columns: 15
#> 5 outliers detected: cases 23, 30, 175, 230, 240.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 500, Number of columns: 15
#> 2 outliers detected: cases 53, 314.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 20
#> 12 outliers detected: cases 2, 3, 5, 6, 13, 19, 23, 27, 29, 33, 36, 45.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 20
#> 19 outliers detected: cases 4, 5, 8, 10, 29, 45, 50, 51, 55, 60, 66, 77,
#>   78, 81, 85, 90, 93, 94, 96.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 20
#> 6 outliers detected: cases 27, 37, 82, 119, 127, 130.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 200, Number of columns: 20
#> 3 outliers detected: cases 42, 145, 197.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> 
#> 
#> Number of rows: 300, Number of columns: 20
#> 3 outliers detected: cases 4, 146, 261.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> 
#> 
#> Number of rows: 500, Number of columns: 20
#> 3 outliers detected: cases 226, 241, 424.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 25
#> 11 outliers detected: cases 1, 2, 7, 20, 24, 31, 33, 38, 41, 45, 47.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 25
#> 21 outliers detected: cases 1, 3, 8, 9, 10, 15, 19, 28, 32, 40, 49, 56,
#>   60, 62, 63, 73, 76, 80, 92, 94, 99.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 25
#> 12 outliers detected: cases 28, 29, 78, 83, 86, 101, 103, 115, 128, 136,
#>   146, 148.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 200, Number of columns: 25
#> 6 outliers detected: cases 12, 40, 51, 90, 94, 155.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> 
#> 
#> Number of rows: 300, Number of columns: 25
#> 3 outliers detected: cases 15, 161, 247.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> 
#> 
#> Number of rows: 500, Number of columns: 25
#> 6 outliers detected: cases 85, 171, 259, 313, 351, 390.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.

Created on 2024-02-04 with reprex v2.1.0
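
A note on the thresholds shown in the output above: the values 20.515 (5 columns) and 52.62 (25 columns) match a chi-square quantile at alpha = 0.001 with degrees of freedom equal to the number of columns. The sketch below recomputes them under that assumption; the 30, 40 and 50 shown for 10, 15 and 20 columns appear to be rounded for display, which is a guess based on the printed output only.

```r
# Number of columns used in the simulation grid
n_cols <- c(5, 10, 15, 20, 25)

# Assumed default threshold: chi-square quantile at alpha = 0.001,
# with one degree of freedom per column
thresholds <- stats::qchisq(p = 1 - 0.001, df = n_cols)

data.frame(n_cols, threshold = round(thresholds, 3))
```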

@mattansb
Member

mattansb commented Feb 4, 2024

@strengejacke I think this should be a proper warning.
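
For readers less familiar with the distinction: in R, `message()` and `warning()` signal different condition classes, so callers handle them with different tools (`suppressMessages()` versus `suppressWarnings()`, or separate handlers in `tryCatch()`). The sketch below uses a hypothetical checker, `check_mcd_sample()`, which is not the package's code. Its rule of at most 10 rows per column is an assumption read off the simulation output above, not a documented criterion.

```r
# Hypothetical checker, for illustration only
check_mcd_sample <- function(n_rows, n_cols, as_warning = TRUE) {
  # Assumed rule of thumb: flag when there are at most 10 rows per column
  unreliable <- n_rows <= 10 * n_cols
  if (unreliable) {
    txt <- "Sample size is too small or number of variables is too high for MCD to be reliable."
    if (as_warning) {
      warning(txt, call. = FALSE)
    } else {
      message(txt)
    }
  }
  invisible(unreliable)
}

# Report which kind of condition was signalled
classify <- function(expr) {
  tryCatch(
    {
      expr
      "no condition"
    },
    warning = function(w) "warning",
    message = function(m) "message"
  )
}

as_warn <- classify(check_mcd_sample(50, 5, as_warning = TRUE))
as_msg  <- classify(check_mcd_sample(50, 5, as_warning = FALSE))
quiet   <- classify(check_mcd_sample(500, 5))
c(as_warn, as_msg, quiet)
```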

@strengejacke
Member

@rempsyc You may want to add one or two sentences to the vignette, similar to what we did in our paper. That's not urgent, though; it could go in a different PR.

@strengejacke
Member

Thanks! Remaining failing tests should not be related to this PR.

@strengejacke strengejacke merged commit e946088 into main Feb 4, 2024
16 of 25 checks passed
@strengejacke strengejacke deleted the check_outliers_mcd_percentage_central branch February 4, 2024 20:26
Successfully merging this pull request may close these issues.

Investigating the high % of outliers detected with the MCD method
3 participants