Fix problem with `percentage_central` argument in `check_outliers()` with MCD method #673

rempsyc · 2024-01-24T17:07:32Z

Fixes #672

`percentage_central` now works, as discussed in #672.

Reprex:

library(performance)
alpha <- 0.1
check_outliers(mtcars, method = "mcd", 
               percentage_central = .50,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 15

check_outliers(mtcars, method = "mcd", 
               percentage_central = .75,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> [1] 10

check_outliers(mtcars, method = "mcd", 
               percentage_central = .25,
               threshold = stats::qchisq(p = 1 - alpha, df = ncol(mtcars))) |>
  which() |>
  length()
#> Error in MASS::cov.rob(x, quantile.used = percentage_central * nrow(x), : 'quantile' must be at least 12

^{Created on 2024-01-24 with reprex v2.0.2}
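The error in the last call comes from the size of the central subset that is passed on as `quantile.used = percentage_central * nrow(x)`. The following is a minimal base-R sketch of that arithmetic for `mtcars`. It assumes that the lower bound of 12 in the error message corresponds to `ncol(mtcars) + 1`; that is an inference from the message, not something confirmed in this thread, and `min_quantile` is a name made up for illustration.

```r
# Arithmetic behind the 'quantile' error, using only base R and built-in data.
n <- nrow(mtcars)  # number of observations
p <- ncol(mtcars)  # number of variables

percentage_central <- c(0.25, 0.50, 0.75)

# Size of the central subset requested from the robust covariance estimator
quantile_used <- percentage_central * n

# Assumption: the minimum reported in the error message equals p + 1
min_quantile <- p + 1

data.frame(
  percentage_central,
  quantile_used,
  large_enough = quantile_used >= min_quantile
)
```

Under that assumption, only `percentage_central = .25` falls below the minimum, which matches the three calls in the reprex.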

codecov · 2024-01-24T17:18:56Z

Codecov Report

Attention: 38 lines in your changes are missing coverage. Please review.

Comparison is base (9760393) 55.93% compared to head (388609a) 55.95%.
Report is 1 commit behind head on main.

❗ Current head 388609a differs from pull request most recent head 092a6a3. Consider uploading reports for the commit 092a6a3 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| R/icc.R | 47.05% | 9 Missing ⚠️ |
| R/r2_coxsnell.R | 33.33% | 8 Missing ⚠️ |
| R/test_bf.R | 22.22% | 7 Missing ⚠️ |
| R/r2_loo.R | 76.00% | 6 Missing ⚠️ |
| R/r2.R | 0.00% | 3 Missing ⚠️ |
| R/check_model.R | 60.00% | 2 Missing ⚠️ |
| R/test_likelihoodratio.R | 81.81% | 2 Missing ⚠️ |
| R/test_performance.R | 87.50% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
+ Coverage   55.93%   55.95%   +0.02%     
==========================================
  Files          84       84              
  Lines        5996     5999       +3     
==========================================
+ Hits         3354     3357       +3     
  Misses       2642     2642

rempsyc · 2024-01-25T09:01:53Z

The errors appear to be related to Matrix and not to this minimal PR.

rempsyc · 2024-01-25T09:04:31Z

@bwiernik I fixed the bug where `percentage_central` was being overwritten for the MCD method in `check_outliers()`. As a result, instead of the forced .66, it fell back to the default of .50. I changed the default to .75 to be consistent with Leys' 2018 recommendations, but let me know if you think it should really be .66 and not .75. If we deviate from the recommendations, we might have to explain this choice in the paper.

strengejacke · 2024-02-03T16:14:52Z

The lower the value, the more outliers are detected. Since I thought the method flags too many outliers anyway, shouldn't we stick with the higher value as the default?

strengejacke · 2024-02-04T10:59:57Z

Note that the message is printed before the results.

n_outliers_MCD <- function(N, p) {
  data <- data.frame(MASS::mvrnorm(N, rep(0, p), diag(p)))
  results <- performance::check_outliers(data, method = "mcd")
  cat(sprintf("\nNumber of rows: %d, Number of columns: %d\n", nrow(data), ncol(data)))
  print(results)
  cat("\n")
}

grid <- expand.grid(
  N = c(50, 100, 150, 200, 300, 500),
  p = c(5, 10, 15, 20, 25)
)

mapply(n_outliers_MCD, N = grid$N, p = grid$p)
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 5
#> 5 outliers detected: cases 3, 4, 5, 11, 37.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 100, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#> 
#> 
#> 
#> Number of rows: 150, Number of columns: 5
#> 1 outlier detected: case 22.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 200, Number of columns: 5
#> 1 outlier detected: case 189.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> 
#> 
#> Number of rows: 300, Number of columns: 5
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5
#> 
#> 
#> 
#> Number of rows: 500, Number of columns: 5
#> 1 outlier detected: case 465.
#> - Based on the following method and threshold: mcd (20.515).
#> - For variables: X1, X2, X3, X4, X5.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 10
#> 5 outliers detected: cases 20, 22, 26, 29, 42.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 150, Number of columns: 10
#> 1 outlier detected: case 82.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> 
#> 
#> Number of rows: 200, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 300, Number of columns: 10
#> OK: No outliers detected.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
#> 
#> 
#> 
#> Number of rows: 500, Number of columns: 10
#> 4 outliers detected: cases 93, 180, 417, 497.
#> - Based on the following method and threshold: mcd (30).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 15
#> 12 outliers detected: cases 1, 6, 11, 14, 15, 18, 21, 24, 25, 30, 33,
#>   35.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 15
#> 4 outliers detected: cases 2, 71, 76, 96.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 15
#> 1 outlier detected: case 29.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 200, Number of columns: 15
#> 5 outliers detected: cases 24, 48, 54, 98, 146.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 300, Number of columns: 15
#> 5 outliers detected: cases 23, 30, 175, 230, 240.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> 
#> 
#> Number of rows: 500, Number of columns: 15
#> 2 outliers detected: cases 53, 314.
#> - Based on the following method and threshold: mcd (40).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 20
#> 12 outliers detected: cases 2, 3, 5, 6, 13, 19, 23, 27, 29, 33, 36, 45.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 20
#> 19 outliers detected: cases 4, 5, 8, 10, 29, 45, 50, 51, 55, 60, 66, 77,
#>   78, 81, 85, 90, 93, 94, 96.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 20
#> 6 outliers detected: cases 27, 37, 82, 119, 127, 130.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 200, Number of columns: 20
#> 3 outliers detected: cases 42, 145, 197.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> 
#> 
#> Number of rows: 300, Number of columns: 20
#> 3 outliers detected: cases 4, 146, 261.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> 
#> 
#> Number of rows: 500, Number of columns: 20
#> 3 outliers detected: cases 226, 241, 424.
#> - Based on the following method and threshold: mcd (50).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 50, Number of columns: 25
#> 11 outliers detected: cases 1, 2, 7, 20, 24, 31, 33, 38, 41, 45, 47.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 100, Number of columns: 25
#> 21 outliers detected: cases 1, 3, 8, 9, 10, 15, 19, 28, 32, 40, 49, 56,
#>   60, 62, 63, 73, 76, 80, 92, 94, 99.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 150, Number of columns: 25
#> 12 outliers detected: cases 28, 29, 78, 83, 86, 101, 103, 115, 128, 136,
#>   146, 148.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> Sample size is too small resp. number of variables is too high in your
#>   data for MCD to be reliable.
#> 
#> Number of rows: 200, Number of columns: 25
#> 6 outliers detected: cases 12, 40, 51, 90, 94, 155.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> 
#> 
#> Number of rows: 300, Number of columns: 25
#> 3 outliers detected: cases 15, 161, 247.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.
#> 
#> 
#> Number of rows: 500, Number of columns: 25
#> 6 outliers detected: cases 85, 171, 259, 313, 351, 390.
#> - Based on the following method and threshold: mcd (52.62).
#> - For variables: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13,
#>   X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25.

^{Created on 2024-02-04 with reprex v2.1.0}

R/check_outliers.R

mattansb · 2024-02-04T11:47:22Z

@strengejacke I think this should be a proper warning.

strengejacke · 2024-02-04T17:27:21Z

@rempsyc You may want to add one or two sentences to the vignette, similar to what we did in our paper. That's not urgent, though; it could go in a different PR.

strengejacke · 2024-02-04T20:26:32Z

Thanks! Remaining failing tests should not be related to this PR.

fix problem with percentage_central argument in check_outliers with m…

605f9e3

…cd method

rempsyc requested a review from IndrajeetPatil January 25, 2024 09:00

rempsyc requested a review from bwiernik January 25, 2024 09:02

Merge branch 'main' into check_outliers_mcd_percentage_central

e451836

rempsyc requested a review from mattansb February 2, 2024 07:03

lintr, add test

a13c985

strengejacke added 4 commits February 3, 2024 22:18

lintr

13882ae

flag message

73e3428

fix

d40a452

fix

10a0c4e

message

9504f80

strengejacke added 4 commits February 4, 2024 16:40

warning instead msg

941da88

fix test

56eaf72

fix example

4ea505d

suppress warnings

7250bb7

rempsyc and others added 6 commits February 4, 2024 19:30

update warning message

bb17417

test, lintr

49b885f

lintr

435f975

lintr

388609a

minor fixes, lintr

f23bda8

version bump

092a6a3

strengejacke merged commit e946088 into main Feb 4, 2024
16 of 25 checks passed

strengejacke deleted the check_outliers_mcd_percentage_central branch February 4, 2024 20:26
