Outlier detection in Linear mixed models failed? #711

Closed
tappituffi opened this issue Apr 16, 2024 · 5 comments · Fixed by #717
Labels
Reprex 📊 We need a reproducible example for further investigation

Comments

@tappituffi

Screenshot 2024-04-14 at 11 03 00 AM

Hi,
When I look at the outlier plot, some samples fall outside the contour lines but are not flagged in red the way the performance tutorial shows. Also, when I check the model with check_outliers(), it reports no outliers.
So my question is: why are those samples not flagged as outliers even though they lie outside the contour lines? Does this have something to do with my using mixed models, or is there another explanation?

Thanks in advance and best,
Niklas

@rempsyc
Member

rempsyc commented Apr 18, 2024

Dear Niklas, thank you very much for following up on this issue after your email. Could you include some example code for both check_model() and check_outliers() so that we can better investigate what is happening?

If you are using your own data, you probably cannot share it. However, you could share the code and the output it generated, which would give us an idea of the problem and let us test with a reproducible example on our end. Even better would be a reprex that uses base R data, for example the data included in the help() examples of the linear mixed model function you use.
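A minimal sketch of what such a reprex could look like, assuming the model function is lme4's lmer() and using the sleepstudy data that ships with lme4 (both the data set and the model formula are illustrative choices, not taken from the reporter's study):

```r
# Illustrative reprex skeleton: everything here runs without private data.
library(lme4)
library(performance)

# `sleepstudy` is bundled with lme4, so anyone can rerun this as-is.
model <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# The two calls under discussion in this issue.
outliers <- check_outliers(model)
outliers
```

Copying a script like this to the clipboard and calling reprex::reprex() should render the code together with its output, ready to paste into an issue.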

rempsyc added the "Reprex 📊 We need a reproducible example for further investigation" label Apr 18, 2024
@tappituffi
Author

tappituffi commented Apr 20, 2024

Hi,

I am not familiar with reprex etc., but I was able to write reproducible code that shows the same issue. When you run the code below, it generates the check_model() plots and shows a red dot only for observation 40. However, all the other points outside the contour line are not red, even though they should be outliers?
Also, when you run check_outliers(), it reports only observation 40 as an outlier, and when you run plot(outlier), it doesn't even show the dot for 40 in red. So I think this code reproduces the issue I have in my study. My question is: if points are outside the contour line, should they be regarded as outliers and therefore colored red, or not?
Thanks in advance!

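# Install necessary packages if they are not already installed
# (see is included because it is loaded below and provides the plot methods)
if (!requireNamespace("lme4", quietly = TRUE)) install.packages("lme4", dependencies = TRUE)
if (!requireNamespace("performance", quietly = TRUE)) install.packages("performance", dependencies = TRUE)
if (!requireNamespace("see", quietly = TRUE)) install.packages("see", dependencies = TRUE)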

# Load the packages
library(lme4)
library(performance)
library(see)

# Create the data frame
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)

# Introducing outliers
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen observations for outliers

data <- data.frame(subjectID, PN, alpha)

# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
plot(outlier)
check_model(model)

@rempsyc
Member

rempsyc commented Apr 28, 2024

Thank you for providing reproducible code. For future reference, I have turned it into a reprex using the reprex package, which also lets us see the visual output of your code:

# Load the packages
library(lme4)
library(performance)
library(see)

# Create the data frame
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)

# Introducing outliers
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen observations for outliers

data <- data.frame(subjectID, PN, alpha)

# Fit the mixed model
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
outlier <- check_outliers(model)
outlier
#> 1 outlier detected: case 40.
#> - Based on the following method and threshold: cook (0.7).
#> - For variable: (Whole model).
# plot(outlier)

x <- check_model(model)

plot1 <- plot(x$OUTLIERS)
plot1

plot2 <- plot(x)$OUTLIERS
plot2

Created on 2024-04-28 with reprex v2.1.0

@rempsyc
Member

rempsyc commented Apr 28, 2024

We can see that plot(x$OUTLIERS) (from check_outliers()) and plot(x)$OUTLIERS (from check_model()) give different results. This difference seems to come from the see package rather than the performance package. Specifically, it comes from the internally used function .plot_diag_outliers_new().

# Prepare model
library(see)
library(lme4)
library(performance)
set.seed(123)  # for reproducibility
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
alpha[c(20, 40)] <- c(150, 160)  # arbitrarily chosen observations for outliers
data <- data.frame(subjectID, PN, alpha)
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)
x <- check_model(model)

# These two functions provide different results:
# `check_model()`
see:::.plot_diag_outliers_new(x$INFLUENTIAL)

# `check_outliers()`
see:::plot.see_check_outliers(x$OUTLIERS)

# Methods don't agree on observation 40
x$INFLUENTIAL[40, ]
#>           Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025      0.9471816  65.41945  94.58055      94.58055    40
#>    Influential
#> 40 Influential

attributes(x$OUTLIERS)$influential_obs[40, ]
#>           Hat Cooks_Distance Predicted Residuals Std_Residuals Index
#> 40 0.04070025      0.9471816  65.41945  94.58055      94.58055    40
#>    Influential
#> 40          OK

# Even though it is identified correctly here:
attributes(x$OUTLIERS)$data[40, ]
#>    Row Distance_Cook Outlier_Cook Outlier
#> 40  40     0.9471816            1       1

# `check_model()`
attributes(x$INFLUENTIAL)$cook_levels
#> [1] 0.698073

As the Cook's distance of observation 40 is about 0.95, which is higher than the threshold of 0.7, it should be tagged as an outlier, so this appears to be an internal bug in check_outliers().

Created on 2024-04-28 with reprex v2.1.0
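As a cross-check that does not depend on the flagging or plotting code, the comparison described above can be sketched by hand. This assumes that cooks.distance() applied to the lmer fit returns the same distances that check_outliers() reports, and it hard-codes the threshold printed by attributes(x$INFLUENTIAL)$cook_levels in this thread:

```r
# Rebuild the model from the reprex and compare Cook's distance to the
# threshold manually, without check_model() or check_outliers().
library(lme4)

set.seed(123)  # same seed as the reprex
n <- 100
subjectID <- paste("Subject", rep(1:10, each = 10))
PN <- runif(n, 0, 100)
alpha <- 0.5 * PN + rnorm(n, mean = 50, sd = 10)
alpha[c(20, 40)] <- c(150, 160)  # the two injected extreme observations
data <- data.frame(subjectID, PN, alpha)
model <- lmer(alpha ~ PN + (1 | subjectID), data = data)

# lme4 registers a cooks.distance() method for merMod objects.
cook <- unname(cooks.distance(model))

# Threshold as reported in this thread (an assumption that it carries over).
threshold <- 0.698073

# Row indices whose Cook's distance exceeds the threshold.
flagged <- which(cook > threshold)
flagged
```

If the flagged rows here disagree with what a plot highlights in red, that points at the flagging or plotting step rather than at the distances themselves.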

rempsyc added a commit that referenced this issue Apr 28, 2024
@rempsyc
Member

rempsyc commented Apr 28, 2024

I have implemented a fix in PR #717. Thank you for the report!
