Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

find_ and get_ methods with models of multiple formula components like clm() & lme() #727

Closed
DrJerryTAO opened this issue Mar 10, 2023 · 18 comments · Fixed by #729
Closed
Assignees
Labels
Bug 🐛 Something isn't working

Comments

@DrJerryTAO
Copy link

DrJerryTAO commented Mar 10, 2023

Hi @strengejacke, I spotted some leaving-out-variable issues with find_...() and get_...() in certain model objects with multiple formula components, especially in the following functions

  • find_predictors()
  • find_variables()
  • find_weights()
  • get_data()

which causes issues in other packages that depend one {insight}.

With nlme::lme(), variables in the weight = varFunc() are not retained. The resolution may require an update to the insight:::find_weights.default() method or adding a new method to find_weights(); or rather treat variables in weight = varFunc() as predictors and update the insight:::find_predictors.default() method or adding a new method to find_predictors().

summary(Model <- lme(
  distance ~ age, data = Orthodont, 
  random = ~ 1 | Subject, 
  weights = varIdent(form = ~ 1 | Sex)))
summary(Model <- lme(
  distance ~ age, data = Orthodont, 
  random = ~ Sex | Subject))
find_predictors(Model)
'$conditional
[1] "age"'
find_variables(Model)
'$response
[1] "distance"
$conditional
[1] "age"
$random
[1] "Subject"'
find_weights(Model) # Did not extract Sex as a variable
'"varIdent(form = ~1 | Sex)"'
class(find_weights(Model)) # "character"
names(get_data(Model)) # Sex was left out
'"distance" "age"      "Subject" '

I think the weights argument in lme() may behave very differently from those in other models, as it is not a numeric vector by which all variables are divided nor an integer vector of observation counts but a function modeling the heteroscedasticity. This is why I suspect that a good solution is to get the variable names through find_predictors() instead of find_weights(). The lme() document shows that weights is

an optional varFunc object or one-sided formula describing the within-group heteroscedasticity structure. If given as a formula, it is used as the argument to varFixed, corresponding to fixed variance weights. See the documentation on varClasses for a description of the available varFunc classes. Defaults to NULL, corresponding to homoscedastic within-group errors.

With ordinal::clm(), variables in the scale = ~ ... and nominal = ~ ... formulae are not retained. The resolution may require an update to the insight:::find_predictors.default() method or adding a new method to find_predictors().

summary(Model <- clm( # nominal + scale effects
  cyl ~ wt, scale = ~ vs, nominal = ~ hp, 
  data = mtcars %>% mutate(cyl = factor(cyl))))
coef(summary(Model))
"                     Estimate   Std. Error       z value  Pr(>|z|)
4|6.(Intercept) 96990.6137075 8.972213e+08  1.081011e-04 0.9999137
6|8.(Intercept) 16987.7165352 4.252071e+11  3.995163e-08 1.0000000
4|6.hp           -772.5798577 7.137124e+06 -1.082481e-04 0.9999136
6|8.hp             -9.4092060 1.757182e+09 -5.354714e-09 1.0000000
wt               4591.3466175 4.280643e+07  1.072583e-04 0.9999144
vs                 -0.3962198 1.837861e+04 -2.155874e-05 0.9999828"
find_variables(Model)
'$response
[1] "cyl"
$conditional
[1] "wt"'
find_predictors(Model)
'$conditional
[1] "wt"'
find_weights(Model) # NULL
names(get_data(Model)) # only "cyl" "wt" got picked
@strengejacke
Copy link
Member

@vincentarelbundock I guess it would be good to have this issue resolved before the next CRAN submission?

@vincentarelbundock
Copy link
Contributor

Yes, would be great, but I'm not sure I'll have much time in the coming weeks, unfortunately.

@strengejacke
Copy link
Member

find_weights() now correctly extracts the name of the related weight-variable:

library(insight)
library(nlme)
data(Orthodont)
Orthodont$w <- abs(rnorm(nrow(Orthodont)))

m1 <- lme(
  distance ~ age, data = Orthodont, 
  random = ~ 1 | Subject, 
  weights = varIdent(form = ~ 1 | Sex))

m2 <- lme(
  distance ~ age, data = Orthodont, 
  random = ~ 1 | Subject)

m3 <- lme(
  distance ~ age,
  data = Orthodont,
  random = ~ 1 | Subject,
  weights = ~w
)

find_weights(m1)
#> [1] "Sex"
find_weights(m2)
#> NULL
find_weights(m3)
#> [1] "w"

Created on 2023-03-12 with reprex v2.0.2

@strengejacke
Copy link
Member

I think the clm-issues are also resolved:

library(insight)
library(ordinal)
data(mtcars)
m2 <- suppressWarnings(clm( # nominal + scale effects
  cyl ~ wt,
  scale = ~vs, nominal = ~hp,
  data = transform(mtcars, cyl = factor(cyl))
))

find_formula(m2)
#> $conditional
#> cyl ~ wt
#> 
#> $scale
#> ~vs
#> 
#> $nominal
#> ~hp
#> 
#> attr(,"class")
#> [1] "insight_formula" "list"

find_variables(m2)
#> $response
#> [1] "cyl"
#> 
#> $conditional
#> [1] "wt"
#> 
#> $scale
#> [1] "vs"
#> 
#> $nominal
#> [1] "hp"

head(get_data(m2))
#>                   cyl    wt vs  hp
#> Mazda RX4           6 2.620  0 110
#> Mazda RX4 Wag       6 2.875  0 110
#> Datsun 710          4 2.320  1  93
#> Hornet 4 Drive      6 3.215  1 110
#> Hornet Sportabout   8 3.440  0 175
#> Valiant             6 3.460  1 105

find_predicors(m2)
#> Error in find_predicors(m2): could not find function "find_predicors"

head(get_predictors(m2))
#>                      wt vs  hp
#> Mazda RX4         2.620  0 110
#> Mazda RX4 Wag     2.875  0 110
#> Datsun 710        2.320  1  93
#> Hornet 4 Drive    3.215  1 110
#> Hornet Sportabout 3.440  0 175
#> Valiant           3.460  1 105

get_parameters(m2)
#>         Parameter      Estimate
#> 1 4|6.(Intercept) 96990.6137075
#> 2 6|8.(Intercept) 16987.7165352
#> 3          4|6.hp  -772.5798577
#> 4          6|8.hp    -9.4092060
#> 5              wt  4591.3466175
#> 6              vs    -0.3962198

parameters::model_parameters(m2, effects = "all", components = "all")
#> # Intercept
#> 
#> Parameter       | Log-Odds |       SE |                       95% CI |         z |      p
#> -----------------------------------------------------------------------------------------
#> 4|6 (Intercept) | 96990.61 | 8.97e+08 | [       -1.76e+09, 1.76e+09] |  1.08e-04 | > .999
#> 6|8 (Intercept) | 16987.72 | 4.25e+11 | [       -8.33e+11, 8.33e+11] |  4.00e-08 | > .999
#> 4|6 hp          |  -772.58 | 7.14e+06 | [       -1.40e+07, 1.40e+07] | -1.08e-04 | > .999
#> 6|8 hp          |    -9.41 | 1.76e+09 | [       -3.44e+09, 3.44e+09] | -5.35e-09 | > .999
#> 
#> # Location Parameters
#> 
#> Parameter | Log-Odds |       SE |                   95% CI |        z |      p
#> ------------------------------------------------------------------------------
#> wt        |  4591.35 | 4.28e+07 | [   -8.39e+07, 8.39e+07] | 1.07e-04 | > .999
#> 
#> # Scale Parameters
#> 
#> Parameter | Log-Odds |       SE |                95% CI |         z |      p
#> ----------------------------------------------------------------------------
#> vs        |    -0.40 | 18378.61 | [-36021.82, 36021.02] | -2.16e-05 | > .999
#> 
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#>   using a Wald z-distribution approximation.
#> 
#> The model has a log- or logit-link. Consider using `exponentiate =
#>   TRUE` to interpret coefficients as ratios.

Created on 2023-03-12 with reprex v2.0.2

@strengejacke
Copy link
Member

@vincentarelbundock can you check if #729 fixes the marginal effects issues?

@DrJerryTAO
Copy link
Author

DrJerryTAO commented Mar 12, 2023

Great. Just a heads-up: in clm() and other discrete choice models, when scale effects are included, which do not measure log-odds, coefficients in location and intercept parameters no longer measure log-odds directly, and exp(coef) are no longer odds ratios. So the following parameter interpretation can be misleading.

Parameter | Log-Odds | SE | 95% CI | z | p

@vincentarelbundock
Copy link
Contributor

@strengejacke it looks like it is working for ordinal::clm, but I won't know if it works with lme until I try to add full support for it. For that, I at least need to wait for the functionality to be merged in insight master, so don't wait for a review of those before merging.

Thanks for the updates!

strengejacke added a commit that referenced this issue Mar 12, 2023
…e clm() & lme() (#729)

* find_ and get_ methods with models of multiple formula components like clm() & lme()
Fixes #727

* fix lme issue

* news, version

* fix for clm

* fix test
@DrJerryTAO
Copy link
Author

Hi @vincentarelbundock,
For lme() model objects, find_weights() does find all variable names used in weights = ... and get_data() retrieve all variables as expected. However, because weight = varFunc() can contain multiple terms, it might be better to treat it as a predictor or formula component to be listed in find_variables() instead. Otherwise, it has caused problems like strengejacke/ggeffects#295 in other packages that expect one single character from find_weights().

@strengejacke
Copy link
Member

Do you have an example where it fails for multiple predictors in find_weights()? From a conceptional perspective, I'm a bit hesitant to include all weight-predictors in the output of find_variables().

@DrJerryTAO
Copy link
Author

See strengejacke/ggeffects#295 (comment). The problem is in other packages that expect one single character from find_weights().

@strengejacke
Copy link
Member

what would get_weights() to be expected to return? This here?

library(nlme)

data("Orthodont", package = "nlme")
m <- lme( # a model of variance only
  distance ~ 1, data = Orthodont, # grand mean
  weights = varConstPower(form = ~ age | Sex))

attributes(m$modelStruct$varStruct)$weights
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 0.03039196 
#>       Male       Male       Male       Male       Male       Male       Male 
#> 0.06375142 0.04744622 0.03727197 0.03039196 0.06375142 0.04744622 0.03727197 
#>       Male     Female     Female     Female     Female     Female     Female 
#> 0.03039196 0.07470279 0.07491239 0.07506866 0.07519109 0.07470279 0.07491239 
#>     Female     Female     Female     Female     Female     Female     Female 
#> 0.07506866 0.07519109 0.07470279 0.07491239 0.07506866 0.07519109 0.07470279 
#>     Female     Female     Female     Female     Female     Female     Female 
#> 0.07491239 0.07506866 0.07519109 0.07470279 0.07491239 0.07506866 0.07519109 
#>     Female     Female     Female     Female     Female     Female     Female 
#> 0.07470279 0.07491239 0.07506866 0.07519109 0.07470279 0.07491239 0.07506866 
#>     Female     Female     Female     Female     Female     Female     Female 
#> 0.07519109 0.07470279 0.07491239 0.07506866 0.07519109 0.07470279 0.07491239 
#>     Female     Female     Female     Female     Female     Female     Female 
#> 0.07506866 0.07519109 0.07470279 0.07491239 0.07506866 0.07519109 0.07470279 
#>     Female     Female     Female 
#> 0.07491239 0.07506866 0.07519109

Created on 2023-03-13 with reprex v2.0.2

@DrJerryTAO
Copy link
Author

DrJerryTAO commented Mar 13, 2023

what would get_weights() to be expected to return?

I am not entirely sure. It depends on how other functions and packages use get_weights(). I think you are on the right path but what you got are in-sample estimates. But I suspect that it should be coefficients of the variance function to be evaluated with future values of predictors. So, it will be coef(object$modelStruct$varStruct) plus several other $modelStruct$varStruct components. Using the same model

summary(Model <- lme( # a model of variance only
  distance ~ 1, data = Orthodont, # grand mean
  weights = varConstPower(form = ~ age | Sex))) # var = b0 + |age|^b1
"       AIC      BIC    logLik
  514.5723 533.2821 -250.2861

Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:     1.84895 0.109692

Variance function:
 Structure: Constant plus power of variance covariate, different strata
 Formula: ~age | Sex 
 Parameter estimates:
              Male     Female
const 0.0001144912 13.0138472
power 1.3237960288 -0.4748515

Fixed effects:  distance ~ 1 
               Value Std.Error DF  t-value p-value
(Intercept) 23.42394 0.4046777 81 57.88296       0"

The model above in Variance function describes that the weights for the first Subject over four measurements are

1 / ( # same as varWeights
  exp(Model$modelStruct$varStruct[["const"]]['Male']) + 
    Model$data$age[1:4] ^
    Model$modelStruct$varStruct[["power"]]['Male'] ) 

As confirmed below

class(Model$modelStruct$varStruct) # "varConstPower" "varFunc"
Model$sigma # 0.109692
Model$modelStruct$varStruct # prints exp(const) and power
summary(Model$modelStruct$varStruct) # prints exp(const) and power
str(summary(Model$modelStruct$varStruct)) # many components
attributes(Model$modelStruct$varStruct) # # many components

varWeights(Model$modelStruct$varStruct)[1:4] # sample estimates
"      Male       Male       Male       Male 
0.06375142 0.04744622 0.03727197 0.03039196"
attributes(Model$modelStruct$varStruct)$weights[5:8] # Subject M02, same
"      Male       Male       Male       Male 
0.06375142 0.04744622 0.03727197 0.03039196"
coef( # coef() const different from Parameter estimates in summary() print
  Model$modelStruct$varStruct) 
"  const.Male const.Female   power.Male power.Female 
  -9.0750128    2.5660140    1.3237960   -0.4748515"
Model$modelStruct$varStruct[["const"]] # same as coef($varStruct)
exp(Model$modelStruct$varStruct[["const"]]) # same in summary() print
Model$modelStruct$varStruct[["power"]] # same in summary() print
1 / ( # same as varWeights
  exp(Model$modelStruct$varStruct[["const"]]['Male']) + 
    Model$data$age[1:4] ^
    Model$modelStruct$varStruct[["power"]]['Male'] ) 
"0.06375142 0.04744622 0.03727197 0.03039196"

I am not entirely sure, but I think that the population SE is related to both age and Sex in the way that

  • For Male, population SE = residual_SE * (0.0001144912 + age^1.3237960288)
  • For Female, population SE = residual_SE * (13.0138472 + age^1.3237960288)
mean(Model$sigma / varWeights(Model$modelStruct$varStruct)) # 2.132228
summary(Model <- lme(distance ~ 1, data = Orthodont))
Model$sigma # 2.220312 similar to above 

Given that age could change in a future prediction and the prediction could be made for either Sex group, the population SE of the Male group at age 7 and 15 (out of sample range) should be calculated as

Model$sigma * (0.0001144912 + 7^1.3237960288) # 1.441837
Model$sigma * (0.0001144912 + 15^1.3237960288) # 3.954406

where Model$sigma / varWeights(Model$modelStruct$varStruct) cannot be used because it is static and does not change with age.

For weights = varIdent(), the coefficient must be interpreted differently. So the evaluation of weights = in prediction needs to be varClass specific.

data("Orthodont", package = "nlme")
summary(Model <- lme( # a model of variance only
  distance ~ 1, data = Orthodont, # grand mean
  weights = varIdent(form = ~ 1 | Sex)))
"Random effects:
 Formula: ~1 | Subject
        (Intercept) Residual
StdDev:    2.016065 2.591358

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | Sex 
 Parameter estimates:
     Male    Female 
1.0000000 0.5651101"

Model$sigma # 2.591358
str(summary(Model$modelStruct$varStruct)) # many components
varWeights(Model$modelStruct$varStruct) # sample estimates
"Male 1.000000 Female 1.769567"
1/varWeights(Model$modelStruct$varStruct) # Male has larger SE
"1.0000000 0.5651101 " # matches Parameter estimates above
coef(Model$modelStruct$varStruct) # -0.5707348
exp(coef(Model$modelStruct$varStruct)) # 0.5651101 same as SE ratio

@vincentarelbundock
Copy link
Contributor

I had never used these model-fitting functions before, and am I just now learning that weights does something very different in those models.

I do not believe that we should use the same get_weights() function for a different functionality, even if that different functionality shares the same name. get_weights() should behave consistently, and simply not work for models where weights play a different role.

@strengejacke
Copy link
Member

So what does this mean for get_weights() for this example:

lme( distance ~ 1, data = Orthodont,  weights = varConstPower(form = ~ age | Sex))
  1. Should it return NULL, and only work when weights is a one-sided formula indicating an actual weight-variable in the data?
  2. Should it return a data frame with variables age and Sex? (and their original values from the data frame)
  3. Should it return a vector of weights that is saved as attribute?

@vincentarelbundock
Copy link
Contributor

What does the function return for all other models? Just a numeric vector, right? If that's irrelevant, I vote for raising an informative error. But I don't have an especially strong view, except against proposals like "returning coefficients of the variance function", which would be very different from what we have now.

@strengejacke
Copy link
Member

Just a numeric vector, right?

Yes. that's what usually is returned.

@vincentarelbundock
Copy link
Contributor

Right, so I would say return a numeric vector or an error. But again, I can't have a strong view because I don't know if/when that's a relevant data structure.

@DrJerryTAO
Copy link
Author

Yes. that's what usually is returned.
Right, so I would say return a numeric vector or an error. But again, I can't have a strong view because I don't know if/when that's a relevant data structure.

The current resolution seems the best before some more work is done to extract estimated variance functions from lme() as a regular function like w = function(x, coef){} where coef is estimated in lme(). I tried but couldn't figure out how to form a function w. There are multiple different varFunc classes and each seems to behave differently. Once done, it would be possible to calculate SE for prediction intervals by making an additive term to fixed-effect SE. Or customize predict.lm(..., weights = 1/w) when generating prediction intervals, because predict.lm() does allow weights to be a user-supplied function of variables supplied in newdata.

Should it return NULL, and only work when weights is a one-sided formula indicating an actual weight-variable in the data?

No. When weights is a one-sided formula, it is still a varFunc class, and the variable could be assessed very differently depending on the specific varFunc used, as shown https://fw8051statistics4ecologists.netlify.app/gls.html#variance-increasing-with-x_i

Should it return a data frame with variables age and Sex? (and their original values from the data frame)

Not sure. This is stored in the attribute() though.

Should it return a vector of weights that is saved as attribute?

Seems best. Not sure how other functions and packages use the weights. In WLS and GLS models, weight are mostly useful to get the correct vcov estimates. But since vcov() can easily extract the already estimated vcov, I don't know if weights can be useful again. But perhaps when making in-sample prediction intervals, where sigma/weights of lme() and gls() can be used to adjust SE and CI over the sample ranges. But it won't work correctly (or as intended) for newdata where weights-variables are given but actual weights cannot be evaluated to find the prediction variance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug 🐛 Something isn't working
Projects
None yet
3 participants