Hugely different R2 values for slightly different models #332
One thought is whether it could be related to offset terms. This made me wonder whether the R2 defined for quasi-Poisson / negative-binomial models needs to be tweaked to handle offset terms. The Nakagawa R2 is defined in eqn 3.3 here as:

$$R^2_{\mathrm{GLMM}(m)} = \frac{\sigma_f^2}{\sigma_f^2 + \sum_{l} \sigma_l^2 + \sigma_\varepsilon^2}$$

where $\sigma_f^2$ is the variance of the fixed-effect predictions, the $\sigma_l^2$ are the random-effect variances, and $\sigma_\varepsilon^2$ is the observation-level (distribution-specific) variance. In the standard no-offset model, the intercept of the null model estimates the overall mean, which is what the distribution-specific variance is computed from; a large offset shifts that estimate.
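For reference, a minimal sketch of how those components are assembled for a fitted model (as far as I know, `insight::get_variance()` is what `performance::r2()` uses under the hood; `fit` stands in for any of the glmmTMB fits below):

```r
# Minimal sketch: the Nakagawa R2s assembled "by hand" from the variance
# components that insight extracts (fit = any fitted glmmTMB model).
v <- insight::get_variance(fit)

r2_marginal <- v$var.fixed /
  (v$var.fixed + v$var.random + v$var.residual)

r2_conditional <- (v$var.fixed + v$var.random) /
  (v$var.fixed + v$var.random + v$var.residual)
```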
A couple of thoughts:

**Comparing pseudo-R2s**

In a linear model, R2 has several equivalent interpretations:

- the proportion of variance in the outcome explained by the predictors;
- the squared correlation between observed and fitted values;
- the proportional reduction in squared error relative to an intercept-only model.
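For example, checking these equivalences numerically on a built-in dataset:

```r
# All of these quantities coincide for an OLS model with an intercept:
m <- lm(mpg ~ wt + hp, data = mtcars)

cor(mtcars$mpg, fitted(m))^2                                  # squared correlation
var(fitted(m)) / var(mtcars$mpg)                              # variance explained
1 - sum(resid(m)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)  # reduction in error
summary(m)$r.squared                                          # reported R2
```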
In a generalized linear model, no single statistic has all of these meanings at once, so it's not always easy to compare a pseudo-R2 with the R2 from a normal model. For example:
Here is a nice FAQ on various pseudo-R2 values. Here is a comparison of various pseudo-R2 estimators for binomial/logistic models. For mixed models, the Nakagawa R2 values are most similar to McKelvey & Zavoina's R2. Personally, I don't think these are particularly meaningful: the distribution variance has really different meanings across families, and in basically any non-Normal family it doesn't meaningfully indicate "model fit" in any real sense (in my opinion).

**Model choice**

I wouldn't recommend the logcpm model. It's kind of a kludgy solution versus modeling the counts with a more appropriate distribution (such as Poisson and friends). As you note, with the many 0s, the residuals in the logcpm model aren't close to normally distributed.

The reason you are getting the warning message is that the estimated mean of the null distribution (and thus the variance of the null distribution) is zero. This is because of the huge offset you are applying. In the lognormal model, you effectively have an offset of `log(libsize) - log(1e6)`, so try something like:

```r
fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
                             offset = log(libsize) - log(1e6),
                             ziformula = ~ 1 + celltype,
                             data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
#
#   Conditional R2: 0.127
#      Marginal R2: 0.081
```

In the figure you give, the cell types aren't very different from each other, so I would expect a modest pseudo-R2 in terms of variance accounted for. The values for the normal linear models seem oddly high based on the figure.
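To see where that zero mean comes from, a minimal sketch (assuming library sizes on the order of 1e6, as in the logcpm definition):

```r
# With the raw log(libsize) offset, the null model's intercept is the log
# rate per *single read*, so the implied mean count per read is ~0:
fit_null_raw <- glmmTMB::glmmTMB(counts ~ 1 + (1 | individual),
                                 offset = log(libsize),
                                 data = data_dt, family = glmmTMB::nbinom2)
glmmTMB::fixef(fit_null_raw)$cond        # large negative intercept (~ -14 if libsize ~ 1e6)
exp(glmmTMB::fixef(fit_null_raw)$cond)   # estimated null mean, essentially zero
```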
Hi @bwiernik, thanks very much for the comments and resources.

Regarding model choice, something like a mixed negative binomial model is clearly the right choice for this kind of data; I just fit the logcpm models to give more interpretable comparators for the pseudo-R2 methods. I agree that a variance-based model doesn't make much sense for something counts-based, especially when the counts are small. (I'm also not a big fan of an R2 measure whose value changes so radically with a trivial reparameterization of the model...) Although for genes with larger counts, the GLMM and log-normal models should be closer, so should make more sense, right?

Conceptually, I like the relative likelihood approaches. However, it doesn't seem like these are an option for non-logistic GLMMs, at least at the moment. Is that correct? Is there no equivalent to Nagelkerke's R2 where you calculate it with both the full-model and null-model likelihoods?

The broader context is that I am fitting this model to all of our genes (i.e. the design matrix remains identical, and we fit each gene's different set of counts). The reason the Nakagawa R2 is appealing is that it allows us to see, broadly, which genes are highly determined by celltype, and which vary a lot across individuals. Do you have any thoughts on what the best approach to do something like this would be? At the moment I'm thinking of using the Nakagawa results, but adding a warning that these may not be so trustworthy for genes with smaller counts.

Many thanks
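For instance, something along these lines (a rough sketch; `fit_full` and `fit_null` are hypothetical names for the celltype model and an intercept-only null fitted with the same family and offset):

```r
# Nagelkerke-style R2 assembled from the two likelihoods:
ll_full <- as.numeric(logLik(fit_full))
ll_null <- as.numeric(logLik(fit_null))
n <- nobs(fit_full)

r2_coxsnell   <- 1 - exp(2 * (ll_null - ll_full) / n)  # Cox & Snell
r2_nagelkerke <- r2_coxsnell / (1 - exp(2 * ll_null / n))
```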
Also a brief note on your zero-inflated model - you get R2 values of around 13%:

```r
fit_glmm2 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
                             offset = log(libsize) - log(1e6),
                             ziformula = ~ 1 + celltype,
                             data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm2)
# # R2 for Mixed Models
#
#   Conditional R2: 0.127
#      Marginal R2: 0.081
```

However, if I fit a model with your suggested offset adjustment, but without the zero inflation, I get much higher R2 values:

```r
fit_glmm3 = glmmTMB::glmmTMB(counts ~ celltype + (1|individual),
                             offset = log(libsize) - log(1e6),
                             data = data_dt, family = glmmTMB::nbinom2)
performance::r2(fit_glmm3)
# # R2 for Mixed Models
#
#   Conditional R2: 0.670
#      Marginal R2: 0.427
```

This seems odd - surely a more complex model should better explain the data?
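For what it's worth, the direct likelihood comparison of the two fits would be:

```r
# The models are nested (fit_glmm3 is fit_glmm2 without the
# zero-inflation component), so they can be compared directly:
AIC(fit_glmm2, fit_glmm3)
anova(fit_glmm3, fit_glmm2)  # likelihood-ratio test
```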
I had forgotten to remove the zero-inflation bit. Thanks for checking.

With respect to the values, the "more complex model" heuristic you are using just doesn't work outside of normal linear models. I am not exactly sure offhand how the Nakagawa R2 functions are written to handle mixture models with zero-inflation or dispersion parameters. In particular, the distributional variance is going to be extremely tricky with both binomial and Poisson components in the model; I don't know if we handle that well here. But I don't think those values are going to be meaningful either way. Like I said above, the distributional variance is determined by which likelihood family you choose, and it doesn't really work for the type of interpretation you're wanting to do.

These are fairly trivial reparameterizations as far as the model coefficients go, but the underlying likelihood functions being used to fit them are quite different and, especially, their statements about the variance of the response variable are extremely different (a normal distribution has a free variance; a negative binomial model has a variance proportional to the mean). Statistics that rely on those variance estimates will be very dependent on the model form.

With that in mind, I have 3 suggestions:

1. Compare models using the likelihood and statistics derived from it (such as AIC) rather than R2.
2. If you want an R2-like summary, use a likelihood-based R2 rather than a variance-based one.
3. Directly compare the size of the random effects SDs for individuals and observations (residuals) across the full and null models.
You can compute a likelihood-based R2 for any model by asking for it specifically.
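For example (a minimal sketch using the earlier fit; support for particular model classes may vary):

```r
# Likelihood-based pseudo-R2s from performance, requested explicitly
# rather than through the default performance::r2() dispatch:
performance::r2_mcfadden(fit_glmm3)
performance::r2_coxsnell(fit_glmm3)
performance::r2_nagelkerke(fit_glmm3)
```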
I'm not sure I get why a more complex model wouldn't give a better fit here. The set of models without zero-inflation is a strict subset of the models with zero-inflation, no? Can the change in the model mean that the best option without zero-inflation no longer has the highest likelihood? (In any case, I wasn't intending to use a zero-inflated model.) I get that in the negative binomial the variance is a function of the mean (for nbinom2, μ + μ²/θ) rather than a free parameter.

Thanks for the suggestions. Would you mind clarifying what you mean by "null model" here? Thinking about it, there are four models that are potentially relevant, at least the first two of which seem like possible nulls:

1. `counts ~ 1 + (1|individual)` (intercept plus the random effect)
2. `counts ~ 1` (intercept only)
3. `counts ~ celltype` (fixed effects only)
4. `counts ~ celltype + (1|individual)` (the full model)

I'm a little confused by "directly compare the size of the random effects SDs for individuals and observations (residuals) across the full and null models". My attempt at parsing this: fit the full and the null model, extract the SD of the `(1|individual)` term and the observation-level (residual) SD from each, and compare these across the two fits.
There are several points where I don't find this so clear...
(Regarding your final comment, if I call …)
R2 does not indicate "better fit" except in the case of linear regression. The likelihood, and statistics derived from it such as AIC, are what indicate model fit. You really should not try to interpret R2 values in this way when using GLMs (I would regard the log-normal model as a GLM in this case as well).

So for example:

```r
mod_full <- glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
                             offset = log(libsize) - log(1e6),
                             data = data_dt, family = glmmTMB::nbinom2)
mod_null <- glmmTMB::glmmTMB(counts ~ 1 + (1 | individual),
                             offset = log(libsize) - log(1e6),
                             data = data_dt, family = glmmTMB::nbinom2)
parameters::compare_parameters(mod_full, mod_null, effects = "random")
```

Then look at either the ratio of the individual or observation/residual SDs across the models, depending on which is your level of interest (I assume the residual, since I assume individual-level variation is not of substantive interest?).
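That comparison can also be done directly for the individual-level SD, e.g. (a minimal sketch; `VarCorr()` returns the conditional-model random-effect SDs for glmmTMB fits):

```r
# Extract the individual-level random-effect SD from each fit:
sd_full <- attr(glmmTMB::VarCorr(mod_full)$cond$individual, "stddev")
sd_null <- attr(glmmTMB::VarCorr(mod_null)$cond$individual, "stddev")

sd_full / sd_null            # ratio of SDs across full and null models
1 - (sd_full / sd_null)^2    # proportional reduction in individual-level variance
```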
I've been fitting some mixed models to genomic (transcriptomic) count data and wanted to assess model fit. I've tried out a few models and a few options for R2; however, they give wildly different values.
I have a couple of questions:

- Why do slightly different models give such hugely different R2 values?
- Does the warning raised for the GLMM indicate a problem with the model, or with how its R2 is calculated?
Thanks!
Will
**MWE**
The data has 8 different celltypes, across 196 individuals (for almost all individuals we have all 8 possible values). `counts` corresponds to how many times we observe a given gene, `libsize` is the offset for this observation, and `logcpm = log(counts / libsize * 1e6 + 1)`, i.e. log values slightly pushed away from 0 and normalized to the expected range.
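For concreteness, a simulated stand-in with the same shape (every distribution and parameter value here is invented purely for illustration):

```r
# Simulated stand-in for the real data; all numbers below are made up.
library(data.table)
set.seed(1)

data_dt <- CJ(individual = factor(1:196), celltype = factor(paste0("ct", 1:8)))
ind_eff <- rnorm(196, mean = 0, sd = 0.5)  # individual-level effects
ct_eff  <- rnorm(8,   mean = 0, sd = 0.3)  # celltype effects

data_dt[, libsize := round(runif(.N, 5e5, 5e6))]
data_dt[, counts := rnbinom(.N, size = 2,
                            mu = exp(1 + ind_eff[individual] + ct_eff[celltype]) *
                                 libsize / 1e6)]
data_dt[, logcpm := log(counts / libsize * 1e6 + 1)]
```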
Plotting these log values indicates that we expect both celltype and individual to be relevant:
I fit 4 models and calculate R2 for each:

1. a negative binomial GLMM: `counts ~ celltype + (1|individual)` with `offset = log(libsize)`
2. a log-linear mixed model: `logcpm ~ celltype + (1|individual)`
3. the equivalent negative binomial GLM, without the random effect
4. the equivalent log-linear model, without the random effect
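For models 1 and 2 (the pair in question), the fitting calls look roughly like this:

```r
# Sketch of models 1 and 2 (names fit1/fit2 are illustrative):
fit1 <- glmmTMB::glmmTMB(counts ~ celltype + (1 | individual),
                         offset = log(libsize),
                         data = data_dt, family = glmmTMB::nbinom2)
fit2 <- lme4::lmer(logcpm ~ celltype + (1 | individual), data = data_dt)

performance::r2(fit1)  # ~8%, and raises a warning (model 1)
performance::r2(fit2)  # ~74% (model 2)
```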
The main thing I don't understand is why there is such a large difference between models 1 and 2. An R2 of 8% just doesn't seem plausible when the distributions are so tight (while an R2 of 74% for model 2 seems about right). The coefficient estimates are pretty similar, so it seems more likely that the difference comes from how the R2 is calculated, rather than from the fits being wildly different.
Does the warning for model 1 indicate that there are issues here? If so, what could they be? To me it doesn't seem like an unreasonably complex model for the data...
I've half-read the Nakagawa paper, so could it be that for model 1 the R2 is based on log-scale values, while for model 2 it uses deviance? Could that explain the huge difference?
(The differences between the GLM and log-linear models seem to make sense, as with many zeros present the GLM should better capture the data.)