comment on the p-direction post #40
Hi Bob, thanks for your comment! I generally agree with you - MAPs have been shown to be poor estimates (but their appeal as the "most probable value" and their close match with MLEs is just too tempting, I guess). The "penalization" placed by the prior is not only good for error rates, but also for estimation. One failure of frequentist methods is their lack of incorporation of priors (be they from previous data collection or from subjective expectations), making all decisions and estimations based on the observed data alone. We really should add a comment section (disqus for blogdown?)...
Bob, thank you for your comment! I'm not sure if you have noticed our accompanying paper related to the post, Indices of Effect Existence and Significance in the Bayesian Framework (there was a kind of "mutual inspiration" between the development of the bayestestR package and the writing of the paper, thus the relationship...).
I think what is important, from our perspective, not only in Bayesian inference but also in a frequentist framework, is to take the uncertainty of estimation into account (or even focus on it). Thus we suggest using at least two indices of effect "existence" and "significance" (not necessarily in the sense of statistical significance). The pd is one of these possible indices that works well, in particular due to its 1:1 correspondence to the p-value, which makes it easier to make Bayesian methods attractive to "non-Bayesians". I guess this blog post could not carry this bigger picture of "focus on uncertainty"...
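For concreteness, here is a rough Python sketch of the pd and its claimed correspondence to a two-sided p-value. This is illustrative only, not bayestestR's implementation; the toy posterior draws and function name are made up:

```python
import random

random.seed(1)

def p_direction(draws):
    """Probability of direction: the share of posterior draws on the
    dominant side of zero (always between 0.5 and 1)."""
    pos = sum(d > 0 for d in draws) / len(draws)
    return max(pos, 1 - pos)

# Toy "posterior": normal draws centered a bit away from zero.
draws = [random.gauss(0.3, 1.0) for _ in range(10_000)]
pd = p_direction(draws)

# The one-to-one correspondence mentioned above (for a two-sided test):
# p is approximately 2 * (1 - pd).
p_equiv = 2 * (1 - pd)
print(pd, p_equiv)
```

With these toy draws, pd lands around 0.6, so the implied two-sided p is around 0.75.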
You mean shrinkage? But isn't that a desired property of mixed models to avoid overly "strong effects" (estimates)?
We had a blog post some months ago where Aki responded (via email). He wrote about predictive model checking, but although I have read some papers from Aki and you (just cursorily), I did not fully understand how to best use posterior predictive inference (I hope I understood you correctly here).
True, but I think this should apply to frequentist framework as well.
That's a good point, in particular since we are working on a paper on how to use informative priors in rstanarm, and how different priors affect the model results.
I think I got what you mean. So it would make sense to "warn" the users of
That's something I'm 100% behind :-)
This isn't true. While a strict philosophical frequentist won't let you talk about a distribution over a parameter (so no prior or posterior distributions), they're totally OK with penalized maximum likelihood. That is, they're often OK introducing some bias for reduced variance. Given parameters theta and data y, the maximum likelihood estimate is

theta* = argmax_theta log p(y | theta).

A penalized maximum likelihood estimate just adds a penalty term. For instance, the equivalent of a normal prior would be an L2 penalty, e.g.,

theta* = argmax_theta log p(y | theta) - lambda * sum(theta^2).

The resulting penalized MLE is equivalent to the MAP estimate of a Bayesian model with a normal(0, sigma) prior, where lambda = 1 / (2 * sigma^2). Here's a link to the classic Efron and Morris (1975) paper about Stein's estimator. They go much further and build a hierarchical model to which they apply max marginal likelihood by marginalizing out the lower-level coefficients and optimizing the hierarchical regularization parameters. This is the same as modern systems like lme4.
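A toy numerical check of that equivalence, in Python with an assumed normal(theta, 1) likelihood and made-up data: the L2-penalized MLE and the MAP under a normal(0, sigma0) prior coincide exactly when lambda = 1 / (2 * sigma0^2):

```python
import random, math

random.seed(2)
y = [random.gauss(1.5, 1.0) for _ in range(50)]
n, ybar = len(y), sum(y) / len(y)

# Penalized MLE: argmax_theta sum(log N(y_i | theta, 1)) - lam * theta^2.
# Setting the derivative to zero gives theta = n * ybar / (n + 2 * lam).
lam = 0.5
theta_pen = n * ybar / (n + 2 * lam)

# MAP under prior theta ~ N(0, sigma0^2): theta = n * ybar / (n + 1 / sigma0^2).
# Choosing sigma0 so that lam = 1 / (2 * sigma0^2) makes the two identical.
sigma0 = math.sqrt(1 / (2 * lam))
theta_map = n * ybar / (n + 1 / sigma0 ** 2)

print(theta_pen, theta_map)  # identical, and both shrunk toward zero
```

Both estimates are pulled below the raw sample mean, which is the "bias for reduced variance" trade mentioned above.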
It kind of looks like a p-value, but it doesn't act like a p-value, so I fear this is just going to confuse people. Worse yet, I fear it may encourage them to think in terms of a binary significant/not-significant decision, which is what we want to leave behind with the frequentists. Instead, we want to do something like decision analysis (e.g., as described in Bayesian Data Analysis) to make decisions, which considers not only posterior uncertainty, but magnitude and its effect on things we care about, like the utility of decisions.
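As a hedged illustration of the decision-analysis idea (the posterior draws, utility function, and cost below are all invented for the example), one can compare posterior expected utilities instead of checking a significance threshold:

```python
import random

random.seed(3)
# Toy posterior draws for a treatment effect (assumed, not a real fit).
effect = [random.gauss(0.4, 0.5) for _ in range(10_000)]

cost = 0.1  # assumed cost of acting, in the same units as the effect

def utility(theta, act):
    # Utility of acting is the realized effect minus the cost; not acting is 0.
    return theta - cost if act else 0.0

# Bayesian decision analysis: pick the action with the higher posterior
# expected utility, not the one that crosses a significance line.
eu_act = sum(utility(t, True) for t in effect) / len(effect)
eu_skip = sum(utility(t, False) for t in effect) / len(effect)
decision = "act" if eu_act > eu_skip else "don't act"
print(eu_act, eu_skip, decision)
```

Note the decision depends on the magnitude of the effect and the cost, not on whether zero is inside some interval.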
No, I mean regularization to population means. For example, see the baseball ability estimates from the Efron and Morris paper I link above. The idea is that you only have a few dozen observations per player, so you want to shrink not to zero, but to the population average. Then if you see no data, your guess is the population average, not zero. The point of mixed models is to figure out how much pooling to do between no pooling (infinite hierarchical variance) on one side to complete pooling (hierarchical variance zero) on the other. I discuss this in my Stan case study that recasts Efron and Morris in Bayesian terms.
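A minimal sketch of shrinkage toward the population mean, using the closed-form normal-normal partial-pooling estimate; all the numbers below are made up for illustration and are not the Efron and Morris data:

```python
# Shrinkage toward the population mean, not zero: the classic
# normal-normal partial-pooling estimate as a precision-weighted average.
sigma2 = 1.0   # within-group (observation) variance, assumed known
tau2 = 0.25    # between-group (hierarchical) variance
mu = 0.26      # population mean (e.g., a league-wide batting average)

def partial_pool(ybar_j, n_j):
    """Precision-weighted average of the group mean and the population mean."""
    w_data = n_j / sigma2
    w_prior = 1 / tau2
    return (w_data * ybar_j + w_prior * mu) / (w_data + w_prior)

# A group with few observations is pulled strongly toward mu, while a
# group with many observations keeps its own mean almost unchanged.
print(partial_pool(0.40, 10), partial_pool(0.40, 10_000))
```

Letting tau2 go to infinity recovers no pooling, and tau2 = 0 gives complete pooling, exactly the two extremes described above.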
Glad to hear I'm not the only person obsessing about definitions on the web.
The idea is that you average over your posterior, so that's

p(y-new | y) = INTEGRAL p(y-new | theta) * p(theta | y) d theta,

which in MCMC terms is just

p(y-new | y) ~= (1 / M) * sum_m p(y-new | theta(m)),

where theta(1), ..., theta(M) are draws from the posterior p(theta | y).
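A small Python illustration of averaging the likelihood over posterior draws rather than plugging in a single point estimate. The "posterior draws" here are a stand-in generated directly, not real MCMC output:

```python
import random, math

random.seed(4)

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Stand-in posterior draws for a mean parameter (in practice these
# would come from MCMC, e.g., a Stan fit).
mu_draws = [random.gauss(1.0, 0.2) for _ in range(4_000)]

def posterior_predictive_density(y_new, sd=1.0):
    """p(y_new | y) ~= (1/M) * sum_m p(y_new | mu(m)): average the
    likelihood over posterior draws instead of plugging in one estimate."""
    return sum(normal_pdf(y_new, m, sd) for m in mu_draws) / len(mu_draws)

# Plug-in prediction at a tail point vs. the posterior-averaged prediction:
plug_in = normal_pdf(3.0, sum(mu_draws) / len(mu_draws), 1.0)
averaged = posterior_predictive_density(3.0)
print(plug_in, averaged)
```

In the tails the averaged density is a bit larger than the plug-in density: propagating parameter uncertainty fattens the predictive distribution, which is exactly the point of not collapsing to an estimate.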
How so? P.S. You might want to compare notes with Rasmus Bååth's work on his Bayesian First Aid package, which translates many classical notions like hypothesis tests into Bayesian frameworks.
Thanks Bob for all the explanations! It wasn't clear to me until now how hierarchical shrinkage actually works!
Can you elaborate on this? To me, they hold the same amount of information (for p-values in NHST, at least), and they behave the same in the sense that they are consistent under H1 (the p-value gets smaller, and pd gets bigger), but not under H0 (the p-value is uniform between 0 and 1, pd is uniform between 0.5 and 1).
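A quick simulation along these lines, assuming a known-variance normal model with a flat prior so that pd has a closed form, reproduces both claimed behaviors under H0 as well as the exact 2 * (1 - pd) correspondence:

```python
import random, math

random.seed(5)

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sim(true_mu, n=30):
    y = [random.gauss(true_mu, 1.0) for _ in range(n)]
    z = (sum(y) / n) / (1 / math.sqrt(n))   # z statistic, sigma known
    p = 2 * (1 - phi(abs(z)))               # two-sided p-value
    pd = max(phi(z), 1 - phi(z))            # pd under a flat prior
    return p, pd

# Under H0 the p-value spreads over (0, 1) while pd is squeezed into (0.5, 1).
h0 = [one_sim(0.0) for _ in range(2_000)]
print(min(pd for _, pd in h0), max(p for p, _ in h0))
```

In this idealized normal case the identity p = 2 * (1 - pd) holds exactly for every simulated data set.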
On Feb 28, 2020, at 2:18 AM, Mattan S. Ben-Shachar ***@***.***> wrote:
Thanks Bob for all the explanations! It wasn't clear to me until now how hierarchical shrinkage actually works!
It kind of looks like a p-value, but it doesn't act like a p-value
Can you elaborate on this?
The basis is completely different.
For p-values, the basis is the distribution of an estimator under varying data samples, i.e., p(f(y)) where y = y[1], ..., y[N] is the data and f is some estimator. The data y are assumed to be random variables so that f(y) is also a random variable. For example, the maximum likelihood estimator for standard deviation is
sigma*(y) = sqrt(sum((y - mean(y))^2) / N).
But it's biased to the low side because it's based on the sample mean mean(y). Luckily, there's a well known correction,
sigma-hat(y) = sqrt(sum((y - mean(y))^2) / (N - 1)).
These are two different estimators of sigma from data and they will have slightly different properties.
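A quick simulation with invented data (true sigma = 1, small samples of size 5) makes the difference between the two estimators visible; the N-divisor version comes out noticeably low on average:

```python
import random, math

random.seed(6)

def sd_mle(y):
    """Divides by N: the maximum likelihood estimator of sigma."""
    m = sum(y) / len(y)
    return math.sqrt(sum((v - m) ** 2 for v in y) / len(y))

def sd_corrected(y):
    """Divides by N - 1: the usual bias-corrected variance, square-rooted."""
    m = sum(y) / len(y)
    return math.sqrt(sum((v - m) ** 2 for v in y) / (len(y) - 1))

# Average each estimator over many simulated data sets with true sigma = 1.
sims = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20_000)]
avg_mle = sum(sd_mle(y) for y in sims) / len(sims)
avg_corr = sum(sd_corrected(y) for y in sims) / len(sims)
print(avg_mle, avg_corr)
```

This is exactly the "distribution of an estimator under varying data samples" view: the same estimator is recomputed across many data sets and its average behavior examined. (Even the N - 1 version of the standard deviation remains slightly biased low; it's the variance, not its square root, that the correction makes unbiased.)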
For posteriors, the basis is a distribution of parameters conditional on data, i.e., p(sigma | y[1:N]). This is a distribution over sigma, which is now being considered "random" unlike in the frequentist version. You can ask things like what's the probability that sigma is in some interval given the posterior density, which is not legal in the frequentist methodology. Note that this doesn't mean Bayesians believe there isn't a true value for sigma, just that we treat randomness as representing epistemic uncertainty.
And while the distribution of an estimator of a parameter might look like a Bayesian posterior, they're not interpreted that way. The estimator distribution is used to ask a question about how likely it would be to get an estimate as extreme as you got from random data (that's how p-values are defined). The posterior distribution is used to ask questions like the probability of the true value falling in an interval.
Right, computation and interpretation are (and should be) different (likelihood vs. posterior probability), but it is still hard to ignore the basically 1:1 correspondence between the values, and the other properties/behaviors of the p-value and pd (for the lay user, at the very least, and I think that is who we were aiming for here)... Thanks for engaging! Definitely got a lot to mull over!
On Feb 28, 2020, at 3:17 PM, Mattan S. Ben-Shachar ***@***.***> wrote:
Right, computation and interpretation are (and should be) different (likelihood vs posterior prob),
The p-value is based on the distribution of the estimator, which is a proper random variable because it's a function of the data, which is random. This is why the bootstrap works, for example.
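A bare-bones bootstrap sketch of that idea, with toy data and the sample mean as the estimator: because the estimator is a function of random data, re-estimating on resampled data approximates its sampling distribution:

```python
import random

random.seed(7)
y = [random.gauss(0.5, 1.0) for _ in range(40)]

def mean(v):
    return sum(v) / len(v)

# The bootstrap: re-compute the estimator on data resampled with
# replacement; the resulting values approximate its sampling distribution.
boot = [mean(random.choices(y, k=len(y))) for _ in range(5_000)]

# The spread of the bootstrap distribution approximates the standard
# error of the estimator (here, close to 1 / sqrt(40) = 0.158).
se = (sum((b - mean(boot)) ** 2 for b in boot) / len(boot)) ** 0.5
print(mean(y), se)
```

Nothing here requires a prior or a posterior; it is purely the frequentist "estimator over random data" picture described above.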
In contrast, the likelihood isn't even a probability distribution---it's a function of parameters for fixed data.
but it is still hard to ignore the basically 1:1 correspondence between the values, and other properties / behaviors of the p-value and pd (for the lay user, at the very least, and I think that is what we were aiming for here)...
They're not even distributions over the same thing! This is precisely the kind of analogy that I fear will confuse newcomers.
I'd suggest section 4.5 of Gelman et al.'s Bayesian Data Analysis, which is about Bayesian interpretations of other statistical methods. I'd also recommend section 4.4, which is about frequency calibration of Bayesian methods, something we recommend for Stan, but which is considered bad practice by the so-called objective Bayesians and the so-called subjective Bayesians.
I couldn't figure out how to leave comments on the blog, so I hope you don't mind if I open an issue relating to the post on the p-direction. Apologies in advance if this is obvious and just considered too much for the blog post!
You can compute traditional p-values for Bayesian estimators using the bootstrap. Using the max a posteriori (MAP) estimator will then produce results identical to the traditional p-value derived from penalized maximum likelihood, where the prior is considered the "penalty". But MAP isn't a Bayesian estimator and doesn't have the nice properties of the two common Bayesian estimators, the posterior mean (which minimizes expected square error) and the posterior median (which minimizes expected absolute error). Deriving a point estimate isn't particularly Bayesian, but at least the posterior mean and median have natural probabilistic interpretations, as an expectation and as the point at which 50% probability obtains. With those estimators, results will differ from MAP depending on how skewed the posterior is.
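A small sanity check of those two loss properties on a skewed toy "posterior" (lognormal draws invented for the example, not from any real model):

```python
import random, statistics

random.seed(8)
# A right-skewed "posterior": lognormal draws, for which the mean and
# median differ, so the choice of point estimate matters.
draws = sorted(random.lognormvariate(0.0, 1.0) for _ in range(20_000))

post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)

def exp_sq_loss(point):
    return statistics.fmean((d - point) ** 2 for d in draws)

def exp_abs_loss(point):
    return statistics.fmean(abs(d - point) for d in draws)

# The posterior mean minimizes expected squared error, and the posterior
# median minimizes expected absolute error; cross-checking confirms it.
print(exp_sq_loss(post_mean) <= exp_sq_loss(post_median))
print(exp_abs_loss(post_median) <= exp_abs_loss(post_mean))
```

With this skewed distribution the mean sits well above the median, which is a concrete instance of point estimates diverging as skew grows.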
A bigger issue is that MAP doesn't even exist for our bread-and-butter hierarchical models. The frequentist approach is to use maximum marginal likelihood (this is often called "empirical Bayes" in that the MML estimate is for the hierarchical or "prior" parameters). This leads to underestimates of the uncertainty in lower-level regression coefficients by construction, as you see in packages like lme4 in R.
Part of the point of Bayesian inference is to not have to collapse to a point estimate. When we want to do downstream predictive inference, we don't want to just plug in an estimate, we want to do posterior predictive inference and average over our uncertainty in parameter estimation.
Defining what it means for a prior to be "informative" is tricky, and it wasn't defined in this post. This is particularly vexed because of changes of variables. A flat prior for a probability variable in [0, 1] is very different from a flat prior for a log odds variable in (-infinity, infinity). A flat prior on [0, 1] pushed through the logit transform leads to a standard logistic prior on the log odds. That's not flat. In MLE, changing variables doesn't matter, but it does in Bayes.
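The logit example can be verified numerically. The sketch below pushes uniform draws through the logit and compares against the standard logistic CDF at one point:

```python
import random, math

random.seed(9)

def logit(p):
    return math.log(p / (1 - p))

# Push a flat prior on a probability through the logit transform ...
p_draws = [random.random() for _ in range(50_000)]
logodds = [logit(p) for p in p_draws]

# ... and the result matches a standard logistic distribution, whose CDF
# is 1 / (1 + exp(-x)): P(log-odds < 1) should be about 0.731, which is
# clearly not what a flat density on the log-odds scale would give.
frac_below_1 = sum(x < 1 for x in logodds) / len(logodds)
logistic_cdf_1 = 1 / (1 + math.exp(-1))
print(frac_below_1, logistic_cdf_1)
```

So "flat" is not preserved under a change of variables, which is why "informative" needs to be defined relative to a parameterization.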
I wouldn't say that changing the threshold for significance with regularization is a good thing. While regularization can be good for error control (trading variance for bias), the whole notion of a dichotomous up/down decision through significance is the problem, not the threshold used. Also, we tend to use regularization that is not to zero, but to the population mean. This is also common in frequentist penalized maximum likelihood estimates (see, e.g., Efron and Morris's famous paper on predicting batting averages in baseball using empirical Bayes, which, despite the name, is a frequentist max marginal likelihood method). That's even better for error control than shrinkage to zero, but it's going to have the "wrong" effect on this notion of p-direction unless you talk about the p-direction of the difference from the population estimate, rather than of the random effect itself (that is, you don't want to ask whether Connecticut is significantly different from zero, but whether it's significantly different from the other states).
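A toy illustration of why the reference value matters for pd (all the numbers are invented): the same posterior looks decisive against zero but much less so against the population mean:

```python
import random

random.seed(10)

def p_direction(draws, reference=0.0):
    """Share of draws on the dominant side of a reference value."""
    above = sum(d > reference for d in draws) / len(draws)
    return max(above, 1 - above)

# Toy posterior for one group's effect in a hierarchical model, where
# everything is shrunk toward an assumed population mean of 0.5.
pop_mean = 0.5
group = [random.gauss(0.55, 0.1) for _ in range(10_000)]

# Against zero the group looks overwhelmingly "significant", but the
# interesting question is whether it differs from the population mean.
print(p_direction(group, 0.0), p_direction(group, pop_mean))
```

The first number is essentially 1, while the second is modest, which is the Connecticut-vs-other-states point in miniature.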
P.S. For reference, Gelman et al. use a similar, but not equivalent notion, in calculating posterior predictive p-values in Bayesian Data Analysis, but without flipping signs (so that either values near 0 or near 1 are evidence the model doesn't fit the data well). These are not intended to be used in hypothesis tests, though, just as a diagnostic.