Lab meeting 7/11/11

This week, we talked about how to deal with hyperparameters in Bayesian hierarchical models, by reading the paper “Hyperparameters: optimize, or integrate out?” by David MacKay (in Maximum Entropy and Bayesian Methods, 1996).

The basic setup is as follows.  We have a model for data X with parameters \theta, described by the conditional distribution P(X|\theta), which is the likelihood when considered as a function of \theta.  Regularized estimates for \theta can be obtained by placing a prior over the parameters, P(\theta|\alpha), governed by hyperparameters \alpha.  The paper compares two approaches to inference about \theta, which correspond to two different ways of making a Gaussian approximation to the posterior over \theta.  (Note: both the likelihood and the prior take Gaussian forms in this setup.)  The two methods are:
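
To make the Gaussian structure concrete, here is one standard instance of this setup (a linear-Gaussian model; the design matrix A and noise variance \sigma^2 are illustrative assumptions, not notation from the paper):

P(X|\theta) = N(X; A\theta, \sigma^2 I),    P(\theta|\alpha) = N(\theta; 0, \alpha^{-1} I),

so for any fixed \alpha the posterior is exactly Gaussian,

P(\theta | X, \alpha) = N(\theta; \mu_\alpha, C_\alpha),  with  C_\alpha = (A^T A / \sigma^2 + \alpha I)^{-1}  and  \mu_\alpha = C_\alpha A^T X / \sigma^2.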

  1. Evidence approximation (EA) – finds the hyperparameters \hat \alpha_{ML} that maximize the evidence P(X | \alpha) = \int P(X|\theta) P(\theta|\alpha) d\theta.  The optimized hyperparameters are then “fixed”, so that the posterior takes the form P(\theta | X, \hat \alpha_{ML}) \propto P(X|\theta) P(\theta|\hat \alpha_{ML}).  (Note this is a form of Empirical Bayes estimation, where the prior is estimated from the data and then used to regularize the parameter estimate.)  Since the two terms on the right are Gaussian, the posterior is exactly Gaussian once \alpha is fixed.  This approximation is accurate if the evidence (equivalently, the posterior distribution over \alpha) is tightly concentrated around its maximum.  (A numerical sketch of this procedure appears after this list.)
  2. MAP method – involves integrating out the hyperparameters to obtain the true posterior P(\theta | X) \propto P(X|\theta) \int P(\theta|\alpha) P(\alpha) d\alpha.  (This requires assuming a prior over \alpha; MacKay uses a flat (improper) prior, though one could just as easily use a proper prior.)  A Gaussian approximation is then made using the mode of this (true) posterior and the Hessian of its log at that mode, which is the so-called Laplace approximation.  (It might have been better to call this the “Laplace method” than the “MAP method”.)
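
Here is a minimal numerical sketch of the evidence approximation (method 1), assuming the linear-Gaussian instance written out above; the design matrix A, noise level sigma, and synthetic data are made up for illustration and are not from the paper:

    # Sketch of the evidence approximation for an assumed linear-Gaussian model:
    # X = A theta + noise, noise ~ N(0, sigma^2 I), prior theta ~ N(0, (1/alpha) I).
    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    n, k = 20, 5                      # number of observations, number of parameters
    A = rng.standard_normal((n, k))   # illustrative design matrix
    sigma = 0.5                       # known noise standard deviation
    theta_true = rng.standard_normal(k)
    X = A @ theta_true + sigma * rng.standard_normal(n)

    def log_evidence(log_alpha):
        """log P(X | alpha) = log N(X; 0, sigma^2 I + (1/alpha) A A^T)."""
        alpha = np.exp(log_alpha)
        cov = sigma**2 * np.eye(n) + (1.0 / alpha) * A @ A.T
        return multivariate_normal.logpdf(X, mean=np.zeros(n), cov=cov)

    # Maximize the evidence over alpha (search in log-alpha so it is unconstrained).
    res = minimize_scalar(lambda la: -log_evidence(la), bounds=(-10, 10), method='bounded')
    alpha_hat = np.exp(res.x)

    # With alpha fixed at alpha_hat, the posterior over theta is exactly Gaussian:
    # covariance C = (A^T A / sigma^2 + alpha_hat I)^{-1}, mean = C A^T X / sigma^2.
    C = np.linalg.inv(A.T @ A / sigma**2 + alpha_hat * np.eye(k))
    mu = C @ A.T @ X / sigma**2
    print("alpha_hat =", alpha_hat)
    print("posterior mean =", mu)

The same log-evidence function could be handed to any scalar optimizer; the key point is that once \hat \alpha_{ML} is plugged back in, the posterior is a single Gaussian with the mean and covariance computed in the last few lines.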

The paper claims the latter method is much worse.  The Laplace approximation will in general produce a far-too-narrow posterior whenever the true posterior has a spike at zero, which often arises when the problem is ill-posed (i.e., the likelihood is nearly flat in some directions, placing no constraints on the parameters in that subspace).  There are also arguments about the effective number of free parameters and the “effective \alpha” governing the MAP-method posterior that we didn’t have time to unpack.
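
To see where the spike at zero comes from, take the Gaussian prior P(\theta|\alpha) = N(\theta; 0, \alpha^{-1} I) over k parameters and assume the flat improper prior is placed on \log \alpha (a common noninformative choice; a flat prior on \alpha itself only changes the exponent below).  The integral over \alpha can then be done in closed form:

\int P(\theta|\alpha) P(\alpha) d\alpha \propto \int_0^\infty \alpha^{k/2 - 1} \exp(-\alpha \|\theta\|^2 / 2) d\alpha \propto \|\theta\|^{-k},

so the marginal prior, and hence the true posterior, diverges as \theta \to 0.  When the likelihood leaves some directions unconstrained, the mode of the true posterior sits on (or near) this spike, and the large curvature there is what makes the Laplace fit far too narrow in those directions.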