This week, we talked about how to deal with hyperparameters in Bayesian hierarchical models, by reading the paper “Hyperparameters, optimize or integrate out?” by David MacKay (in *Maximum entropy and Bayesian methods*, 1996).

The basic setup is as follows. We have a model for data $D$ with parameters $w$, described by the conditional distribution $P(D \mid w)$, which is the *likelihood* when considered as a function of $w$. Regularized estimates for $w$ can be obtained by placing a prior over the parameters, $P(w \mid \alpha)$, governed by hyperparameters $\alpha$. The paper focuses on a comparison between two methods for making inferences about $w$, which involve two different methods for making a Gaussian approximation to the posterior $P(w \mid D)$. (Note: both the likelihood and the prior take Gaussian forms in this setup.) The two methods are:

**Evidence approximation (EA)** – finds the hyperparameters $\hat{\alpha}$ that maximize the evidence $P(D \mid \alpha) = \int P(D \mid w)\, P(w \mid \alpha)\, dw$. The optimized hyperparameters are then “fixed” so that the posterior over the parameters $w$ takes the form $P(w \mid D, \hat{\alpha}) \propto P(D \mid w)\, P(w \mid \hat{\alpha})$. *(Note this is a form of Empirical Bayes parameter estimation, where the prior is estimated from the data and then used to regularize the parameter estimate.)* Since the two terms on the right are Gaussian, the posterior is truly Gaussian if $\alpha$ is fixed. This approximation is accurate if the evidence (or the posterior distribution over $\alpha$) is very tightly distributed around its maximum.
**MAP method** – involves integrating out the hyperparameters $\alpha$ to obtain the *true posterior* over the parameters $w$, $P(w \mid D) \propto P(D \mid w) \int P(w \mid \alpha)\, P(\alpha)\, d\alpha$. (This involves assuming a prior over $\alpha$; MacKay uses a flat (improper) prior, though one could as easily assume a proper prior.) A Gaussian approximation is then made using the mode of this (true) log-posterior and the Hessian at that mode, which is the so-called *Laplace approximation*. *(Might have been better to call this the “Laplace method” than the “MAP method”.)*
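
To make the first of these concrete, here is a minimal sketch of the evidence approximation in a toy model that is not from the paper: each observation is $d_i = w_i + \text{noise}$ with known noise variance, and the prior is $w_i \sim \mathcal{N}(0, 1/\alpha)$. The function names `evidence_approximation` and `log_evidence`, and the toy model itself, are assumptions made for illustration. In this model the evidence has a closed-form maximum, so no numerical optimizer is needed.

```python
import numpy as np

def evidence_approximation(d, sigma2):
    """Evidence approximation for the toy model d_i = w_i + noise.

    Assumed (illustrative) model: w_i ~ N(0, 1/alpha), noise ~ N(0, sigma2).
    Marginally d_i ~ N(0, sigma2 + 1/alpha), so the evidence is maximized at
    1/alpha = mean(d**2) - sigma2, clipped at zero.
    """
    d = np.asarray(d, dtype=float)
    prior_var = max(np.mean(d**2) - sigma2, 0.0)      # this is 1 / alpha_hat
    alpha_hat = np.inf if prior_var == 0.0 else 1.0 / prior_var
    shrink = prior_var / (prior_var + sigma2)         # shrinkage factor in [0, 1)
    post_mean = shrink * d                            # posterior mean with alpha fixed
    post_var = shrink * sigma2                        # equals 1 / (alpha_hat + 1/sigma2)
    return alpha_hat, post_mean, post_var

def log_evidence(d, sigma2, alpha):
    """Log of P(d | alpha) for the same toy model (finite alpha > 0)."""
    d = np.asarray(d, dtype=float)
    v = sigma2 + 1.0 / alpha                          # marginal variance of each d_i
    return -0.5 * d.size * np.log(2 * np.pi * v) - 0.5 * np.sum(d**2) / v

# Simulated data: true prior variance 4 (alpha = 0.25), noise variance 1.
rng = np.random.default_rng(0)
w_true = rng.normal(0.0, 2.0, size=50)
d = w_true + rng.normal(0.0, 1.0, size=50)

alpha_hat, post_mean, post_var = evidence_approximation(d, sigma2=1.0)
print("alpha_hat:", alpha_hat)
print("posterior variance per parameter:", post_var)
```

With $\alpha$ fixed at `alpha_hat`, the posterior is exactly Gaussian with the mean and variance returned above; what the approximation discards is the remaining uncertainty about $\alpha$ itself.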

The paper claims the latter method is much worse. The Laplace approximation will in general produce a much-too-narrow posterior if there’s a spike at zero in the true posterior, which will often arise if the problem is ill-posed (i.e., the likelihood is nearly flat in some directions, placing no constraints on the parameters in that subspace). There are also some arguments about the effective number of free parameters and the “effective $\alpha$” governing the MAP-method posterior that we didn’t have time to unpack.
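
To see where a spike at zero can come from, here is a short worked calculation under assumptions that go beyond what is stated above: $k$ parameters $w$ with an isotropic Gaussian prior $P(w \mid \alpha) = \mathcal{N}(0, \alpha^{-1} I_k)$, and an improper flat prior on the hyperparameter $\alpha$. The symbols $k$ and $I_k$ are introduced here for illustration. Integrating out $\alpha$ gives the effective prior on $w$:

$$
P(w) \;\propto\; \int_0^\infty \left(\frac{\alpha}{2\pi}\right)^{k/2} \exp\!\left(-\frac{\alpha}{2}\lVert w\rVert^2\right) d\alpha
\;=\; \frac{\Gamma(k/2+1)}{(2\pi)^{k/2}} \left(\frac{\lVert w\rVert^2}{2}\right)^{-(k/2+1)}
\;\propto\; \lVert w\rVert^{-(k+2)} .
$$

If the prior is instead taken flat over $\log\alpha$, the same integral gives $P(w) \propto \lVert w\rVert^{-k}$. Either way the effective prior diverges at $w = 0$, so wherever the likelihood does not pull the parameters away from the origin, the integrated posterior is sharply peaked there, and a Gaussian fitted to the local curvature of such a peak will be very narrow.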

A criticism: The question addressed is not really “optimize or integrate out?”, but rather “ignore uncertainty in your hyperparameters, or make the Laplace approximation to the true posterior?” (I guess that’s not quite as catchy.) It seems like a hard-core Bayesian would say you should always integrate out (and use the full posterior!). But I’d still be curious to know: are there cases where it *really* is better to maximize than to integrate out (whether using a proper or improper prior)? Methods like ARD and RVM achieve sparsity only by maximizing; they *don’t* give you a sparse model if you integrate.
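
The sparsity-by-maximizing point can be illustrated with a minimal sketch in a toy model, which is an assumption for illustration and not the actual ARD or RVM algorithm: each observation is $d_i = w_i + \text{noise}$ with known noise variance `sigma2`, and each weight gets its own prior precision $\alpha_i$, as in ARD. The function name `ard_toy` is hypothetical. Maximizing the evidence separately over each $\alpha_i$ has a closed form here, and it drives $1/\alpha_i$ to exactly zero whenever $d_i^2 \le \sigma^2$.

```python
import numpy as np

def ard_toy(d, sigma2):
    """Per-weight evidence maximization for the toy model d_i = w_i + noise.

    Assumed (illustrative) model: w_i ~ N(0, 1/alpha_i), noise ~ N(0, sigma2).
    Marginally d_i ~ N(0, sigma2 + 1/alpha_i); the evidence for coordinate i
    is maximized at 1/alpha_i = d_i**2 - sigma2, clipped at zero.
    """
    d = np.asarray(d, dtype=float)
    prior_var = np.maximum(d**2 - sigma2, 0.0)          # 1 / alpha_i_hat per weight
    post_mean = d * prior_var / (prior_var + sigma2)    # posterior mean with alpha_i fixed
    return prior_var, post_mean

d = np.array([0.3, -0.8, 2.5, -4.0])
prior_var, post_mean = ard_toy(d, sigma2=1.0)
print("optimized prior variances:", prior_var)
print("posterior means:", post_mean)
```

The first two observations fall below the noise level, so their optimized prior variance is zero (infinite precision) and the corresponding weights are pruned exactly, while the other two are only shrunk. The exact zeros come from fixing each $\alpha_i$ at a boundary optimum, which is the step that integrating over $\alpha_i$ would not take.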

Aesthetically, I like the prescriptive advice at the end: “when given a choice of which variables to integrate over and which to maximize over, one should integrate over as many variables as possible, in order to capture the relevant volume information”. But I’m not fully convinced—e.g., is this still true regardless of what level of the hierarchy your parameters live at?