Lab meeting 7/11/11

This week we discussed how to handle hyperparameters in hierarchical Bayesian models, reading the paper “Hyperparameters: optimize, or integrate out?” by David MacKay (in Maximum Entropy and Bayesian Methods, 1996).

The basic setup is as follows. We have a model for data X with parameters \theta, described by the conditional distribution P(X|\theta), which is the likelihood when considered as a function of \theta. Regularized estimates for \theta can be obtained by placing a prior over the parameters, P(\theta|\alpha), governed by hyperparameters \alpha. The paper compares two methods for making inferences about \theta, each of which builds a Gaussian approximation to the posterior over \theta in a different way. (Note: both the likelihood and the prior take Gaussian forms in this setup.) The two methods are:

  1. Evidence approximation (EA) – finds the hyperparameters \hat \alpha_{ML} that maximize the evidence P(X | \alpha) = \int P(X|\theta) P(\theta|\alpha) d\theta. The optimized hyperparameters are then “fixed”, so that the posterior takes the form P(\theta | X, \hat \alpha_{ML}) \propto P(X|\theta) P(\theta|\hat \alpha_{ML}). (Note that this is a form of Empirical Bayes parameter estimation, where the prior is estimated from the data and then used to regularize the parameter estimate.) Since the two terms on the right are Gaussian, the posterior is exactly Gaussian once \alpha is fixed. The approximation is accurate if the evidence (or the posterior distribution over \alpha) is very tightly concentrated around its maximum.
  2. MAP method – integrates out the hyperparameters to obtain the true posterior P(\theta | X) \propto P(X|\theta) \int P(\theta|\alpha) P(\alpha) d\alpha. (This requires assuming a prior over \alpha; MacKay uses a flat (improper) prior, though one could just as easily assume a proper prior.) A Gaussian approximation is then made at the mode of this (true) posterior, using the Hessian of the log-posterior at that mode to set the covariance; this is the so-called Laplace approximation. (It might have been better to call this the “Laplace method” than the “MAP method”.)
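
As a rough illustration of the evidence approximation, here is a minimal sketch for a toy case that is not from the paper: each observation is x_i = \theta_i + noise, with prior \theta_i ~ N(0, 1/\alpha) and a known noise variance `sigma2`. The function names, the grid search over \alpha, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def log_evidence(x, alpha, sigma2=1.0):
    """log P(x | alpha) for the toy model x_i = theta_i + noise.

    Assumes theta_i ~ N(0, 1/alpha) and noise ~ N(0, sigma2), so that
    marginally x_i ~ N(0, sigma2 + 1/alpha).
    """
    v = sigma2 + 1.0 / alpha  # marginal variance of each x_i
    return -0.5 * np.sum(x**2) / v - 0.5 * x.size * np.log(2 * np.pi * v)

def evidence_approximation(x, sigma2=1.0, alphas=None):
    """Pick alpha by maximizing the evidence on a grid, then fix it."""
    if alphas is None:
        alphas = np.logspace(-4, 4, 2001)  # hypothetical search grid
    log_ev = np.array([log_evidence(x, a, sigma2) for a in alphas])
    alpha_hat = alphas[np.argmax(log_ev)]

    # With alpha fixed, the posterior over each theta_i is exactly Gaussian.
    prior_var = 1.0 / alpha_hat
    shrink = prior_var / (prior_var + sigma2)
    post_mean = shrink * x       # data shrunk toward the prior mean (zero)
    post_var = shrink * sigma2   # same posterior variance for every theta_i
    return alpha_hat, post_mean, post_var

# Simulated data (assumption): true prior std 2, i.e. true alpha = 0.25.
rng = np.random.default_rng(0)
theta_true = rng.normal(0.0, 2.0, size=200)
x = theta_true + rng.normal(0.0, 1.0, size=200)

alpha_hat, post_mean, post_var = evidence_approximation(x)
print("alpha_hat:", alpha_hat, "posterior variance:", post_var)
```

In this toy model the maximizer is also available in closed form, 1/\hat\alpha = mean(x^2) - \sigma^2 (when that quantity is positive), so the grid search is only there to mirror the "maximize the evidence, then fix \alpha" recipe.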

The paper claims the latter method is much worse.  The Laplace approximation will in general produce a much-too-narrow posterior if there’s a spike at zero in the true posterior, which will often arise if the problem is ill-posed  (i.e., the likelihood is nearly flat in some directions, placing no constraints on the parameters in that subspace).  There are some arguments about the effective number of free parameters and “effective \alpha” governing the MAP method posterior that we didn’t have time to unpack.
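
To see where a spike at zero can come from, here is a sketch of the marginal prior under two assumptions that the notes above do not spell out: an isotropic Gaussian prior P(\theta|\alpha) = N(0, \alpha^{-1} I) over k parameters, and an improper prior that is flat over \log \alpha.

```latex
P(\theta)
  = \int_0^\infty \left(\frac{\alpha}{2\pi}\right)^{k/2}
      \exp\!\left(-\frac{\alpha}{2}\,\|\theta\|^2\right) \frac{d\alpha}{\alpha}
  = \frac{\Gamma(k/2)}{\pi^{k/2}}\,\|\theta\|^{-k}
```

Under these assumptions the marginal prior diverges as \|\theta\| \to 0, so the posterior P(\theta|X) \propto P(X|\theta) \|\theta\|^{-k} has a sharp peak near the origin whenever the likelihood does not pull \theta away from zero. A Gaussian fitted to the curvature at such a peak would be very narrow.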

1 thought on “Lab meeting 7/11/11”

  1. A criticism: The question addressed is not really “optimize or integrate out?”, but rather “ignore uncertainty in your hyperparameters, or make the Laplace approximation to the true posterior?” (I guess that’s not quite as catchy). It seems like a hard-core Bayesian would say you should always integrate out (and use the full posterior!). But I’d still be curious to know: are there cases where it *really* is better to maximize than to integrate out (whether using a proper or improper prior)? Methods like ARD and RVM achieve sparsity only by maximizing; they *don’t* give you a sparse model if you integrate.

    Aesthetically, I like the prescriptive advice at the end: “when given a choice of which variables to integrate over and which to maximize over, one should integrate over as many variables as possible, in order to capture the relevant volume information”. But I’m not fully convinced. For example, is this still true regardless of what level of the hierarchy your parameters live at?
