During the last lab meeting, we talked about using *expectation propagation (EP),* an approximate Bayesian inference method, to fit generalized linear models (Poisson-GLMs) under Gaussian and Laplace (or exponential) priors on the filter coefficients. Both priors give rise to log-concave posteriors, and the Laplace prior has the useful property that the MAP estimate is often sparse (i.e., many weights are exactly zero). EP attempts to find the posterior mean, which is not (ever) sparse, however.

Bayesian inference under a Laplace prior is quite challenging. Unfortunately, our best friend the *Laplace approximation** *is intractable, since the prior is non-differentiable at zero.

EP is one of many well-known methods for obtaining a Gaussian approximation to an intractable posterior distribution. It aims to minimize the KL divergence between the true posterior and an approximating Gaussian. If the approximating distribution is in the exponential family, moment-matching minimizes the KL divergence. However, direct minimization is typically intractable due to the difficulty of computing the moments of the true (high-dimensional) posterior. Instead, EP represents the posterior with a product of Gaussian “site potentials”, one for each term in the likelihood, each of which can be updated sequentially. (EP works well when the true posterior involves a product of independent likelihood terms and a prior.)

The main three steps in EP are as follows: (i) given site approximations (and the current approximate Gaussian posterior), we compute the cavity distribution by leaving out one of the site potentials; (2) we then compute the tilted distribution which is a product of the cavity distribution and the true factor from the model; and (3) finally we estimate site parameters such that the new Gaussian posterior has the same mean and covariance as the tilted distribution. This description sounds complicated; however, it’s just a simple 1-d update in the Gaussian posterior mean and covariance.

Some might wonder how to set the hyperparameter (in the Laplace prior). One can do cross-validation since there is only one hyperparameter in this case; however, one can do 0-th moment (normalizer) matching between the tilted distribution and the new Gaussian posterior to set it. (this is a real messy part, in my opinion.)

After going over the technical details mentioned above, we looked at the results in the papers. Long story short, in terms of MSE, the posterior mean of the approximate Gaussian posterior obtained by EP performed better than the numerically optimized MAP estimate (as it should be). However, what’s bizarre to me is that the posterior mean from EP was better than the MAP estimate in terms of prediction performance (measured by KL distance of the estimated model from ground truth), when the weights are truly* sparse*. Isn’t posterior mean usually less sparse than MAP, if the likelihood is non-symmetric around zero? If so, shouldn’t the MAP estimate be better than the posterior mean of EP estimate if the weights are truly *sparse*?

Papers:

http://infoscience.epfl.ch/record/161312/files/ecml07.pdf?version=1

http://bethgelab.org/media/publications/fncom_2010_00012.pdf

If you’re coming up with a single model fit for prediction, you want the one that makes predictions closest to what Bayesian model averaging would do. That model may itself be implausible! In the Netflix challenge, competitors predicted integer movie ratings with fractional values, because that’s the way to minimize expected RMSE. Here, a non-sparse fit can predict the outcome of an unknown sparse model, because that may also be the way to minimize expected loss.

Iain, thanks for the comment! The perplexing thing here is that prediction metric is not RMSE, but the cross-validated log-likelihood (i.e., log-likelihood on test data), conditioned on the parameter estimate. I would have thought that the MAP estimate would have been better than the Bayesian least squares estimate here (since it maximizes penalized log-likelihood on the training data).

(But maybe your response is that conditioning on the BLS estimate still gives you something closer to the predictive distribution?)

The overall pattern of performance shown in the paper doesn’t fully make sense to me. For example, Fig 3B shows that for weights sampled from a Laplace distribution, you get best (test log-likelihood) performance performance from a MAP estimate under a Gaussian prior. Huh? (The text says something about mismatched hyperparameters, but this still seems pretty weird.) It’s possible I once understood this but have since forgotten.

Agreed, optimizing loss on the training set is often a first sensible step if wanting low loss at test time. However, if the sparse prior is important, fitting is presumably prone to overfitting, and it can make sense to consider what the optimal Bayesian solution might look like. As a thought experiment, consider the following procedure [1] that would find the model best matching the Bayesian predictive distribution: 1) sample the hell out of the posterior; 2) generate large amounts of synthetic data using each of your posterior parameters in turn; and 3) do a maximum likelihood fit to the synthetic data. The maximum likelihood fit isn’t penalized, you can have as much synthetic data as you want, so don’t need to include a prior penalization. Given that you don’t do penalization, none of your parameters are going to be zero, unless you had enough real data to be totally sure of that in the parameter posterior.

[1] Snelson and Ghahramani (2005) http://www.gatsby.ucl.ac.uk/~snelson/Bayes_pred.pdf ICML 2005, 840–847.

This paper mentions the posterior mean as an approximation to the “Bayes point”, the best single setting of the parameters. Of course the posterior mean depends on parametrization, so you better be working in a parametrization where the mean is a good central point for approximating the integral you should be doing. The approximation that you’re doing to an integral over the posterior parameters is crude: quadrature with a single knot. The MAP point in these sparse models will be at some corner of the domain of the integral over parameters, so probably a bad location for your single knot.

(I haven’t made time to read the paper you’re discussing, so don’t have comments on the final part of your comment.)

Thanks! That’s very helpful. I hadn’t heard of “Bayes’ points” before, but the idea makes good intuitive sense. (Iain, you may be unwittingly helping solve the brain — I hope the thought doesn’t alarm you too much!)

Snelson and Ghahramani (2005) paper looks interesting; maybe i would present their work next time!