During the last lab meeting, we talked about using expectation propagation (EP), an approximate Bayesian inference method, to fit Poisson generalized linear models (Poisson-GLMs) under Gaussian and Laplace (double-exponential) priors on the filter coefficients. Both priors give rise to log-concave posteriors, and the Laplace prior has the useful property that the MAP estimate is often sparse (i.e., many weights are exactly zero). EP, however, targets the posterior mean, which is never exactly sparse.
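To make the MAP-versus-posterior-mean distinction concrete, here is a minimal 1-d sketch. It is a toy of my own, not the setup from the papers: a single weight w, one Gaussian observation y ~ N(w, sigma^2) instead of a Poisson likelihood, and a Laplace prior proportional to exp(-lam*|w|). The function name and the values of `y`, `sigma` and `lam` are made up for illustration. In this toy the MAP estimate has a closed form (soft thresholding), and the posterior mean is computed by brute-force summation on a grid.

```python
import numpy as np

def map_and_posterior_mean(y, sigma=1.0, lam=1.0):
    """Toy 1-d model (illustrative assumption, not the papers' GLM):
    y ~ N(w, sigma^2), prior p(w) proportional to exp(-lam * |w|)."""
    # MAP estimate: minimizing (y - w)^2 / (2 sigma^2) + lam * |w|
    # gives soft thresholding at lam * sigma^2.
    w_map = np.sign(y) * max(abs(y) - lam * sigma**2, 0.0)

    # Posterior mean: normalize the unnormalized posterior on a fine grid.
    grid = np.linspace(-20.0, 20.0, 400001)
    log_post = -0.5 * (y - grid) ** 2 / sigma**2 - lam * np.abs(grid)
    p = np.exp(log_post - log_post.max())  # subtract max for stability
    p /= p.sum()
    w_mean = float(np.sum(grid * p))
    return float(w_map), w_mean

w_map, w_mean = map_and_posterior_mean(y=0.5)
print("MAP:", w_map, "posterior mean:", w_mean)
```

With a small observation the MAP estimate is thresholded to exactly zero, while the posterior mean is shrunk toward zero but stays nonzero.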
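Bayesian inference under a Laplace prior is quite challenging. Unfortunately, our best friend the Laplace approximation cannot be applied directly: the prior is non-differentiable at zero, so the Hessian of the log-posterior is not defined whenever the MAP estimate has a weight at exactly zero.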
EP is one of many well-known methods for obtaining a Gaussian approximation to an intractable posterior distribution. It aims to minimize the KL divergence KL(p || q) from the true posterior p to an approximating Gaussian q. If the approximating distribution is in the exponential family, this KL divergence is minimized by moment matching (for a Gaussian, matching the mean and covariance). However, direct minimization is typically intractable due to the difficulty of computing the moments of the true (high-dimensional) posterior. Instead, EP represents the posterior with a product of Gaussian “site potentials”, one for each term in the likelihood, each of which can be updated sequentially. (EP works well when the true posterior involves a product of independent likelihood terms and a prior.)
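In symbols, the two ideas in this paragraph look roughly as follows. The notation (p, q, p_0, f_i and the Gaussian sites \tilde f_i) is my own shorthand rather than the papers', and I write the prior p_0 as if it were Gaussian; with a Laplace prior, the prior terms would presumably have to be treated as additional non-Gaussian factors with their own sites.

```latex
% Moment matching: the Gaussian q closest to p in KL(p || q)
% has the same mean and covariance as p.
\[
q^\star = \arg\min_{q \,\text{Gaussian}} \mathrm{KL}(p \,\|\, q)
\quad\Longleftrightarrow\quad
\mathbb{E}_{q^\star}[w] = \mathbb{E}_{p}[w], \qquad
\mathrm{Cov}_{q^\star}[w] = \mathrm{Cov}_{p}[w].
\]

% Site approximation: each non-Gaussian factor f_i is replaced
% by an (unnormalized) Gaussian site potential \tilde f_i.
\[
p(w \mid D) \;\propto\; p_0(w) \prod_{i} f_i(w)
\;\approx\;
q(w) \;\propto\; p_0(w) \prod_{i} \tilde f_i(w).
\]
```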
The three main steps in EP are as follows: (1) given the site approximations (and the current approximate Gaussian posterior), we compute the cavity distribution by leaving out one of the site potentials; (2) we then compute the tilted distribution, which is the product of the cavity distribution and the true factor from the model; and (3) finally we estimate the site parameters such that the new Gaussian posterior has the same mean and covariance as the tilted distribution. This description sounds complicated; however, each site update boils down to a simple 1-d update of the Gaussian posterior mean and covariance.
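The three steps can be sketched in a few lines for a deliberately tiny case. This is a toy under my own assumptions, not the algorithm from the papers: a single scalar weight w with a Gaussian prior N(0, prior_var), Poisson observations with an exponential nonlinearity, y_i ~ Poisson(exp(x_i * w)), and tilted moments computed by brute-force summation on a grid rather than by anything clever. All names (`ep_poisson_1d`, `exact_moments`, the data values) are hypothetical.

```python
import numpy as np

GRID = np.linspace(-10.0, 10.0, 20001)  # grid for 1-d numerical moments

def grid_moments(log_density):
    """Mean and variance of an unnormalized log-density evaluated on GRID."""
    p = np.exp(log_density - log_density.max())
    p /= p.sum()
    mean = np.sum(GRID * p)
    var = np.sum((GRID - mean) ** 2 * p)
    return mean, var

def poisson_loglik(x_i, y_i):
    """Log-likelihood of one observation, up to a constant in w."""
    return y_i * x_i * GRID - np.exp(x_i * GRID)

def ep_poisson_1d(x, y, prior_var=1.0, n_sweeps=20):
    n = len(y)
    site_tau = np.zeros(n)  # site precisions
    site_nu = np.zeros(n)   # site precision-times-mean
    tau, nu = 1.0 / prior_var, 0.0  # q starts at the zero-mean prior
    for _ in range(n_sweeps):
        for i in range(n):
            # (1) cavity: remove site i from the current posterior
            tau_c, nu_c = tau - site_tau[i], nu - site_nu[i]
            # (2) tilted: cavity times the true likelihood factor
            log_tilted = (-0.5 * tau_c * GRID**2 + nu_c * GRID
                          + poisson_loglik(x[i], y[i]))
            m, v = grid_moments(log_tilted)
            # (3) choose site i so the new q matches the tilted moments
            site_tau[i] = 1.0 / v - tau_c
            site_nu[i] = m / v - nu_c
            tau, nu = tau_c + site_tau[i], nu_c + site_nu[i]
    return nu / tau, 1.0 / tau  # posterior mean and variance

def exact_moments(x, y, prior_var=1.0):
    """Reference answer: exact posterior moments on the same grid."""
    log_post = -0.5 * GRID**2 / prior_var
    for x_i, y_i in zip(x, y):
        log_post = log_post + poisson_loglik(x_i, y_i)
    return grid_moments(log_post)

x = np.array([1.0, -0.5, 0.8, 1.2, -1.0])  # made-up stimuli
y = np.array([2, 0, 1, 3, 0])              # made-up spike counts
ep_mean, ep_var = ep_poisson_1d(x, y)
true_mean, true_var = exact_moments(x, y)
print("EP:   ", ep_mean, ep_var)
print("exact:", true_mean, true_var)
```

In one dimension the exact posterior is cheap to compute, so the comparison at the end is only a sanity check. In the real multi-dimensional problem each site still depends on w through a 1-d projection, which is why the moment computation stays one-dimensional there too.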
Some might wonder how to set the hyperparameter (in the Laplace prior). One can do cross-validation, since there is only one hyperparameter in this case; alternatively, one can set it by matching the 0-th moment (normalizer) between the tilted distribution and the new Gaussian posterior. (This is the really messy part, in my opinion.)
After going over the technical details mentioned above, we looked at the results in the papers. Long story short, in terms of MSE, the posterior mean of the approximate Gaussian posterior obtained by EP performed better than the numerically optimized MAP estimate (as it should, since the posterior mean is the estimator that minimizes expected squared error). However, what’s bizarre to me is that the posterior mean from EP was also better than the MAP estimate in terms of prediction performance (measured by the KL divergence of the estimated model from ground truth) when the weights are truly sparse. Isn’t the posterior mean usually less sparse than the MAP estimate, if the likelihood is non-symmetric around zero? If so, shouldn’t the MAP estimate be better than the EP posterior mean when the weights are truly sparse?