# Bayesian inference for Poisson-GLM with Laplace prior

During the last lab meeting, we talked about using expectation propagation (EP), an approximate Bayesian inference method, to fit generalized linear models (Poisson-GLMs) under Gaussian and Laplace (or exponential) priors on the filter coefficients. Both priors give rise to log-concave posteriors, and the Laplace prior has the useful property that the MAP estimate is often sparse (i.e., many weights are exactly zero). However, EP targets the posterior mean, which is never sparse.

Bayesian inference under a Laplace prior is quite challenging. Unfortunately, our best friend the Laplace approximation breaks down here: the log-prior is non-differentiable at zero, which is exactly where the MAP estimate tends to place many of the weights, so the required Hessian is undefined at the mode.
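The sparsity of the Laplace-prior MAP estimate is easy to see numerically. Below is a minimal sketch (not EP, and not from any particular paper) that computes the MAP filter for a simulated Poisson-GLM via proximal gradient descent (ISTA); the soft-thresholding step handles the non-differentiability at zero and yields weights that are exactly zero. The problem sizes, penalty strength `lam`, and step size are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a Poisson-GLM: y ~ Poisson(exp(X @ w_true)), with a sparse filter.
n, d = 500, 10
X = rng.normal(scale=0.3, size=(n, d))
w_true = np.zeros(d)
w_true[[1, 4]] = [1.0, -0.8]          # only two nonzero coefficients
y = rng.poisson(np.exp(X @ w_true))

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty t*||v||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def map_laplace(X, y, lam=20.0, step=1e-3, n_iter=5000):
    """MAP estimate under a Laplace prior via proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (np.exp(X @ w) - y)          # gradient of the Poisson NLL
        w = soft_threshold(w - step * grad, step * lam)
    return w

w_map = map_laplace(X, y)
print(np.round(w_map, 3))   # many coefficients are exactly zero
```

Note that the exact zeros come from the proximal step, not from rounding; a plain gradient method applied to the penalized objective would never land exactly on zero.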

# Lab meeting 7/11/11

This week, we talked about how to deal with hyperparameters in Bayesian hierarchical models, by reading the paper “Hyperparameters: optimize, or integrate out?” by David MacKay (in Maximum Entropy and Bayesian Methods, 1996).

The basic setup is as follows.  We have a model for data $X$ with parameters $\theta$, described by the conditional distribution $P(X|\theta)$, which is the likelihood when considered as a function of $\theta$.  Regularized estimates for $\theta$ can be obtained by placing a prior over the parameters, $P(\theta|\alpha)$, governed by hyperparameters $\alpha$.  The paper focuses on a comparison between two methods for making inferences about $\theta$, which involve two different ways of making a Gaussian approximation to the posterior $P(\theta | X, \alpha)$.  (Note: both the likelihood and the prior take Gaussian forms in this setup.)  The two methods are:

1. Evidence approximation (EA) – finds the hyperparameters $\hat \alpha_{ML}$ that maximize the evidence $P(X | \alpha) = \int P(X|\theta) P(\theta|\alpha) d\theta$.  The optimized hyperparameters are then “fixed”, so that the posterior takes the form $P(\theta | X, \hat \alpha_{ML}) \propto P(X|\theta) P(\theta|\hat \alpha_{ML})$.  (Note this is a form of Empirical Bayes parameter estimation, where the prior is estimated from the data and then used to regularize the parameter estimate.)  Since the two terms on the right are Gaussian, the posterior is truly Gaussian once $\alpha$ is fixed.  This approximation is accurate when the evidence (or the posterior distribution over $\alpha$) is tightly concentrated around its maximum.
2. MAP method – involves integrating out the hyperparameters to obtain the true posterior $P(\theta | X) \propto P(X|\theta) \int P(\theta|\alpha) P(\alpha) d \alpha$.  (This requires assuming a prior over $\alpha$; MacKay uses a flat, improper prior, though one could just as easily assume a proper one.)  A Gaussian approximation is then made using the mode and Hessian of this (true) log-posterior around its maximum, which is the so-called Laplace approximation.  (It might have been better to call this the “Laplace method” than the “MAP method”.)

The paper claims the latter method is much worse.  The Laplace approximation will in general produce a much-too-narrow posterior when the true posterior has a spike at zero, which often arises when the problem is ill-posed (i.e., the likelihood is nearly flat in some directions, placing no constraints on the parameters in that subspace).  There are also some arguments about the effective number of free parameters and the “effective $\alpha$” governing the MAP-method posterior that we didn’t have time to unpack.
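As a concrete (toy, not-from-the-paper) illustration of the evidence approximation, the sketch below computes the closed-form log evidence $\log P(y|\alpha)$ for a linear-Gaussian regression model with unit observation noise and prior $\theta \sim N(0, \alpha^{-1} I)$, then maximizes it over a grid of $\alpha$ values. The dimensions, seed, and $\alpha_{true}$ are all arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-Gaussian model: y = X @ theta + noise (variance 1),
# with prior theta ~ N(0, (1/alpha) I).  All sizes are illustrative.
n, d = 100, 20
alpha_true = 4.0
X = rng.normal(size=(n, d))
theta = rng.normal(scale=alpha_true ** -0.5, size=d)
y = X @ theta + rng.normal(size=n)

def log_evidence(alpha):
    """Closed-form log P(y | alpha) for the linear-Gaussian model."""
    A = X.T @ X + alpha * np.eye(d)          # posterior precision
    m = np.linalg.solve(A, X.T @ y)          # posterior mean
    E = 0.5 * np.sum((y - X @ m) ** 2) + 0.5 * alpha * (m @ m)
    _, logdetA = np.linalg.slogdet(A)
    return (0.5 * d * np.log(alpha) - E - 0.5 * logdetA
            - 0.5 * n * np.log(2 * np.pi))

# Evidence approximation: maximize P(y | alpha) over a grid, then "fix" alpha-hat.
grid = np.logspace(-2, 2, 200)
alpha_hat = grid[np.argmax([log_evidence(a) for a in grid])]
print(alpha_hat)
```

With $\hat\alpha$ fixed, the posterior $P(\theta | y, \hat\alpha)$ is exactly Gaussian with precision `A` and mean `m` evaluated at `alpha_hat`, which is the sense in which EA's Gaussian posterior is "truly Gaussian" above.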

# Lab meeting 6/20/2011

Today I presented a paper from Liam’s group: “A new look at state-space models for neural data”, Paninski et al., JCNS 2009.

The paper presents a high-level overview of state-space models for neural data, with an emphasis on statistical inference methods.  The basic setup of these models is the following:

• Latent variable $Q$, defined by the dynamics distribution $P(q_{t+1}|q_t)$.
• Observed variable $Y$, defined by the observation distribution $P(y_t | q_t)$.

These two ingredients ensure that the joint probability of latents and observed variables is
$P(Q,Y) = P(q_1 ) P(y_1|q_1) \prod_{t=2}^T P(y_t | q_t) P(q_{t}|q_{t-1})$.
A variety of applications are illustrated (e.g., $Q$ = common input noise; $Y$ = multi-neuron spike trains).
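To make the factorization concrete, here is a small sketch that evaluates the log joint for an assumed scalar linear-Gaussian instance (AR(1) dynamics with direct noisy observations); the parameters `a`, `sq`, `sy` are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm

# Illustrative linear-Gaussian instance of the state-space factorization:
#   q_{t+1} = a*q_t + dynamics noise (std sq),  y_t = q_t + obs noise (std sy).
a, sq, sy = 0.9, 0.5, 1.0   # assumed parameters

def log_joint(q, y):
    """log P(Q,Y) = log P(q_1) + sum_t log P(y_t|q_t) + sum_t log P(q_t|q_{t-1})."""
    lp = norm.logpdf(q[0], 0.0, 1.0)                 # initial-state term P(q_1)
    lp += norm.logpdf(y, q, sy).sum()                # observation terms
    lp += norm.logpdf(q[1:], a * q[:-1], sq).sum()   # dynamics terms
    return lp

# Simulate one trajectory and score it.
rng = np.random.default_rng(2)
T = 50
q = np.zeros(T)
q[0] = rng.normal()
for t in range(1, T):
    q[t] = a * q[t - 1] + rng.normal(scale=sq)
y = q + rng.normal(scale=sy, size=T)
print(log_joint(q, y))
```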

The two problems we’re interested in solving, in general, are:
(1) Filtering / Smoothing:  inferring $Q$ from noisy observations $Y$, given the model parameters $\theta$.
(2) Parameter Fitting: inferring $\theta$ from observations $Y$.

The “standard” approach to these problems involves: (1) recursive approximate inference, updating a Gaussian approximation to $P(q_t|Y)$ via its first two moments; and (2) Expectation-Maximization (EM) for inferring $\theta$.  By contrast, this paper emphasizes: (1) exact maximization for $Q$, which is tractable in $O(T)$ via Newton’s method, due to the banded nature of the Hessian; and (2) direct inference for $\theta$ using the Laplace approximation to $P(Y|\theta)$.  When the dynamics are linear and the noise is Gaussian, the two methods are exactly the same (since a Gaussian’s maximum is the same as its mean; the forward and backward recursions in Kalman filtering/smoothing are the same set of operations needed by Newton’s method). But for non-Gaussian noise or non-linear dynamics, the latter method may (the paper argues) provide much more accurate answers at approximately the same computational cost.
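A minimal sketch of the banded-Hessian idea in the linear-Gaussian case: for an assumed scalar AR(1) model with direct noisy observations, the log posterior over $Q$ is an exact quadratic with a tridiagonal precision matrix, so the MAP path (which here equals the Kalman-smoother mean) comes from a single banded solve in $O(T)$. All parameter values are illustrative.

```python
import numpy as np
from scipy.linalg import solveh_banded

# Assumed model: q_1 ~ N(0,1), q_t = a*q_{t-1} + N(0, sq^2), y_t = q_t + N(0, sy^2).
a, sq, sy = 0.9, 0.5, 1.0
rng = np.random.default_rng(3)
T = 1000
q = np.zeros(T)
q[0] = rng.normal()
for t in range(1, T):
    q[t] = a * q[t - 1] + rng.normal(scale=sq)
y = q + rng.normal(scale=sy, size=T)

# Negative log posterior is (1/2) Q' P Q - b' Q + const with tridiagonal P:
prec_d = np.full(T, 1/sq**2 + a**2/sq**2 + 1/sy**2)   # interior diagonal
prec_d[0] = 1.0 + a**2/sq**2 + 1/sy**2                # prior N(0,1) on q_1
prec_d[-1] = 1/sq**2 + 1/sy**2                        # no dynamics term after q_T
off = np.full(T - 1, -a / sq**2)                      # super/sub-diagonal

# solveh_banded takes the upper band: row 0 = superdiagonal, row 1 = diagonal.
ab = np.zeros((2, T))
ab[0, 1:] = off
ab[1, :] = prec_d
q_map = solveh_banded(ab, y / sy**2)   # one O(T) banded solve = the MAP path
```

For one-dimensional states Newton’s method converges in a single step here (the objective is quadratic), so this one banded solve is the whole computation; the non-Gaussian case simply repeats it with the local gradient and banded Hessian until convergence.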

Key ideas of the paper are:

• exact maximization of a log-concave posterior
• $O(T)$ computational cost, due to sparse (tridiagonal or banded) Hessian.
• the Laplace approximation (Gaussian approximation to the posterior using its maximum and second-derivative matrix), which is (more likely to be) justified for log-concave posteriors
• log-barrier method for constrained problems (which preserves sparsity)

Next week: we’ll do a basic tutorial on Kalman Filtering / Smoothing (and perhaps, EM).