Bayesian inference for Poisson-GLM with Laplace prior

During the last lab meeting, we talked about using expectation propagation (EP), an approximate Bayesian inference method, to fit Poisson generalized linear models (Poisson-GLMs) under Gaussian and Laplace (double-exponential) priors on the filter coefficients. Both priors give rise to log-concave posteriors, and the Laplace prior has the useful property that the MAP estimate is often sparse (i.e., many weights are exactly zero).  EP, however, approximates the posterior mean, which is never exactly sparse.
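For concreteness, here is a minimal sketch (not the code we discussed in lab) of MAP estimation for a Poisson-GLM with exponential nonlinearity under a Laplace prior, via proximal gradient descent; the soft-threshold step is what drives many weights to exactly zero. The toy design matrix, the penalty strength lam, and the step size lr are all made up for illustration.

```python
import numpy as np

def poisson_glm_map_laplace(X, y, lam=0.15, lr=0.02, n_iter=5000):
    """MAP estimate for a Poisson-GLM (exp nonlinearity) with a Laplace prior.

    Minimizes (1/n) * negative log-likelihood + lam * ||w||_1
    by proximal gradient descent; the L1 term is handled by soft-thresholding.
    """
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        rate = np.exp(X @ w)                    # conditional intensity
        grad = X.T @ (rate - y) / n             # gradient of the smooth (likelihood) part
        w = w - lr * grad                       # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft threshold (prox of L1)
    return w

# Toy example: a sparse true filter; most recovered weights come out exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
w_true = np.zeros(20)
w_true[:3] = [0.5, -0.3, 0.4]
y = rng.poisson(np.exp(X @ w_true))
w_map = poisson_glm_map_laplace(X, y)
print("exact zeros:", int(np.sum(w_map == 0.0)), "out of", w_map.size)
```

Swapping the soft-threshold line for a ridge-style shrinkage step, w / (1 + lr * lam), gives the Gaussian-prior MAP instead, which shrinks weights toward zero but never zeroes them out.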

Bayesian inference under a Laplace prior is quite challenging. Unfortunately, our best friend the Laplace approximation is not directly applicable, since the log-posterior is non-differentiable wherever a coefficient is exactly zero, which is precisely where the MAP estimate tends to sit.
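To spell out the difficulty in our notation (a sketch, not from the discussion above): the Laplace approximation replaces the posterior with a Gaussian centered at the MAP estimate, P(\theta|X) \approx N(\theta_{MAP}, H^{-1}), where H = -\nabla^2 \log P(\theta|X) evaluated at \theta_{MAP}. Under a Laplace prior the log-prior contributes -\lambda \sum_i |\theta_i|, which has no second derivative at \theta_i = 0; since the MAP estimate typically sets many \theta_i exactly to zero, the Hessian H is undefined exactly where we need it.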


Lab meeting 06/25/12

In the paper “Automating the design of informative sequences of sensory stimuli” (by Lewi, Schneider, Woolley & Paninski, JCNS 2011), the authors developed an algorithm to adaptively select stimuli during real-time sensory neurophysiology experiments. Given a set of already recorded responses, their algorithm determines which stimuli to present next so that the resulting data provide as much information as possible about the structure of the receptive field.

Unlike their previous paper (NC 09), here they focused on selecting informative stimulus “sequences” (or batches), which makes it possible to preserve temporal or other correlations in the stimuli.  They denoted the sequence length by b and considered two cases: b finite, and b going to infinity. In both cases, selecting a sequence of stimuli turns out to be computationally challenging, so they used Jensen’s inequality to derive tractable lower bounds on the expected information gain. When b goes to infinity, they restricted the stimulus distribution to be Gaussian to make the high-dimensional optimization (over stimulus distributions) tractable.  They applied the algorithm to real songbird auditory responses, and showed that the chosen stimulus sequences decreased the error significantly faster than i.i.d. experimental designs.
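As a much-simplified illustration of the underlying idea (a single-stimulus, b = 1 greedy rule, not the batch algorithm in the paper): if the posterior over the receptive field is approximated by a Gaussian N(mu, C) and the neuron is modeled as a Poisson-GLM with exponential nonlinearity, then one observation of stimulus x adds roughly rate * x x^T to the posterior precision (the Fisher information), so the expected information gain is about 0.5 * log(1 + rate * x^T C x). The plug-in rate exp(x^T mu) and the candidate-set setup below are our own simplifications.

```python
import numpy as np

def pick_next_stimulus(candidates, mu, C):
    """Greedy infomax choice: score each candidate stimulus by an
    approximate expected information gain under a Gaussian posterior N(mu, C)."""
    rates = np.exp(candidates @ mu)                                    # plug-in firing rates
    variances = np.einsum('ij,jk,ik->i', candidates, C, candidates)    # x^T C x per candidate
    info_gain = 0.5 * np.log1p(rates * variances)                      # 0.5 * log(1 + rate * x^T C x)
    return candidates[np.argmax(info_gain)]

# Toy usage: 200 candidate stimuli in a 10-dimensional stimulus space.
rng = np.random.default_rng(1)
mu = rng.normal(size=10) * 0.1
C = np.eye(10)
candidates = rng.normal(size=(200, 10))
x_next = pick_next_stimulus(candidates, mu, C)
```

Greedily picking the best-scoring stimulus and updating (mu, C) after each response is roughly the b = 1 strategy of their earlier work that this paper extends to sequences.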

Lab meeting 7/11/11

This week, we talked about how to deal with hyperparameters in Bayesian hierarchical models, by reading David MacKay’s paper “Hyperparameters: optimize, or integrate out?” (in Maximum Entropy and Bayesian Methods, 1996).

The basic setup is as follows.  We have a model for data X with parameters \theta, described by the conditional distribution P(X|\theta), which is the likelihood when considered as a function of \theta.  Regularized estimates of \theta can be obtained by placing a prior over the parameters, P(\theta|\alpha), governed by hyperparameters \alpha.  The paper compares two methods for making inferences about \theta, each of which amounts to a different way of forming a Gaussian approximation to the posterior over \theta.  (Note: both the likelihood and the prior take Gaussian forms in this setup.)  The two methods are:

  1. Evidence approximation (EA) – finds the hyperparameters \hat \alpha_{ML} that maximize the evidence P(X | \alpha) = \int P(X|\theta) P(\theta|\alpha) d\theta.  The optimized hyperparameters are then “fixed”, so that the posterior takes the form P(\theta | X, \hat \alpha_{ML}) \propto P(X|\theta) P(\theta|\hat \alpha_{ML}).  (Note this is a form of Empirical Bayes, where the prior is estimated from the data and then used to regularize the parameter estimate; a toy sketch of the procedure follows the list.)  Since the two terms on the right are Gaussian, the posterior is exactly Gaussian once \alpha is fixed.  This approximation is accurate when the evidence (or, equivalently, the posterior distribution over \alpha) is tightly concentrated around its maximum.
  2. MAP method – involves integrating out the hyperparameters to obtain the true posterior P(\theta | X) \propto P(X|\theta) \int P(\theta|\alpha) P(\alpha) d\alpha.  (This requires assuming a prior over \alpha; MacKay uses a flat, improper prior, though one could just as easily assume a proper one.)  A Gaussian approximation is then made from the mode and Hessian of this (true) log-posterior, i.e., the Laplace approximation.  (It might have been better to call this the “Laplace method” rather than the “MAP method”.)

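A minimal sketch of the first method for a toy linear-Gaussian model (our own example, not MacKay’s): y = A \theta + noise with known noise variance \sigma^2 and prior \theta \sim N(0, \alpha^{-1} I).  The evidence is available in closed form, so we maximize it over \alpha numerically and then form the fixed-\alpha Gaussian posterior.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, k, sigma = 20, 5, 0.5
A = rng.normal(size=(n, k))
y = A @ rng.normal(size=k) + sigma * rng.normal(size=n)

def neg_log_evidence(log_alpha):
    # After integrating out theta:  y | alpha ~ N(0, A A^T / alpha + sigma^2 I)
    S = A @ A.T / np.exp(log_alpha) + sigma**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + y @ np.linalg.solve(S, y))

# Maximize the evidence over alpha (equivalently, minimize its negative log)
alpha_hat = np.exp(minimize_scalar(neg_log_evidence, bounds=(-10, 10), method='bounded').x)

# With alpha fixed at alpha_hat, the posterior over theta is exactly Gaussian:
Sigma = np.linalg.inv(A.T @ A / sigma**2 + alpha_hat * np.eye(k))   # posterior covariance
mu = Sigma @ A.T @ y / sigma**2                                     # posterior mean
print("alpha_hat =", alpha_hat)
```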
The paper claims the latter method is much worse.  The Laplace approximation will in general produce a much-too-narrow posterior when the true posterior has a spike at zero, which often arises when the problem is ill-posed (i.e., the likelihood is nearly flat in some directions, placing no constraints on the parameters in that subspace).  There are also some arguments about the effective number of free parameters and the “effective \alpha” governing the MAP-method posterior that we didn’t have time to unpack.
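To see where the spike at zero comes from in the Gaussian case (our derivation, assuming the flat prior is placed on \log \alpha): with P(\theta|\alpha) = N(0, \alpha^{-1} I) in k dimensions, integrating out the hyperparameter gives P(\theta) \propto \int_0^\infty \alpha^{k/2 - 1} \exp(-\alpha ||\theta||^2 / 2) d\alpha \propto ||\theta||^{-k}, so the integrated prior (and hence the posterior) diverges at \theta = 0.  When the likelihood is nearly flat in some directions, the MAP estimate gets drawn toward this spike, and a Gaussian fitted there inherits curvature on the order of k / ||\theta||^2, i.e., it is far too narrow.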