Fast Marginal Likelihood Maximisation for Sparse Bayesian Models

In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models, by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for performing maximum marginal likelihood (or “evidence optimization”) in a sparse linear model (also known as the Relevance vector machine).

The basic setup is:

\quad y = \phi(x)^T w + \epsilon, \quad  \epsilon \sim \mathcal{N}(0,\sigma^2)

and an independent zero-mean Gaussian prior on each element of w:

w_i \sim \mathcal{N}(0,1/\alpha_i),

where \alpha_i denotes the prior precision of each weight.

The basic idea is to set the hyperparameters \{\alpha_i\}  by maximizing marginal likelihood P(Y|X,\{\alpha_i\}). The resulting estimate for w under the prior P(w | \{\alpha_i\}) will be sparse because many of the \alpha_i will go to infinity (meaning prior variance for the corresponding weights is set to zero).

The OLD way to optimize this model (“ARD”) is to update hyperparameters \alpha_i and \sigma^2 iteratively, but the draw back is the initial \alpha vector would be a huge, and every matrix operation would be time consuming with large scale data. When an \alpha_i reaches infinity, it’s abandoned and \alpha vector has fewer and fewer elements. Thus it’s a reduction process and any abandoned \alpha would not be re-estimated.

In contrast, THIS paper proposed an adding and modifying process. It initializes with an empty model and sequentially adds basis functions (i.e., sets \alpha_i‘s to something finite) to increase the marginal likelihood. Within the same framework, it can increase the objective function by deleting functions which subsequently become redundant.

The key to this approach is to separate each \alpha_i from all other hyperparameters, labelled “\alpha_{-i}“. The authors analyze l(\alpha_i) to show that L(\alpha) has a unique maximum with respect to \alpha_i. Then they update \alpha_i given:
\alpha_i=\frac{s_i^2}{q_i^2-s_i},\mbox{ if }q_i^2>s_i
\alpha_i=\infty,\mbox{ if }q_i^2\le s_i
where s_i and q_i are statistics needed for computing \alpha_i. Let \theta_i=q_i^2-s_i, then
If \theta_i>0 and \alpha_i<\infty, re-estimate \alpha_i;
If \theta_i>0 and \alpha_i=\infty, add \phi_i to the model with updated \alpha_i;
If \theta_i\le 0 and \alpha_i<\infty, delete \phi_i from the model and set \alpha_i=\infty;

After updating \alpha_i, it recomputes \Sigma and \mu (covariance and mean of posterior of w) and all related statistics then continue until convergence.

The paper shows that this algorithm achieves similar MSE performance and finds an even sparser estimate for w in less time than the old method. It’s quite clever to use the separable property of marginal likelihood to optimize the \alpha_i sequentially. It could also be borrowed to solve similar problems with independent \alpha vector.

Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it’s called noise correlation because the signal (stimulus) drive component has been subtracted out). In a 2010 science paper [2], noise correlation was shown to be much smaller than previously reported; in the range of 0.01 compared to the usual 0.1-0.2 range and stirred up the field (see [3] for a list of values). In this paper, they argue that this difference in noise correlation magnitude is due to population level covariations during anesthesia (they used sufentanil).

They showed that a factor model with 1-dimensional latent process captures the excess noise correlations observed during anesthesia. They extended GPFA [4] to include a linear stimulus drive component, and fit the trials corresponding to the same stimulus. This model assumes that the square-root transformed spike counts in 100 ms bins y(t) follows,

y(t) \sim \mathcal{N}(f(s(t)) + cx(t) + d, R)

x(t) \sim \mathcal{N}(0, K)

where s(t) is the stimulus, f(\cdot) is the tuning curve drive, d is the mean, R represents the individual (diagonal) covariance, c is the factor loadings vector, and x(t) is the 1-dimensional population wide latent process with assumed smoothness given by the kernel K(t_1, t_2) = \exp(-(t_1 - t_2)^2 / 2 / \tau^2 ). The square-root transformation is for the variance stabilization, making Poisson-like spike count observations more like Gaussian distributed.

They show that the most of the extra noise correlation observed during anesthesia is explained by the 1D latent process, and the remaining noise correlation is in par with awake condition, while noise correlations in the awake data didn’t get explained away by it (figure below).

Noise correlation as a function of geometric mean of the neurons

A remarkable result! (red) anesthesia, (blue) awake (solid) raw noise correlation (dashed) residual noise correlation unexplained by the 1D-latent variable

Note that this nonlinear relation between the mean rate and noise correlation is not a natural consequence of the generative model they assumed (equation above), since the noise is additive while firing rate is modulated by stimulus. That’s why they had to fit different factor loadings vector c for different stimulus conditions. However, they show that a multiplicative LNP model where a 1D latent process multiplicatively interacts with the stimulus drive in the log firing rate does reproduce this quality (Fig. 6). (It’s a shame that they didn’t actually fit this more elegant generative model directly to their data.)

Furthermore, they showed that local field potentials (LFP) in 0.5-2Hz range is correlated with the inferred latent variable. Hence, low-pass filtered LFP averaged over electrodes is a good proxy for L4 population correlation under anesthesia.

1D-latent process, LFP, and LNP results are all consistent with Matteo Carandini’s soloist vs chorister story except his preparations were awake rodent as far as I recall. This story also bears similarity with [5-8]. This is certainly an exciting type of model and there are many similar models being developed (e.g. [9, 10] and references therein). I hope to see more developments in these kinds of latent variable models; they have the advantage of scaling well in terms of number of parameters, but the disadvantage of being non-convex optimization. Also, I would like to see similar analysis for awake but spontaneous (non-task performing) state.

Their data and code are all available online (awesome!).

  1. Ecker, A. S., Berens, P., Cotton, R. J., Subramaniyan, M., Denfield, G. H., Cadwell, C. R., Smirnakis, S. M., Bethge, M., and Tolias, A. S. (2014). State dependence of noise correlations in macaque primary visual cortexNeuron, 82(1):235-248.
  2. Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Logothetis, N. K., and Tolias, A. S. (2010). Decorrelated neuronal firing in cortical microcircuitsScience, 327(5965):584-587.
  3. Cohen, M. R. and Kohn, A. (2011). Measuring and interpreting neuronal correlationsNat Neurosci, 14(7):811-819.
  4. Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., and Sahani, M. (2009). Gaussian-Process factor analysis for Low-Dimensional Single-Trial analysis of neural population activityJournal of Neurophysiology, 102(1):614-635.
  5. Luczak, A., Bartho, P., and Harris, K. D. (2013). Gating of sensory input by spontaneous cortical activityThe Journal of Neuroscience, 33(4):1684-1695.
  6. Goris, R. L. T., Movshon, J. A., and Simoncelli, E. P. (2014). Partitioning neuronal variabilityNat Neurosci, 17(6):858-865. [last week's blog post TBL]
  7. Tkačik, G., Marre, O., Mora, T., Amodei, D., Berry, M. J., and Bialek, W. (2013). The simplest maximum entropy model for collective behavior in a neural networkJournal of Statistical Mechanics: Theory and Experiment, 2013(03):P03011+.
  8. Okun, M., Yger, P., Marguet, S. L., Gerard-Mercier, F., Benucci, A., Katzner, S., Busse, L., Carandini, M., and Harris, K. D. (2012). Population rate dynamics and multineuron firing patterns in sensory cortexThe Journal of neuroscience : the official journal of the Society for Neuroscience, 32(48):17108-17119.
  9. Pfau, D., Pnevmatikakis, E. A., and Paninski, L. (2013). Robust learning of low-dimensional dynamics from large neural ensembles. NIPS 26

Computational Vision Course 2014 @ CSHL

Yesterday marked the start of the 2014 summer course in COMPUTATIONAL NEUROSCIENCE: VISION at Cold Spring Harbor. The course was founded in 1985 by Tony Movshon and Ellen Hildreth, with the goal of inspiring new generations of students to address problems at the intersection of vision, computation, and the brain. The list of past attendees is impressive.

For the next two weeks, students will have two 3-hour lectures per day from a smattering of leading experimental and theoretical neuroscientists (full list here), punctuated with breaks for conversation and recreation (swimming, soccer, frisbee, tennis, ping pong, squirt gun battles, to name a few) on the lush grounds of the Banbury Center at CSHL. Evenings will be devoted to informal discussions, tutorials, and to course projects inspired by lectures and/or interactions with students, TAs, and lecturers.

The course is a ridiculous amount of fun, despite the number of equations. I first attended as a second-year grad student in the summer of 2000 (part of a cohort that included Alex Huk, Anne Churchland, Hannah Bayer, Stephen David, Odelia Schwartz, and Richard Murray), and I returned in 2002 as TA alongside Anne Churchland. It’s an honor to be back this year as a co-organizer (with Geoff Boynton, Greg Horwitz, and Stefan Treue), and I’m joined by Jake Yates and two other students from UT.

Tony Movshon kicked things off yesterday with characteristic Tony levels of charm and wit. Here’s a short sampling of quotes:

  • “Matlab is an abomination. But I want you to learn as much as possible so that when you come to my lab, you will know how to get whatever it is you need to get done, done.”
  • “Only a theorist could use the terms ‘actual’ and ‘simulated’ simultaneously when referring to data.”
  • “The brain works the way it does because it’s made of meat, and meat is not deterministic.”
  • “Did I mention how dumb the motor system is?”

Multiple Object Tracking

On June 23rd, we discussed “Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model” by Edward Vul, Michael C Frank, George Alvarez, and Joshua B Tenenbaum.

    This paper proposes a dynamic probabilistic model as an ideal observer for multiple object tracking.  The ideal observer model uses a set of n Kalman filters to track n objects.  Observations at each time step are composed of n pairs of position and velocity values.  However, inference must be performed in order to associate each observation pair with the appropriate object identity (Kalman filter).

    The algorithm for multiple object tracking is accomplished by alternating between 1) a Rao-Blackwelized particle filter to assign identities to each observation and then 2) performing the Kalman filter to analytically track the objects.

    This paper goes on to compare the ideal observer performance (the multiple object tracking algorithm) to human confidence during a comparable psychophysical task.  The results are comparable, but it would be nice to compare the psychophysical tracking results to the model results directly.

    The second focus of this paper was to manipulate the influence of the velocity on the tracking in order to investigate the open question of whether velocity estimates are incorporated into multiple object tracking.  By manipulating the linear dynamics that generate the object motion, they are able to evaluate how different assumptions about velocity in the model multiple object tracking compare to human performance.  They conclude that the model most closely matches human performance when velocity is not weighted.

   This provides a nice algorithmic framework for thinking about how human subjects accomplish multiple object tracking.

Evaluating point-process likelihoods

We recently discussed two recent papers proposing improvements to the commonly used discrete approximation of the log-likelihood for a point process (Paninski, 2004). The likelihood is written as
ll(T|t_{1,..,N(T)}) = \sum_{i = 1}^{N(T)} \log( \lambda(t_i|\mathcal{H}_{t_i})) - \int_0^T \lambda(t|\mathcal{H}_{t}) dt
where t_i are the spike times and \lambda(t|\mathcal{H}_t) is the conditional intensity function (CIF) of the process at time t given the preceding spikes. Typically, the integral in this equation cannot be evaluated in closed form. The standard approximation computes the function by binning along a regular lattice with bins size \delta
ll(T|t_{1,..,N(T)}) \approx \sum_{i = 1}^{l} \Delta N_i \log(\lambda_i \delta) - \lambda_i\delta
where \Delta N_i is the number of spikes in the ith bin. Both papers demonstrate that smarter approximations to the integral are better for point-process statistics than naïvely binning spike train data.

The first paper we discussed (Citi et al., 2014) used the fact that neurons have an absolute refractory period after each spike to derive a correction term which makes up for the average spike placement within a bin
ll(T|t_{1,..,N(T)}) \approx \sum_{i = 1}^{l} \Delta N_i \log(\lambda_i \delta) - \left( 1-\frac{\Delta N_i}{2}\right)\lambda_i\delta
This correction is extremely simple to implement, but the authors demonstrate that it not provides a better approximation to the likelihood, but GLMs can be successfully fit to data using larger bin sizes than with the naïve approximation.

The second paper (Mena & Paninski, submitted) proposes to approximate the integral in the point-process likelihood using quadrature methods. The assuming the CIF is zero for some time $tau$ after each spike
\int_0^T \lambda(t|\mathcal{H}_{t}) dt = \int_0^{t_1} \lambda(t|\mathcal{H}_{t}) dt + \sum_{i=1}^{NT} \int_{t_i+\tau}^{t_{i+1}} \lambda(t|\mathcal{H}_{t}) dt
Then the smaller integrals (\int_{t_i+\tau}^{t_{i+1}} \lambda(t|\mathcal{H}_{t}) dt) are computed using Gaussian quadrature.
This method produces a very accurate approximation to the true (continuous) likelihood function, and is more accurate than the likelihood given by Citi et al. while evaluating the CIF at less points. In addition, they point out that their likelihood remains just as simple to maximize as the standard approximation when using non-adaptive quadrature methods. However, they do not show that their method improves parameter estimation of a GLM in practice.

Context-dependent computation by recurrent dynamics in prefrontal cortex

The past two lab meetings have focused on understanding the regression subspace analysis performed in Mante et al. 2013.

Context-Dependent Computation by Recurrent Dynamics in Prefrontal Cortex
Mante, Sussillo, Shenoy & Newsome; Nature 2013

The paper explores the hypothesis that context-dependent behavior is the result of early selection. However, they find no evidence for early selection mechanisms. Instead, they demonstrate that populations of neurons in the PFC can be projected into a neural state space in which the features of the task are separable regardless of context.

As we study larger populations of neurons in increasingly complex tasks, we must address the problem of how to meaningfully visualize and interrogate our data. This is a difficult challenge and Mante et al. 2013 provides a compelling approach.

Lab Meeting 11/12/13: Spike Identification Using Continuous Basis Pursuit

In a previous lab meeting (7/9/13) we discussed the Continuous Basis Pursuit algorithm presented in Ekanadham et al., 2011.  In today’s meeting, we considered the recent work by the authors in applying this method to the problem of identifying action potentials of different neurons from extracellular electrode recordings (Ekanadham et al., 2013).   Most current methods for automatic spike sorting involve identifying candidate regions where a spike may have occurred, projecting the data from these candidate time intervals onto some lower dimensional feature space, and then using clustering methods to group segments with similar voltage traces.   These methods tend to perform badly when the waveforms corresponding to spikes from different neurons overlap.

The authors of this paper model the voltage trace as the (noisy) linear combination of waveforms that are translated continuously in time:  V(t) = \sum_n^N \sum_i^C a_{n,i}W_n(t-\tau_{n,i}) + \epsilon(t).

The method proceeds, roughly, by alternately using least square to estimate the shape the waveforms W_{n}, and then using Continuous Basis Pursuit to estimate locations and amplitude (\tau_{n,i}, a_{n,i} ) of the waveforms present in the signal.


  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “A unified framework and method for automatic neural spike identification.” Journal of neuroscience methods (2013).
  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “Recovery of sparse translation-invariant signals with continuous basis pursuit.” Signal Processing, IEEE Transactions on 59.10 (2011): 4735-4744.
  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “A blind sparse deconvolution method for neural spike identification.” Advances in Neural Information Processing Systems. 2011.