Context-dependent computation by recurrent dynamics in prefrontal cortex

The past two lab meetings have focused on understanding the regression subspace analysis performed in Mante et al. 2013.

Context-Dependent Computation by Recurrent Dynamics in Prefrontal Cortex
Mante, Sussillo, Shenoy & Newsome; Nature 2013

The paper explores the hypothesis that context-dependent behavior is the result of early selection, i.e., gating out of irrelevant sensory inputs before they reach prefrontal cortex. However, the authors find no evidence for early selection mechanisms. Instead, they demonstrate that population activity in PFC can be projected into a neural state space in which the features of the task are separable regardless of context.

As we study larger populations of neurons in increasingly complex tasks, we must address the problem of how to meaningfully visualize and interrogate our data. This is a difficult challenge and Mante et al. 2013 provides a compelling approach.

Lab Meeting 11/12/13: Spike Identification Using Continuous Basis Pursuit

In a previous lab meeting (7/9/13) we discussed the Continuous Basis Pursuit algorithm presented in Ekanadham et al. (2011). In today’s meeting, we considered the authors’ recent work applying this method to the problem of identifying the action potentials of different neurons in extracellular electrode recordings (Ekanadham et al., 2013). Most current methods for automatic spike sorting involve identifying candidate regions where a spike may have occurred, projecting the data from these candidate time intervals onto a lower-dimensional feature space, and then using clustering to group segments with similar voltage traces. These methods tend to perform badly when the waveforms corresponding to spikes from different neurons overlap in time.

The authors of this paper model the voltage trace as a (noisy) linear combination of waveforms that are translated continuously in time:  V(t) = \sum_{n=1}^{N} \sum_{i=1}^{C} a_{n,i} W_n(t - \tau_{n,i}) + \epsilon(t).
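A minimal simulation of this generative model (toy waveform shapes, amplitudes, and event times chosen arbitrarily for illustration, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1e-3
t = np.arange(0.0, 1.0, dt)            # 1 s of "voltage" at 1 kHz (toy scale)

def waveform(t, tau, width):
    """Toy biphasic spike waveform W_n centered near time tau (hypothetical shape)."""
    u = (t - tau) / width
    return u * np.exp(-u ** 2)

# Two neurons (N = 2), each with a few events: (amplitude a_{n,i}, time tau_{n,i})
events = {0: [(1.0, 0.20), (0.8, 0.55)],
          1: [(1.2, 0.21), (0.9, 0.80)]}   # note the near-overlap around 0.2 s
widths = {0: 0.002, 1: 0.004}

V = np.zeros_like(t)
for n, evs in events.items():
    for a, tau in evs:
        V += a * waveform(t, tau, widths[n])
V += 0.01 * rng.standard_normal(t.size)    # additive noise epsilon(t)
```

The near-synchronous events around 0.2 s are exactly the regime where clustering-based sorters struggle.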

The method proceeds, roughly, by alternating between using least squares to estimate the shapes of the waveforms W_{n}, and using Continuous Basis Pursuit to estimate the locations and amplitudes (\tau_{n,i}, a_{n,i}) of the waveforms present in the signal.


  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “A unified framework and method for automatic neural spike identification.” Journal of Neuroscience Methods (2013).
  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “Recovery of sparse translation-invariant signals with continuous basis pursuit.” Signal Processing, IEEE Transactions on 59.10 (2011): 4735-4744.
  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “A blind sparse deconvolution method for neural spike identification.” Advances in Neural Information Processing Systems. 2011.

Lab Meeting 7/23/13 Adaptive pooling of motion signals in humans

We discussed a recent paper that uses ocular following responses in humans to demonstrate a dissociation between the oculomotor system and the perceptual system. Using stimuli generated by filtering noise in the Fourier domain, the authors constructed naturalistic random-phase textures that have identical speeds but differ in their frequency content. Interestingly, as the stimuli became more broadband, the oculomotor system exploited the richness of the stimulus and became faster and more precise, whereas the humans’ psychophysical performance declined.

Rather than conclude that there are two distinct visual systems, the authors hypothesize that the oculomotor and perceptual systems rely on the same encoding and decoding machinery but differ in their gain control. Using likelihood-based methods with an added gain-control step, they fit the dissociated responses of the two systems, using Naka-Rushton gain control for the eye movements and everybody’s friend, divisive normalization, for the perceptual system.
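As a rough illustration of the two gain-control forms (toy parameters and toy pooling, not the fitted model from the paper):

```python
import numpy as np

def naka_rushton(c, r_max=1.0, c50=0.2, n=2.0):
    """Naka-Rushton contrast-response function (toy parameters)."""
    cn = c ** n
    return r_max * cn / (cn + c50 ** n)

def divisive_normalization(responses, sigma=0.1):
    """Each channel's response divided by pooled population activity (toy form)."""
    responses = np.asarray(responses, dtype=float)
    return responses / (sigma ** 2 + np.sum(responses ** 2))

contrasts = np.linspace(0.0, 1.0, 6)
oculomotor_gain = naka_rushton(contrasts)                 # saturates toward r_max
perceptual_gain = divisive_normalization([0.2, 0.5, 0.8])
```

The Naka-Rushton function saturates as stimulus contrast (or frequency content) grows, while divisive normalization rescales each channel by the pooled activity of the whole population; the paper's point is that swapping only this step can reproduce the dissociation.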


Lab Meeting 7/9/13: Continuous Basis Pursuit

We discussed the method of Continuous Basis Pursuit, introduced in recent papers by Ekanadham et al., for decomposing a signal into a linear combination of continuously translated copies of a small set of elementary features.  A standard method for recovering the time-shifts and amplitudes of these features is to introduce a dictionary consisting of many shifted copies of the features, and then use basis pursuit or a related method to recover a representation of the signal as a sparse linear combination of these shifted copies of the elementary waveforms.  Accurately representing the signal requires relatively close spacing of the dictionary elements; however, such close spacing yields highly correlated dictionary elements, which degrades the performance of basis pursuit and related recovery algorithms.

With Continuous Basis Pursuit, Ekanadham et al. circumvent this problem by first augmenting the dictionary to allow for continuous interpolation, either using a first- or second-order Taylor interpolation or using an interpolation based on trigonometric splines.  These augmentations increase the ability to accurately represent the signal as a sparse sum of features without introducing too much additional correlation among the dictionary vectors, and the smooth interpolation allows the recovery of continuous, rather than discrete, time shifts.  This kind of decomposition of a signal into (continuously) shifted copies of a few basic features is useful in the spike-sorting problem, for example.
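A toy illustration of the first-order Taylor variant (my toy waveform, a single feature, no sparsity): since f(t - \tau) \approx f(t) - \tau f'(t), augmenting the dictionary with the waveform's derivative lets plain least squares recover a sub-grid, continuous time shift:

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 2001)
dt = t[1] - t[0]

def f(t):
    """Toy elementary waveform (a Gaussian bump; hypothetical shape)."""
    return np.exp(-t ** 2 / 0.02)

true_a, true_tau = 1.5, 0.004                  # amplitude and a small sub-grid shift
signal = true_a * f(t - true_tau)

# First-order Taylor interpolation: f(t - tau) ~ f(t) - tau * f'(t),
# so the dictionary holds the waveform and its derivative.
D = np.column_stack([f(t), np.gradient(f(t), dt)])

coef, *_ = np.linalg.lstsq(D, signal, rcond=None)
a_hat = coef[0]                                # amplitude estimate
tau_hat = -coef[1] / coef[0]                   # continuous time-shift estimate
```

The recovered shift is not restricted to the sampling grid, which is the whole point; CBP combines this augmentation with a sparsity-inducing solver over many candidate features.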


  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “Recovery of sparse translation-invariant signals with continuous basis pursuit.” Signal Processing, IEEE Transactions on 59.10 (2011): 4735-4744.
  • Ekanadham, Chaitanya, Daniel Tranchina, and Eero P. Simoncelli. “A blind sparse deconvolution method for neural spike identification.” Advances in Neural Information Processing Systems. 2011.

Lab Meeting 6/25/2013: Restricted Boltzmann Machines and Restricted Boltzmann Forests

Lab meeting focused on the tractability of using Restricted Boltzmann Machines (RBMs) in modeling, and on an extension of the RBM known as the Restricted Boltzmann Forest (RBF), as presented in Larochelle, Bengio and Turian (2010). This post reviews the tractability of RBMs and briefly discusses the main contribution of RBFs.

Restricted Boltzmann Machine (RBM)

An RBM (originally the Harmonium; Smolensky 1986) is a generative neural network with two groups of neurons (hidden h and visible x) that form a bipartite graph.

We can begin by noting the disadvantages in computational complexity of using an RBM (and other factor models). Writing the energy as E(x,h) = -b^\top x - c^\top h - x^\top W h, the probability distribution over the RBM is calculated as follows:

p(x) = \frac{q(x)}{Z}, \qquad q(x) = \sum_h e^{-E(x,h)} = e^{b^\top x} \prod_j \left(1 + e^{c_j + (x^\top W)_j}\right)
In general, q(x) is relatively efficient to compute, thanks to the product over hidden units. However, the computational cost of calculating the normalizer Z limits the power of this model. In the case of the RBM, Z is:

Z = \sum_x \sum_h e^{-E(x,h)} = \sum_x q(x)
In this form Z involves a sum over all 2^V configurations of the visible units and is thus extremely prohibitive to compute; however, Z can be factorized so that the sum over x is carried out analytically:

Z = \sum_h e^{c^\top h} \prod_i \left(1 + e^{b_i + (W h)_i}\right)
In this form the explicit sum over x has been eliminated, and Z only needs to be calculated once, reducing the overall cost of calculating p(x). This allows the visible layer of the RBM to grow much larger without an exponential penalty in computational cost. The sum now runs over all configurations of the hidden units, however, which limits the number of hidden units the RBM may have.
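The factorization is easy to verify numerically on an RBM small enough for brute force (random parameters, and my notation for the energy, E(x,h) = -b^\top x - c^\top h - x^\top W h):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
V, H = 4, 3                              # tiny RBM so brute force is feasible
W = 0.5 * rng.standard_normal((V, H))    # weights
b = 0.5 * rng.standard_normal(V)         # visible biases
c = 0.5 * rng.standard_normal(H)         # hidden biases

# Brute force: sum exp(-E) over all 2^V * 2^H joint configurations
Z_brute = sum(np.exp(x @ W @ h + b @ x + c @ h)
              for x in map(np.array, itertools.product((0, 1), repeat=V))
              for h in map(np.array, itertools.product((0, 1), repeat=H)))

# Factorized: the sum over x is done analytically (one Bernoulli factor per
# visible unit), leaving only a sum over the 2^H hidden configurations
Z_fact = sum(np.exp(c @ h) * np.prod(1.0 + np.exp(b + W @ h))
             for h in map(np.array, itertools.product((0, 1), repeat=H)))
```

The two computations agree, but only the factorized one stays tractable as the visible layer grows; its cost is still exponential in H, which is what the RBF attacks.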

Restricted Boltzmann Forests (RBFs)

Restricted Boltzmann Forests were designed to alleviate some of the computational limitations on hidden units encountered with RBMs. RBFs keep the same general structure but place additional restrictions on the hidden units: the hidden units are organized into n trees of depth d, such that if a hidden unit is active, its right subtree must be inactive and its left subtree may be active; if a hidden unit is inactive, its left subtree must be inactive and its right subtree may be active. These restrictions leave far fewer admissible configurations of the hidden units, and thus reduce the cost of calculating Z. Consider an RBM with 20 hidden units: computing Z requires summation over 2^20 possible configurations (settings of 0 or 1) of the hidden units. An RBF with 5 trees of depth 3 would have 75 hidden units (15 per tree) but the same computational cost (number of admissible configurations) as an RBM with 20 hidden units, since each depth-3 tree admits 2^4 = 16 configurations and 16^5 = 2^20.
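The configuration count can be checked by brute-force enumeration. This sketch uses my reading of the constraint (an active unit silences its entire right subtree, an inactive unit its entire left subtree), which reproduces the counts in the text:

```python
import itertools

def subtree(i, n):
    """Heap indices of the subtree rooted at i (children of j are 2j+1, 2j+2)."""
    out, stack = [], [i]
    while stack:
        j = stack.pop()
        if j < n:
            out.append(j)
            stack += [2 * j + 1, 2 * j + 2]
    return out

def is_valid(cfg):
    """Tree constraint: active unit -> right subtree all off;
    inactive unit -> left subtree all off (my interpretation)."""
    n = len(cfg)
    for i in range(n):
        banned = subtree(2 * i + 2 if cfg[i] else 2 * i + 1, n)
        if any(cfg[j] for j in banned):
            return False
    return True

depth = 3
n_units = 2 ** (depth + 1) - 1        # 15 units per tree; 5 trees -> 75 units
n_valid = sum(is_valid(cfg)
              for cfg in itertools.product((0, 1), repeat=n_units))
# Each depth-3 tree admits 16 configurations, so 5 independent trees give
# 16**5 = 2**20, matching the cost of an RBM with 20 hidden units.
```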

I have left out many of the details regarding RBFs, including the procedures for performing learning and calculating p(x), all of which can be found in Larochelle, Bengio & Turian (2010).


  • Hinton, Geoffrey. “A practical guide to training restricted Boltzmann machines.” Momentum 9.1 (2010).
  • Larochelle, Hugo, Yoshua Bengio, and Joseph Turian. “Tractable multivariate binary density estimation and the restricted Boltzmann forest.” Neural computation 22.9 (2010): 2285-2307.
  • Smolensky, Paul. “Information processing in dynamical systems: Foundations of harmony theory.” (1986): 194.

Lab Meeting 6/10/2013: Hessian free optimization

James Martens has been publishing a bag of tips and tricks for the large-scale non-convex optimization that arises in training deep networks and recurrent neural networks (RNNs). He and his collaborators were able to train deep networks without pre-training, beating the state-of-the-art, and to train RNNs far more effectively than backpropagation through time. The basic idea is the use of 2nd-order (curvature) information via heuristic modifications of the conjugate gradient (CG) method. CG is “Hessian-free” since one only needs Hessian-vector products along a single direction at a time, which is much cheaper than computing the full Hessian (often prohibitive for large-scale problems). The objective function f(\theta) is repeatedly approximated locally by a quadratic q_\theta(p) = f(\theta) + \nabla f(\theta)^\top p + \frac{1}{2} p^\top B p, which is then minimized. Some of the tricks are:

  1. Use conjugate gradient instead of other quasi-Newton methods like L-BFGS, or nonlinear conjugate gradient.
  2. Use the Gauss-Newton approximation. For non-convex problems the Hessian can have negative eigenvalues, which can lead to erratic behavior of the CG step, since CG assumes a positive-definite B. Hence, they propose using the Gauss-Newton approximation, which discards the second-order derivatives of the network function and is guaranteed to be positive semi-definite. In the following Hessian, the second term is simply ignored.
    H_{ij} = \frac{\partial^2 L(g)}{\partial \theta_i \partial \theta_j} = L''(g) \frac{\partial g}{\partial \theta_i}\frac{\partial g}{\partial \theta_j} + L'(g) \frac{\partial^2 g}{\partial \theta_i \partial \theta_j}
  3. Use fraction of improvement as termination condition for CG (instead of the regular residual norm condition).
  4. Add regularization (damping) to the Hessian (or its approximation), and update its trust-region parameter via a Levenberg-Marquardt-style heuristic.
  5. Do semi-online, mini-batch updates.
  6. For training RNNs, use structural damping, which penalizes large changes to parameters to which the hidden-state sequence is highly sensitive.
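A minimal sketch of the “Hessian-free” core behind the tricks above (my toy example, not Martens’ implementation): linear CG that touches the curvature matrix B only through matrix-vector products. In a real application, hvp would compute Gauss-Newton-vector products via autodiff passes rather than forming B.

```python
import numpy as np

def cg(hvp, g, iters=50, tol=1e-12):
    """Minimize q(p) = g.p + 0.5 p.B.p using only Hessian-vector products
    hvp(v) = B v; B itself is never formed (the 'Hessian-free' part)."""
    p = np.zeros_like(g)
    r = -g - hvp(p)                 # residual of the system B p = -g
    d = r.copy()
    rs = r @ r
    for _ in range(iters):
        Bd = hvp(d)
        alpha = rs / (d @ Bd)       # exact line search along d
        p += alpha * d
        r -= alpha * Bd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d   # new conjugate direction
        rs = rs_new
    return p

# Toy check on a small positive-definite quadratic (a stand-in for the
# damped Gauss-Newton matrix)
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
B = A @ A.T + 10.0 * np.eye(10)
g = rng.standard_normal(10)
p_star = cg(lambda v: B @ v, g)    # should satisfy B p_star = -g
```

Trick 3 above would replace the residual-norm stopping rule here with a test on the fractional improvement in q(p).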


  • James Martens. Deep learning via Hessian-free optimization. ICML 2010
  • James Martens, Ilya Sutskever. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML 2011

new paper: detecting overlapped spikes

I’m proud to announce the publication of our “zombie” spike sorting paper (Pillow, Shlens, Chichilnisky & Simoncelli 2013), which addresses the problem of detecting overlapped spikes in multi-electrode recordings.

The basic problem we tried to address is that standard “clustering” based spike-sorting methods often miss near-synchronous spikes. As a result, you get cross-correlograms that look like this:


When I first saw these correlograms (back in 2005 or so), I thought: “Wow, amazing! Retinal ganglion cells inhibit each other with 1-millisecond precision! Should we send this to Nature or Science?” My more sober experimental colleagues pointed out that this was likely just a (lowly) spike-sorting artifact. So we set out to address the problem (leading to the publication of this paper a mere 8 years later!)

Let me just give a little bit of intuition for the problem and our solution, mainly as an excuse for showing some of the beautiful figures that Jon Shlens made.

Clustering-based spike sorting
Most spike-sorting algorithms use some variant of clustering. They identify “candidate” spike waveforms (basically, large deviations in the recorded signal) and apply clustering to determine which spikes came from which neurons. The problem is that when two spikes are close together in time, they superimpose linearly, creating a composite waveform that may not look much like either of the two spike waveforms in isolation.  Here’s a figure illustrating the phenomenon, showing the superposition of two synchronous spikes on a single channel (A) and in principal component space (B):


In B, the black cloud of points represents waveforms assigned to the first neuron (black trace in A), and the red cloud of points represents waveforms assigned to the second neuron (red trace in A).  The blue arrow shows where you end up in PCA space if you add the mean black spike to the mean red spike.  Crazy, huh?  You end up way off in no-man’s-land!

Let me point out: this is actual data from an actual pair of retinal ganglion cells, recorded by Jon Shlens in EJ Chichilnisky’s lab.  See those little gray dots near the blue arrow? Those are projections of actual waveforms recorded during the experiment, which the clustering algorithm simply threw out.  (Makes sense, given how far they are from both clusters, right?) Discarding these spikes is what produced the curious artifact in the cross-correlogram shown above.

(The problem is actually slightly worse than this: if the spikes are not precisely synchronous but still overlap, then it turns out that the superposition of waveforms traces out an entire manifold in PCA space; see Fig 1C-D of the paper if your eyes can handle it.)

(Note #2: the problem also arises for window discriminators, matched filtering, and basically all methods that don’t explicitly take superposition into account.)

Binary pursuit
The good news is that we can address this problem using a generative model that  incorporates superposition of spike waveforms.  I won’t bore you with details here (read the paper if you like to be bored), but the basic idea is to use the clustering-based algorithm to identify the isolated spikes in the recording. Then, go through the recorded data and (greedily) insert spikes whenever doing so will reduce the residual error (essentially, whenever it looks like a particular spike should go there), with a sparse prior on spiking.  This results in an (approximate) MAP estimate of the spike train. (We call it binary pursuit since it’s essentially a binary version of matching pursuit).
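As a cartoon of the greedy step (my simplification: one neuron, fixed unit amplitude, and a flat penalty standing in for the sparse prior; not the paper's actual algorithm), insertion happens wherever it most reduces the residual squared error:

```python
import numpy as np

def greedy_insert(trace, template, penalty, max_spikes=100):
    """Greedily insert unit-amplitude copies of `template` wherever doing so most
    reduces the residual squared error, stopping once the best reduction no
    longer beats `penalty` (a stand-in for the sparse prior on spiking)."""
    residual = trace.copy()
    L = len(template)
    energy = template @ template
    times = []
    for _ in range(max_spikes):
        # Squared-error reduction from inserting at time t: 2 * (r . w_t) - (w . w)
        gains = 2 * np.correlate(residual, template, mode="valid") - energy
        t = int(np.argmax(gains))
        if gains[t] <= penalty:
            break
        residual[t:t + L] -= template
        times.append(t)
    return sorted(times), residual

# Toy demo (synthetic data): two partially overlapping spikes from one neuron
rng = np.random.default_rng(2)
w = np.exp(-((np.arange(20) - 10.0) / 3.0) ** 2)   # toy spike waveform
trace = np.zeros(200)
for t0 in (50, 62):                                 # spikes 12 samples apart
    trace[t0:t0 + 20] += w
trace += 0.02 * rng.standard_normal(200)
times, resid = greedy_insert(trace, w, penalty=0.5)
```

Because each insertion explains its share of the superposition and is subtracted out, the second, overlapped spike becomes easy to find in the residual, which is exactly what clustering on raw snippets cannot do.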

Nothing too fancy here. Similar methods have been developed by Michael Berry’s and Vijay Balasubramanian’s groups (see Prentice et al 2011). There are no theoretical or practical guarantees with our method, but there are a few cool tricks and I think the results are pretty nice / compelling.  Here’s one of the comely figures Jon made:

Left: black dots: spikes detected by clustering; red dots: “missed” spikes.
Middle: colored ellipses: where we’d expect points to be, as a function of time-shift of second spike (based on generative model).
Right: overlapped spikes detected by binary pursuit (colored by time-shift). 

And finally: look what it does with the correlograms:
(Showing 8 pairs of neurons: gray=original clustering; red = corrected with our method.)

There is of course huge room for improvement. One would like to combine our approach with methods that identify the number of neurons (e.g., see Frank Wood’s work on Dirichlet-Process based spike sorting) and allow for non-stationarity in spike waveforms (see Calabrese & Paninski 2010).    (And I should add: the last section of the paper focuses on the signal detection theory issue of knowing when to trust the results of this or any algorithm.)

There’s lots of active work on this problem, and in my view it’s a great area for the development and application of Bayesian methods*, since we have a ton of prior information about spike trains, neural tissue, and the statistical features of the recorded data, much of which has yet to be exploited. The challenge is to find computationally efficient inference methods, since the raw data is pretty big (512 electrodes sampled at 20 kHz, for our dataset).

* However, in my view, Frank Wood’s utopian vision of a world in which neuroscientists don’t spike sort, but rather perform all analyses on multiple samples from the posterior over spike sortings, is still (thankfully) a very long ways off.