new paper: detecting overlapped spikes

I’m proud to announce the publication of our “zombie” spike sorting paper (Pillow, Shlens, Chichilnisky & Simoncelli 2013), which addresses the problem of detecting overlapped spikes in multi-electrode recordings.

The basic problem we tried to address is that standard “clustering” based spike-sorting methods often miss near-synchronous spikes. As a result, you get cross-correlograms that look like this:

Pillow13_PLoSONE_SpikeSorting_xcfig

When I first saw these correlograms (back in 2005 or so), I thought: “Wow, amazing —retinal ganglion cells inhibit each other with 1-millisecond precision! Should we send this to Nature or Science?” My more sober experimental colleagues pointed out that that this was likely only a (lowly) spike sorting artifact. So we set out to address the problem (leading to the publication of this paper a mere 8 years later!)

Let me just give a little bit of intuition for the problem and our solution, mainly as an excuse for showing some of the beautiful figures that Jon Shlens made.

Clustering-based spike sorting
Most spike-sorting algorithms use some variant of clustering. They identify “candidate” spike waveforms (basically, large deviations in the recorded signal) and apply clustering to determine which spikes came from which neurons. The problem is that when two spikes are close together in time, they superimpose linearly, creating a composite waveform that may not look much like either of the two spike waveforms in isolation.  Here’s a figure illustrating the phenomenon, showing the superposition of two synchronous spikes on a single channel (A) and in principal component space (B):

Pillow13_PLoSONE_SpikeSorting_fig1

In B, the black cloud of points represent waveforms assigned to the first neuron (black trace in A), and the red cloud of points represent waveforms assigned to the second neuron (red trace in A).  The blue arrow shows where you end up in PCA space if you add the mean black spike to the mean red spike.  Crazy, huh?  You end up way off in no-man’s land!

Let me point out: this is actual data from an actual pair of retinal ganglion cells, recorded by Jon Shlens in EJ Chichilnisky‘s lab.  See those little gray dots near the blue arrow? Those are projections of actual waveforms recorded during the experiment, which the clustering algorithm simply threw out.  (Makes sense, given how far they are from both clusters, right?) Discarding these spikes is what produced the curious artifact in the cross-correlogram shown above.

(The problem is actually slightly worse than this: if the spikes are not precisely synchronous but still overlap, then it turns out that the superposition of waveforms traces out an entire manifold in PCA space; see Fig 1C-D of the paper if your eyes can handle it.)

(Note #2: the problem also arises for window discriminators, matched filtering, and basically all methods that don’t explicitly take superposition into account.)

Binary pursuit
The good news is that we can address this problem using a generative model that  incorporates superposition of spike waveforms.  I won’t bore you with details here (read the paper if you like to be bored), but the basic idea is to use the clustering-based algorithm to identify the isolated spikes in the recording. Then, go through the recorded data and (greedily) insert spikes whenever doing so will reduce the residual error (essentially, whenever it looks like a particular spike should go there), with a sparse prior on spiking.  This results in an (approximate) MAP estimate of the spike train. (We call it binary pursuit since it’s essentially a binary version of matching pursuit).

Nothing too fancy here. Similar methods have been developed by Michael Berry’s and Vijay Balasubramanian’s groups (see Prentice et al 2011). There are no theoretical or practical guarantees with our method, but there are a few cool tricks and I think the results are pretty nice / compelling.  Here’s one of the comely figures Jon made:

Left: black dots: spikes detected by clustering; red dots: “missed” spikes.
Middle: colored ellipses: where we’d expect points to be, as a function of time-shift of second spike (based on generative model).
Right: overlapped spikes detected by binary pursuit (colored by time-shift). 

Pillow13_PLoSONE_SpikeSorting_scatter
And finally: look what it does with the correlograms:
(Showing 8 pairs of neurons: gray=original clustering; red = corrected with our method.)
Pillow13_PLoSONE_SpikeSorting_xcfig_fixed

There is of course huge room for improvement. One would like to combine our approach with methods that identify the number of neurons (e.g., see Frank Wood’s work on Dirichlet-Process based spike sorting) and allow for non-stationarity in spike waveforms (see Calabrese & Paninski 2010).    (And I should add: the last section of the paper focuses on the signal detection theory issue of knowing when to trust the results of this or any algorithm.)

There’s lots of active work on this problem, and in my view it’s a great area for the development and application of Bayesian methods*, since we have a ton of prior information about spike trains and neural tissue and the statistical features of the recorded data, much of which has yet to be exploited. The challenge is to find computationally efficient inference methods, since the raw data is pretty big (512 electrodes sampled at 20KHz, for our dataset).

* However, in my view, Frank Wood‘s utopian vision of a world in which neuroscientists don’t spike sort, but rather perform all analyses on multiple samples from the posterior over spike sortings, is still (thankfully) a very long ways off.

Sensory Coding Workshop @ MBI

This week, Memming and I are in Columbus, Ohio for a workshop on “Sensory and Coding”, organized by  Brent DoironAdrienne FairhallDavid Kleinfeld, and John Rinzel.

Monday was “Big Picture Day”, and I gave a talk about Bayesian Efficient Coding, which represents our attempt to put Barlow’s Efficient Coding Hypothesis in a Bayesian framework, with an explicit loss function to specify what kinds of posteriors are “good”. One of my take-home bullet points was that “you can’t get around the problem of specifying a loss function”, and entropy is no less arbitrary than other choice. This has led to some stimulating lunchtime discussions with Elad Schneidman, Surya Ganguli, Stephanie Palmer, David Schwab, and Memming over whether entropy really is special (or not!).

It’s been a great workshop so far, with exciting talks from a panoply of heavy hitters, including  Garrett Stanley, Steve Baccus, Fabrizio Gabbiani, Tanya Sharpee, Nathan Kutz, Adam Kohn, and Anitha Pasupathy.  You can see the full lineup here:
http://mbi.osu.edu/2012/ws6schedule.html

Bayesian inference for Poisson-GLM with Laplace prior

During the last lab meeting, we talked about using expectation propagation (EP), an approximate Bayesian inference method, to fit generalized linear models (Poisson-GLMs) under Gaussian and Laplace (or exponential) priors on the filter coefficients. Both priors give rise to log-concave posteriors, and the Laplace prior has the useful property that the MAP estimate is often sparse (i.e., many weights are exactly zero).  EP attempts to find the posterior mean, which is not (ever) sparse, however.

Bayesian inference under a Laplace prior is quite challenging. Unfortunately, our best friend the Laplace approximation is intractable, since the prior is non-differentiable at zero.

EP is one of many well-known methods for obtaining a Gaussian approximation to an intractable posterior distribution. It aims to minimize the KL divergence between the true posterior and an approximating Gaussian. If the approximating distribution is in the exponential family, moment-matching minimizes the KL divergence. However, direct minimization is typically intractable due to the difficulty of computing the moments of the true (high-dimensional) posterior. Instead, EP represents the posterior with a product of Gaussian “site potentials”, one for each term in the likelihood, each of which can be updated sequentially.  (EP works well when the true posterior involves a product of independent likelihood terms and a prior.)

The main three steps in EP are as follows: (i) given site approximations (and the current approximate Gaussian posterior), we compute the cavity distribution by leaving out one of the site potentials; (2) we then compute the tilted distribution which is a product of the cavity distribution and the true factor from the model; and (3) finally we estimate site parameters such that the new Gaussian posterior has the same mean and covariance as the tilted distribution. This description sounds complicated; however, it’s just a simple 1-d update in the Gaussian posterior mean and covariance.

Some might wonder how to set the hyperparameter (in the Laplace prior). One can do cross-validation since there is only one hyperparameter in this case; however, one can do 0-th moment (normalizer) matching between the tilted distribution and the new Gaussian posterior to set it. (this is a real messy part, in my opinion.)

After going over the technical details mentioned above, we looked at the results in the papers. Long story short, in terms of MSE, the posterior mean of the approximate Gaussian posterior obtained by EP performed better than the numerically optimized MAP estimate (as it should be). However, what’s bizarre to me is that the posterior mean from EP was better than the MAP estimate in terms of prediction performance (measured by KL distance of the estimated model from ground truth), when the weights are truly sparse. Isn’t posterior mean usually less sparse than MAP, if the likelihood is non-symmetric around zero? If so, shouldn’t the MAP estimate be better than the posterior mean of EP estimate if the weights are truly sparse?

Papers:

http://infoscience.epfl.ch/record/161312/files/ecml07.pdf?version=1

http://bethgelab.org/media/publications/fncom_2010_00012.pdf

Lab Meeting 3/25/2013: Demixing PCA (dPCA)

If you had lots of spike trains over 4 seconds for 800 neurons, 6 stimulus conditions, and 2 behavioral choices, how would you visualize your data? Unsupervised dimensionality reduction techniques, such as principal component analysis (PCA) finds orthonormal basis vectors that captures the most variance of the data, but the results are not necessarily interpretable. What one wants is to say is something like:

“Along this direction, the population dynamics seems to encode stimulus, and along this other orthogonal dimension, neurons are modulated by the motor behavior…”

But being completely unsupervised, PCA is blind to the stimulus conditions and behavioral outputs associated with the data. One can use a regression-type analysis (supervised learning), or discriminant analysis, but the following paper we discussed today used an interesting approach (first presented, in an earlier form at COSYNE 2011):

Wieland Brendel and Ranulfo Romo and Christian K. Machens. Demixed Principal Component Analysis, Advances in Neural Information Processing Systems 24, NIPS 2011

The main idea is the decomposition of the marginal (empirical) covariance over all conditions as a sum of conditional (empirical) covariances, so that for a given vector, you can decompose the variance explained by that projection into contributions from each variable (and their combinations). Then, for higher interpretability, they devise sparsifying penalties that would make each vector explain one of the variables, and discourage mixtures. As a result, a PCA-like method that still captures most of the variance, but gives an orthonormal basis vectors that corresponds mostly to modulations for each stimulus condition. They provide two algorithms that initializes at PCA solution: one is a gradient based optimization on a penalized PCA cost function, and the other is an extension of probabilistic PCA that enforces orthogonality and sparsity (via a somewhat arbitrary thresholding step).

However, dPCA has several potential disadvantages:

  • dPCA requires a constant length trial (in fact, every variable to have a constant length)
  • dPCA assumes each event is time-locked (e.g., behavioral delay cannot be accounted for, nor you can have random time delays)
  • each condition requires sufficient data to have a reliable PSTH (conditions with low probability can hurt your analysis), and it cannot analyze single trials
  • conditions must be categorical; dPCA cannot deal with continuous stimulus or observations
  • dPCA is not convex; the optimization result depends on initialization

dPCA suggests interesting directions for future work to formulate a generative model sparse Bayesian prior for demixing, and find a more direct link for the penalized cost function.

Cosyne 2013

We’ve recently returned from Utah, where several us attended the 10th annual Computational and Systems Neuroscience (CoSyNe) Annual Meeting.  It’s hard to believe Cosyne is ten!  I got to have a little fun with the opening-night remarks, noting that Facebook and Cosyne were founded only a month apart in Feb/March 2004, with impressive aggregate growth in the years since:

FBCosyneGrowth

The meeting kicked off with a talk from Bill Bialek  (one of the invited speakers for the very first Cosyne—where he gave a chalk talk!), who provoked the audience with a talk entitled “Are we asking the right questions.” His answer (“no”) focused in part on the issue of what the brain is optimized for: in his view, for extracting information that is useful for predicting the future.

In honor of the meeting’s 10th anniversary, three additional reflective/provocative talks on the state of the field were contributed by Eve Marder, Terry Sejnowski, and Tony Movshon.  Eve spoke about how homeostatic mechanisms lead to “degenerate” (non-identifiable) biophysical models and confer robustness in neural systems. Terry talked about the brain’s sensitivity to “suspicious coincidences” of spike patterns and the recent BAM proposal (which he played a central part in advancing). Tony gave the meeting’s final talk, a lusty defense of primate neurophysiology against the advancing hordes of rodent and invertebrate neuroscience, arguing that we will only understand the human brain by studying animals with sufficiently similar brains.

See Memming’s blog post for a summary of some of the week’s other highlights.  We had a good showing this year, with 7 lab-related posters in total:

  • I-4. Semi-parametric Bayesian entropy estimation for binary spike trains. Evan Archer, Il M Park, & Jonathan W Pillow.  [oops—we realized after submitting that the estimator is not *actually* semi-parametric; live and learn.]
  • I-14. Precise characterization of multiple LIP neurons in relation to stimulus and behavior.  Jacob Yates, Il M Park, Lawrence Cormack, Jonathan W Pillow, & Alexander Huk.
  • I-28. Beyond Barlow: a Bayesian theory of efficient neural coding.  Jonathan W Pillow & Il M Park.
  • II-6. Adaptive estimation of firing rate maps under super-Poisson variability.  Mijung Park, J. Patrick Weller, Gregory Horwitz, & Jonathan W Pillow.
  • II-14. Perceptual decisions are limited primarily by variability in early sensory cortex.  Charles Michelson, Jonathan W Pillow, & Eyal Seidemann
  • II-94. Got a moment or two? Neural models and linear dimensionality reduction. Il M Park, Evan Archer, Nicholas Priebe, & Jonathan W Pillow
  • II-95. Spike train entropy-rate estimation using hierarchical Dirichlet process priors.  Karin Knudson & Jonathan W Pillow.

Lab Meeting 2/11/2013: Spectral Learning

Last week I gave a brief introduction to Spectral Learning. This is a topic I’ve been wanting to know more about since reading the (very cool) NIPS paper from Lars Buesing, Jakob Macke and Maneesh Sahani, which describes a spectral method for fitting latent variable models of multi-neuron spiking activity (Buesing et al 2012).  I presented some of the slides (and a few video demos) from the spectral learning tutorial organized by Geoff Gordon, Byron Boots, and Le Song  at AISTATS 2012.

So, what is spectral learning, and what is it good for?  The basic idea is that we can fit latent variable models very cheaply using just the eigenvectors of a covariance matrix, computed directly from the observed data. This stands in contrast to “standard” maximum likelihood methods (e.g., EM or gradient ascent), which require expensive iterative computations and are plagued by local optima.

Essentially: spectral learning refers to a collection of moment-based methods. You compute some covariance matrices and then grab their dominant eigenvectors.  You pay some cost in terms of statistical efficiency (i.e., they may not converge to the “true” answer as quickly as the MLE, assuming one can find it), but you do get consistency, freedom from local optima, speed, and low computational cost.   (The Buesing 2012 paper does a nice trick of using spectral methods to provide a starting point for EM, combining the virtues of both approaches).

The key insight that makes spectral learning work is the fact that if (for example) you have a system where dependencies between past and future are mediated by some low-dimensional latent variable (e.g., as you generally do in the Kalman filter or dynamic factor models), the cross-covariance matrix between past and future observations will be low rank.  Geoff Gordon refers to the latent state as a kind of bottleneck:

SpectralBottleneck

You just have to do a little work to figure out how the singular vectors can be manipulated to provide an estimate of the model parameters.

The resulting method is so easy that you can estimate the Kalman Filter parameters with about 5 lines of Matlab. (See Byron Boots slides for a very clear and beautiful derivation). I’ll give it here just to provide a taste.

Spectral Learning of Kalman Filter Model

Consider the following (linear) discrete-time system:

  • x_{t+1} = A x_t + noise      (latent dynamics)
  • o_t = C x_t + noise   (observation equation) 

where observation vector o_t is higher-dimensional than latent vector x_t. Begin by defining the k-lagged cross-covariance matrix as \Sigma_k = \mathbb E [o_{t+k}\, o_t^T ] .

We can estimate the dynamics and observation matrices (up to a change of basis) via:

  • \hat A = U^T \Sigma_2 (U^T \Sigma_1)^{\dagger}
  • \hat C = U \hat A ^{-1},

where U denotes the n left singular vectors of \Sigma_1, and {\dagger} denotes pseudoinverse.  (Seems miraculous, huh? See Byron’s slides; the proof is very simple.)  Intuitively, this makes sense because the lag-2 covariance \Sigma_2 will involve two multiplications by the dynamics matrix A, and the pseudoinverse involving the \Sigma_1 will remove one of them.

Apparently, this is an old trick in the engineering literature, but it seems to have been picked up in machine learning only in the last 5-10 years, where it’s since been extended and applied to a wide variety of latent variable models.

One of our current research interests is to determine whether the moment-based methods we’ve been developing for spike train analysis (e.g., Bayesian Spike-Triggered Covariance) qualify as spectral methods, and (of course) how to better exploit use of this general class of ideas for modeling neural data.

Lab Meeting 2/4/2013: Asymptotically optimal tuning curve in Lp sense for a Poisson neuron

Optimal tuning curve is the best transformation of the stimulus into neural firing pattern (usually firing rate) under certain constraints and optimality criterion. The following paper I saw at NIPS 2012 was related to what we are doing, so we took a deeper look into it.

Wang, Stocker & Lee (NIPS 2012), Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum Lp loss.

The paper assumes a single neuron encoding a 1 dimensional stimulus, governed by a distribution \pi(s). The neuron is assumed to be Poisson (pure rate code). The neuron’s tuning curve h(s) is smooth, monotonically increasing (with h'(s) > c), and has a limited minimum and maximum firing rate as its constraint. Authors assume asymptotic regime for MLE decoding where the observation time T is long enough to apply asymptotic normality theory (and convergence of p-th moments) of MLE.

The authors show that there is a 1-to-1 mapping between the tuning curve and the Fisher information I under these constraints. Then for various loss functions, they derive the optimal tuning curve using calculus of variations. In general, to minimize the Lp loss E\left[ |\hat s - s|^p \right] under the constraints, the optimal (squared) tuning curve is:

\sqrt(h(s)) = \sqrt{h_{min}} + (\sqrt{h_{max}} - \sqrt{h_{min}}) \frac{\int_{-\infty}^s \pi(t)^{1/(p+1)} \mathrm{d}t}{\int_{-\infty}^\infty \pi(t)^{1/(p+1)} \mathrm{d}t}

Furthermore, in the limit of p \to 0, the optimal solution corresponds to the infomax solution (i.e., optimum for mutual information loss). However, all the analysis is only in the asymptotic limit, where the Cramer-Rao bound is attained by the MLE. For the case of mutual information, unlike noise-less case where the optimal tuning curve becomes the stimulus CDF (Laughlin), for Poisson noise, it turns out to be the square of the stimulus CDF. I have plotted the differences below for a normal distribution (left) and a mixture of normals (right):

Comparison of optimal tuning curves

The results are very nice, and I’d like to see more results with stimulus noise and with population tuning assumptions.