Dynamic mode decomposition for large-scale neuron activity

On March 23th, I presented the following paper in lab meeting:

The purpose of the paper is to describe dynamic mode decomposition (DMD), a method from applied mathematics for dimensionality reduction of dynamical systems, and adapt it to the analysis of large-scale neural recordings.

Here’s the basic setup.  Consider measurements of an n-dimensional dynamical system at m time points:


where x_t is the n-dimensional state vector at time t (e.g., the spike count of n neurons in a single time bin). DMD attempts to describe the map from x_t to x_{t+1} with a low-rank linear dynamics matrix A, so that X=AX'.  We may think of DMD as high-dimensional regression of X' against X, followed by eigendecomposition of the regression weights A. For long time series, one may use a sliding window to model nonlinear dynamics by approximating them with a linear dynamics matrix A within each time window.

Through algebra tricks, DMD will come to a simple dynamic model \hat{X(t)}=\Phi\Lambda^tz_0, where \Phi and \Lambda are eigenvectors and eigenvalues of A and z_0 satisfies the initial condition. The magnitude of mode \phi_i represents spatial correlations between the n observable locations. The eigenvalue \lambda_i corresponds to the temporal dynamics of the spatial mode \phi_i. Specifically, its rate of growth/decay and frequency of oscillation are reflected in the magnitude and phase components of \phi_i, respectively. Therefore, DMD can be considered as as a hybrid of static mode extraction by principal components analysis (PCA) in the spatial domain and spectral transformation in the frequency domain (Fourier transform).

One crucial point the paper emphasizes is the DMD spectrum. Since the phases of eigenvalues reflect the oscillation frequency of the modes, one can plot the DMD spectrum \left(P=||\phi||^2\right)  as a function of frequency \left(f_i=\frac{\mbox{imag}(log\lambda_i)}{2\pi\Delta t}\right), and compare to the Fourier spectrum of the raw data. The paper shows that the DMD spectrum qualitatively resembles the Fourier power spectrum, but with each point as a spatial correlated mode instead of raw recordings.

Screen Shot 2015-03-26 at 1.34.31 AM

The authors carried out several analyses based on the DMD Spectrum. First, they validated the DMD approach to derive sensorimotor maps based on a simple movement task. Next, they leveraged DMD in combination with machine learning techniques to detect and characterize spindle networks present during sleep. More specifically, they used a sliding window over the entire dynamic raw data. Within each window, they calculated the DMD spectrum and picked the modes with power exceeding 1/f^\alpha distribution curve and collected all such modes into a library. Finally, all the modes in the library were clustered by Gaussian mixture model into centroids as stereotypes of spindle network. Results look pretty tidy and beautiful but there are still some ambiguities about experimental setting and design.

In summary, this paper shows how DMD can be used to extract coherent patterns by decomposing vector time-series data into a low-dimensional representation in both space and time, and that it can be applied to large-scale neuron recordings (ECoG data). However, it’s slightly unclear how novel DMD is relative to existing methods used in engineering and neuroscience. Mathematically, DMD seems to correspond to a spectral method for estimating a low-dimensional latent linear dynamical systems model (i.e., a Kalman filter model) with “innovations” noise and observation noise set to zero.  DMD was originally introduced in fluid physics, where one seeks to characterize the behavior of (deterministic) nonlinear dynamical system described by a PDE. It would therefore be interesting to flesh out the connections between DMD and the Kalman filter more explicitly, and to examine how the two approaches might be combined or extended to understand the dynamical structure in large-scale neuron activity.

Code released for “Binary Pursuit” spike sorting

At long last, I’ve finished cleaning, commenting, and packaging up code for binary pursuit spike sorting, introduced in our 2013 paper in PLoS ONE.  You can download the Matlab code here (or on github), and there’s a simple test script to illustrate how to use it on a simulated dataset.

The method relies on a generative model (of the raw electrode data) that explicitly accounts for the superposition of spike waveforms. This allows it to detect synchronous and overlapping spikes in multi-electrode recordings, which clustering-based methods (by design) fail to do.  

If you’d like to know more (but don’t feel like reading the paper), I wrote a blog post describing the basic intuition (and the cross-correlation artifacts that inspired us to develop it in the first place) back when the paper came out (link).

Subunit models for characterizing responses of sensory neurons

On July 28th, I presented the following paper in lab meeting:

This paper proposes a new method for characterizing the multi-dimensional stimulus selectivity of sensory neurons. The main idea is that, instead of thinking of neurons as projecting high-dimensional stimuli into an arbitrary low-dimensional feature space (the view underlying characterization methods like STA, STC, iSTAC, GQM, MID, and all their rosy-cheeked cousins), it might be more useful / parsimonious to think of neurons as performing a projection onto convolutional subunits. That is, rather than characterizing stimulus selectivity in terms of a bank of arbitrary linear filters, it might be better to consider a subspace defined by translated copies of a single linear filter.

Fig 1 illustrates the rationale rather beautifully:
Panel (a) shows a bank of subunit filters — translated copies of a single filter (here scaled by the weighting of each subunit). The output of each subunit filter passes through a point nonlinearity, and the resulting values are summed to form the neural output. Panel (b) gives the corresponding geometric picture: the arrows labeled \vec k_1, \ldots, \vec k_4 indicate vectors defined by the subunit filter shifted to positions 1, …, 4, respectively. Note that these vectors lie on a hyper-sphere (since translations do not change the length of a vector). The span of these vectors defines a multi-dimensional space of stimuli that will excite the neuron (assuming their projection is positive, so perhaps we should say half-space). But if we were to attempt to characterize this space using spike-triggered covariance analysis (panel c), we would obtain a basis defined by a collection of orthonormal vectors. (The slightly subtle point being made here is that the number of vectors we can recover grows with the size of the dataset — so in the bottom plot with 10^6 datapoints, there are four “significant” eigenvalues corresponding to a four-dimensional subspace). The four eigenvectors \vec e_1, \ldots, \vec e_4 span the same subspace as \vec k_1, \ldots \vec k_4, but they have been orthogonalized, and their relationship to the original subunit filter shape is obscured.

If you think about it, this picture is devastating for STC analysis. The subunit model gives rise to a 4-dimensional feature space with a single filter shifted to 4 different locations (which requires n + 4 parameters), whereas STC analysis requires four distinct filters (4n parameters). But the real calamity comes in trying to estimate the nonlinear mapping from feature space to spike rate. For STC analysis, we need to estimate an arbitrary nonlinear function in 4 dimensions. That is, we need to go into the 4-dimensional space spanned by filters \vec e_1, \ldots, \vec e_4 and figure out how much the neuron spikes for each location in that 4D hypercube.  In the subunit model, by contrast, the nonlinear behavior is assumed to arise from a (rectifying) scalar nonlinearity applied to the output of each subunit filter. We need only a single 1D nonlinearity and a set of 4 weights.

So clearly, if a neuron’s response is governed by a linear combination of nonlinear subunits, it’s crazy to attempt to go about characterizing it with STC analysis.

This idea is not entirely new, of course. Nicole Rust’s 2005 paper suggested that the derivative-like filters revealed by spike-triggered covariance analysis of V1 responses might be a signature of nonlinear subunits (Rust et al, Neuron 2005). Kanaka Rajan & Bill Bialek made a related observation about STC analysis (Rajan & Bialek 2013), noting that many of the simple computations we might expect a neuron to perform (e.g., motion selectivity) lead to relatively high-dimensional features spaces (as opposed to the low-dimensional subspaces expected by STC). Tatyanya Sharpee’s lab has introduced a method for characterizing translation-invariant receptive fields inspired by maximally-informative-dimensions estimator (Eickenberg et al 2012). And of course, the idea of nonlinear subunits is much older, going back to at least 1960s work in retina (e.g., Barlow & Levick 1965).

The Vintch et al paper describes a particular method for characterizing the subunit model, which they formulate as a linear-nonlinear-linear (LNL) model from stimulus \vec x to response r given by:

r = \sum_i w_i f_{\theta}(\vec k_{(i)} \cdot \vec x) + noise

where \vec k_{(i)} denotes the subunit filter \vec k shifted to the i‘th stimulus position, f_\theta(\cdot) is the subunit nonlinearity, and w_i is the weight on the output of the i‘th subunit.  I won’t belabor the details of the fitting procedure, but the results are quite beautiful, showing that one can obtain substantially more accurate prediction of V1 responses (with far fewer parameters!) using a model with separate populations of excitatory and inhibitory subunits.

We finished lab meeting with a brief discussion of some new ideas I’ve been working on with Anqi and Memming, aimed at developing fast, scalable methods for fitting high-dimensional subunit models. So watch this space, and hopefully we’ll have something new to report in a few months!



  1. Efficient and direct estimation of a neural subunit model for sensory coding. Vintch, Zaharia, Movshon, & Simoncelli, NIPS 2012.
  2. Spatiotemporal elements of macaque V1 receptive fields.  Rust, Schwartz, Movshon & Simoncelli, Neuron 2005.
  3. Maximally Informative “Stimulus Energies” in the Analysis of Neural Responses to Natural Signals.  Rajan & Bialek, PLoS ONE 2013.
  4. Characterizing Responses of Translation-Invariant Neurons to Natural Stimuli: Maximally Informative Invariant Dimensions. Eickenberg,Rowekamp, Kouh, & Sharpee.  Neural Comp., 2012.
  5. The mechanism of directionally selective units in rabbit’s retina. Barlow & Levick, J Physiol 178(3):1965.

Quantifying the effect of intertrial dependence on perceptual decisions

Today we discussed a paper on sequential effects in psychophysics by Fründ et al. Although inter-trial dependencies are known to exist, psychophysical responses are typically modeled as independent Bernoulli observations. One way to account for the effect of previous trial outcomes is to use logistic regression with terms that represent previous stimuli or responses (Busse et al). Fründ and colleagues extend this approach by including a lapse rate which captures responses where the subject was not doing the task. They propose a latent variable on each trial, l_t that is in one of three possible states: 0,1,2 (“guesses left”, “guesses right”, “does task”). The probability of a response is dependent on the stimulus on that trial, the outcome of previous trials and the state of the latent variable:


where x_t is the concatenation of stimulus and history terms for that trial, \omega is weights on the stimulus and history terms and g(x) is \frac{1}{1+e^{-x}}

The authors use Expectation-Maximization to fit the weights, \omega, and the probabilities of the latent states, p. They then compare model fits that include history terms to fits without them for four different psychophysical tasks.

The main results are: a small gain in likelihood per trial for the full model with history terms, thresholds are a few percent lower if history is included, and the effect of history is strongest for low signal conditions and disappears with high signal strength. Probably the coolest figure is 4c where they demonstrate how history weights reveal different task strategies for the different psychophysical experiments.


The different colors are the four different tasks, each dot is a different subject.  Red and Blue (both visual 2AFC tasks) seem safely in ‘switch’ or ‘win stay – lose switch’. All subjects maintain a ‘stay’ strategy for Yellow (an Auditory Yes-No task).

Does it matter? The authors point out that all though the bias in estimated parameters introduced by inter-trial dependence is systematic, it is small. They suggest that it is particularly important to account for history effects in the analysis of neurophysiology and it’s correlation with choice.


  • Fründ I, Wichmann FA, Macke JH. (2014). Quantifying the effect of intertrial dependence on perceptual decisions. Journal of Vision. 14(7). pii: 9. doi: 10.1167/14.7.9.
  • Busse, L., Ayaz, A., Dhruv, N. T., Katzner, S., Saleem, A. B., Scholvinck, M. L., et al. (2011). The Detection of Visual Contrast in the Behaving Mouse. Journal of Neuroscience, 31(31), 11351–11361. doi:10.1523/JNEUROSCI.6689-10.2011

Partitioning Neural Variability

On July 7, we discussed Partitioning Neural Variability by Gorris et al.  In this paper, the authors seek to isolate the portion of the variability of sensory neurons that comes from non-sensory sources such as arousal or attention.  In order to partition the variability in a principled way, the authors propose a “modulated Poisson framework” for spiking neurons, in which a neuron produces spikes according to a Poisson process whose mean rate is the product of a stimulus-driven component f(S) , and a stimulus-independent ‘gain’ term (G).

Applying the law of total variance, the variance for the number of spikes N over a time period \Delta t can then be written as var[N | S, \Delta t] = f(S) \Delta t + \sigma^2_G (f(S)\Delta t)^2.  Moreover, with a gamma prior over G, the distribution for N | S, \Delta t has a negative binomial distribution, which can be fit to neural data.

The authors analyzed quite a large data set, from neurons in macaque LGN, V1, V2, and MT stimulated with drifting gratings.  Their decomposition of the variance of N (above) motivates the corresponding decomposition of within-condition sum-of-squares into components arising “from the point process” (i.e. corresponding to the f(S) \Delta t term above) and from the gain signal.  Using this decomposition they can estimate of the fraction of the within-condition variance that comes from the fluctuations of G.  They find that the modulated Poisson model fits the model better than the Poisson model, and that the gain share of variance is quite high (47.5%, compared to 47% due to stimulus drive and only 5.5% from the Poisson noise), and that it increases along the visual pathway.

The authors also write down an analogous decomposition for the covariance of two neurons into a “point-process” and gain components.  A strength of this form for the covariance is that it allows for spike-count correlations to vary dramatically with stimulus drive.  This decomposition of pairwise tuning correlation into “point-process” and gain components is then used for an analysis of correlations with respect to stimulus similarity, cortical distance, and temporal structure.

Fast Marginal Likelihood Maximisation for Sparse Bayesian Models

In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models, by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for performing maximum marginal likelihood (or “evidence optimization”) in a sparse linear model (also known as the Relevance vector machine).

The basic setup is:

\quad y = \phi(x)^T w + \epsilon, \quad  \epsilon \sim \mathcal{N}(0,\sigma^2)

and an independent zero-mean Gaussian prior on each element of w:

w_i \sim \mathcal{N}(0,1/\alpha_i),

where \alpha_i denotes the prior precision of each weight.

The basic idea is to set the hyperparameters \{\alpha_i\}  by maximizing marginal likelihood P(Y|X,\{\alpha_i\}). The resulting estimate for w under the prior P(w | \{\alpha_i\}) will be sparse because many of the \alpha_i will go to infinity (meaning prior variance for the corresponding weights is set to zero).

The OLD way to optimize this model (“ARD”) is to update hyperparameters \alpha_i and \sigma^2 iteratively, but the draw back is the initial \alpha vector would be a huge, and every matrix operation would be time consuming with large scale data. When an \alpha_i reaches infinity, it’s abandoned and \alpha vector has fewer and fewer elements. Thus it’s a reduction process and any abandoned \alpha would not be re-estimated.

In contrast, THIS paper proposed an adding and modifying process. It initializes with an empty model and sequentially adds basis functions (i.e., sets \alpha_i‘s to something finite) to increase the marginal likelihood. Within the same framework, it can increase the objective function by deleting functions which subsequently become redundant.

The key to this approach is to separate each \alpha_i from all other hyperparameters, labelled “\alpha_{-i}“. The authors analyze l(\alpha_i) to show that L(\alpha) has a unique maximum with respect to \alpha_i. Then they update \alpha_i given:
\alpha_i=\frac{s_i^2}{q_i^2-s_i},\mbox{ if }q_i^2>s_i
\alpha_i=\infty,\mbox{ if }q_i^2\le s_i
where s_i and q_i are statistics needed for computing \alpha_i. Let \theta_i=q_i^2-s_i, then
If \theta_i>0 and \alpha_i<\infty, re-estimate \alpha_i;
If \theta_i>0 and \alpha_i=\infty, add \phi_i to the model with updated \alpha_i;
If \theta_i\le 0 and \alpha_i<\infty, delete \phi_i from the model and set \alpha_i=\infty;

After updating \alpha_i, it recomputes \Sigma and \mu (covariance and mean of posterior of w) and all related statistics then continue until convergence.

The paper shows that this algorithm achieves similar MSE performance and finds an even sparser estimate for w in less time than the old method. It’s quite clever to use the separable property of marginal likelihood to optimize the \alpha_i sequentially. It could also be borrowed to solve similar problems with independent \alpha vector.

Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it’s called noise correlation because the signal (stimulus) drive component has been subtracted out). In a 2010 science paper [2], noise correlation was shown to be much smaller than previously reported; in the range of 0.01 compared to the usual 0.1-0.2 range and stirred up the field (see [3] for a list of values). In this paper, they argue that this difference in noise correlation magnitude is due to population level covariations during anesthesia (they used sufentanil).

They showed that a factor model with 1-dimensional latent process captures the excess noise correlations observed during anesthesia. They extended GPFA [4] to include a linear stimulus drive component, and fit the trials corresponding to the same stimulus. This model assumes that the square-root transformed spike counts in 100 ms bins y(t) follows,

y(t) \sim \mathcal{N}(f(s(t)) + cx(t) + d, R)

x(t) \sim \mathcal{N}(0, K)

where s(t) is the stimulus, f(\cdot) is the tuning curve drive, d is the mean, R represents the individual (diagonal) covariance, c is the factor loadings vector, and x(t) is the 1-dimensional population wide latent process with assumed smoothness given by the kernel K(t_1, t_2) = \exp(-(t_1 - t_2)^2 / 2 / \tau^2 ). The square-root transformation is for the variance stabilization, making Poisson-like spike count observations more like Gaussian distributed.

They show that the most of the extra noise correlation observed during anesthesia is explained by the 1D latent process, and the remaining noise correlation is in par with awake condition, while noise correlations in the awake data didn’t get explained away by it (figure below).

Noise correlation as a function of geometric mean of the neurons

A remarkable result! (red) anesthesia, (blue) awake (solid) raw noise correlation (dashed) residual noise correlation unexplained by the 1D-latent variable

Note that this nonlinear relation between the mean rate and noise correlation is not a natural consequence of the generative model they assumed (equation above), since the noise is additive while firing rate is modulated by stimulus. That’s why they had to fit different factor loadings vector c for different stimulus conditions. However, they show that a multiplicative LNP model where a 1D latent process multiplicatively interacts with the stimulus drive in the log firing rate does reproduce this quality (Fig. 6). (It’s a shame that they didn’t actually fit this more elegant generative model directly to their data.)

Furthermore, they showed that local field potentials (LFP) in 0.5-2Hz range is correlated with the inferred latent variable. Hence, low-pass filtered LFP averaged over electrodes is a good proxy for L4 population correlation under anesthesia.

1D-latent process, LFP, and LNP results are all consistent with Matteo Carandini’s soloist vs chorister story except his preparations were awake rodent as far as I recall. This story also bears similarity with [5-8]. This is certainly an exciting type of model and there are many similar models being developed (e.g. [9, 10] and references therein). I hope to see more developments in these kinds of latent variable models; they have the advantage of scaling well in terms of number of parameters, but the disadvantage of being non-convex optimization. Also, I would like to see similar analysis for awake but spontaneous (non-task performing) state.

Their data and code are all available online (awesome!).

  1. Ecker, A. S., Berens, P., Cotton, R. J., Subramaniyan, M., Denfield, G. H., Cadwell, C. R., Smirnakis, S. M., Bethge, M., and Tolias, A. S. (2014). State dependence of noise correlations in macaque primary visual cortexNeuron, 82(1):235-248.
  2. Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Logothetis, N. K., and Tolias, A. S. (2010). Decorrelated neuronal firing in cortical microcircuitsScience, 327(5965):584-587.
  3. Cohen, M. R. and Kohn, A. (2011). Measuring and interpreting neuronal correlationsNat Neurosci, 14(7):811-819.
  4. Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., and Sahani, M. (2009). Gaussian-Process factor analysis for Low-Dimensional Single-Trial analysis of neural population activityJournal of Neurophysiology, 102(1):614-635.
  5. Luczak, A., Bartho, P., and Harris, K. D. (2013). Gating of sensory input by spontaneous cortical activityThe Journal of Neuroscience, 33(4):1684-1695.
  6. Goris, R. L. T., Movshon, J. A., and Simoncelli, E. P. (2014). Partitioning neuronal variabilityNat Neurosci, 17(6):858-865. [last week’s blog post]
  7. Tkačik, G., Marre, O., Mora, T., Amodei, D., Berry, M. J., and Bialek, W. (2013). The simplest maximum entropy model for collective behavior in a neural networkJournal of Statistical Mechanics: Theory and Experiment, 2013(03):P03011+.
  8. Okun, M., Yger, P., Marguet, S. L., Gerard-Mercier, F., Benucci, A., Katzner, S., Busse, L., Carandini, M., and Harris, K. D. (2012). Population rate dynamics and multineuron firing patterns in sensory cortexThe Journal of neuroscience : the official journal of the Society for Neuroscience, 32(48):17108-17119.
  9. Pfau, D., Pnevmatikakis, E. A., and Paninski, L. (2013). Robust learning of low-dimensional dynamics from large neural ensembles. NIPS 26