# Code released for “Binary Pursuit” spike sorting

At long last, I’ve finished cleaning, commenting, and packaging up code for binary pursuit spike sorting, introduced in our 2013 paper in PLoS ONE.  You can download the Matlab code here (or on github), and there’s a simple test script to illustrate how to use it on a simulated dataset.

The method relies on a generative model (of the raw electrode data) that explicitly accounts for the superposition of spike waveforms. This allows it to detect synchronous and overlapping spikes in multi-electrode recordings, which clustering-based methods (by design) fail to do.

If you’d like to know more (but don’t feel like reading the paper), I wrote a blog post describing the basic intuition (and the cross-correlation artifacts that inspired us to develop it in the first place) back when the paper came out (link).

# Subunit models for characterizing responses of sensory neurons

On July 28th, I presented the following paper in lab meeting:

This paper proposes a new method for characterizing the multi-dimensional stimulus selectivity of sensory neurons. The main idea is that, instead of thinking of neurons as projecting high-dimensional stimuli into an arbitrary low-dimensional feature space (the view underlying characterization methods like STA, STC, iSTAC, GQM, MID, and all their rosy-cheeked cousins), it might be more useful / parsimonious to think of neurons as performing a projection onto convolutional subunits. That is, rather than characterizing stimulus selectivity in terms of a bank of arbitrary linear filters, it might be better to consider a subspace defined by translated copies of a single linear filter.

Fig 1 illustrates the rationale rather beautifully:

Panel (a) shows a bank of subunit filters — translated copies of a single filter (here scaled by the weighting of each subunit). The output of each subunit filter passes through a point nonlinearity, and the resulting values are summed to form the neural output. Panel (b) gives the corresponding geometric picture: the arrows labeled $\vec k_1, \ldots, \vec k_4$ indicate vectors defined by the subunit filter shifted to positions 1, …, 4, respectively. Note that these vectors lie on a hyper-sphere (since translations do not change the length of a vector). The span of these vectors defines a multi-dimensional space of stimuli that will excite the neuron (assuming their projection is positive, so perhaps we should say half-space). But if we were to attempt to characterize this space using spike-triggered covariance analysis (panel c), we would obtain a basis defined by a collection of orthonormal vectors. (The slightly subtle point being made here is that the number of vectors we can recover grows with the size of the dataset — so in the bottom plot with $10^6$ datapoints, there are four “significant” eigenvalues corresponding to a four-dimensional subspace). The four eigenvectors $\vec e_1, \ldots, \vec e_4$ span the same subspace as $\vec k_1, \ldots \vec k_4$, but they have been orthogonalized, and their relationship to the original subunit filter shape is obscured.

If you think about it, this picture is devastating for STC analysis. The subunit model gives rise to a 4-dimensional feature space with a single filter shifted to 4 different locations (which requires $n + 4$ parameters), whereas STC analysis requires four distinct filters ($4n$ parameters). But the real calamity comes in trying to estimate the nonlinear mapping from feature space to spike rate. For STC analysis, we need to estimate an arbitrary nonlinear function in 4 dimensions. That is, we need to go into the 4-dimensional space spanned by filters $\vec e_1, \ldots, \vec e_4$ and figure out how much the neuron spikes for each location in that 4D hypercube.  In the subunit model, by contrast, the nonlinear behavior is assumed to arise from a (rectifying) scalar nonlinearity applied to the output of each subunit filter. We need only a single 1D nonlinearity and a set of 4 weights.

So clearly, if a neuron’s response is governed by a linear combination of nonlinear subunits, it’s crazy to attempt to go about characterizing it with STC analysis.

This idea is not entirely new, of course. Nicole Rust’s 2005 paper suggested that the derivative-like filters revealed by spike-triggered covariance analysis of V1 responses might be a signature of nonlinear subunits (Rust et al, Neuron 2005). Kanaka Rajan & Bill Bialek made a related observation about STC analysis (Rajan & Bialek 2013), noting that many of the simple computations we might expect a neuron to perform (e.g., motion selectivity) lead to relatively high-dimensional features spaces (as opposed to the low-dimensional subspaces expected by STC). Tatyanya Sharpee’s lab has introduced a method for characterizing translation-invariant receptive fields inspired by maximally-informative-dimensions estimator (Eickenberg et al 2012). And of course, the idea of nonlinear subunits is much older, going back to at least 1960s work in retina (e.g., Barlow & Levick 1965).

The Vintch et al paper describes a particular method for characterizing the subunit model, which they formulate as a linear-nonlinear-linear (LNL) model from stimulus $\vec x$ to response $r$ given by:

$r = \sum_i w_i f_{\theta}(\vec k_{(i)} \cdot \vec x)$ + noise

where $\vec k_{(i)}$ denotes the subunit filter $\vec k$ shifted to the $i$‘th stimulus position, $f_\theta(\cdot)$ is the subunit nonlinearity, and $w_i$ is the weight on the output of the $i$‘th subunit.  I won’t belabor the details of the fitting procedure, but the results are quite beautiful, showing that one can obtain substantially more accurate prediction of V1 responses (with far fewer parameters!) using a model with separate populations of excitatory and inhibitory subunits.

We finished lab meeting with a brief discussion of some new ideas I’ve been working on with Anqi and Memming, aimed at developing fast, scalable methods for fitting high-dimensional subunit models. So watch this space, and hopefully we’ll have something new to report in a few months!

References

1. Efficient and direct estimation of a neural subunit model for sensory coding. Vintch, Zaharia, Movshon, & Simoncelli, NIPS 2012.
2. Spatiotemporal elements of macaque V1 receptive fields.  Rust, Schwartz, Movshon & Simoncelli, Neuron 2005.
3. Maximally Informative “Stimulus Energies” in the Analysis of Neural Responses to Natural Signals.  Rajan & Bialek, PLoS ONE 2013.
4. Characterizing Responses of Translation-Invariant Neurons to Natural Stimuli: Maximally Informative Invariant Dimensions. Eickenberg,Rowekamp, Kouh, & Sharpee.  Neural Comp., 2012.
5. The mechanism of directionally selective units in rabbit’s retina. Barlow & Levick, J Physiol 178(3):1965.

# Quantifying the effect of intertrial dependence on perceptual decisions

Today we discussed a paper on sequential effects in psychophysics by Fründ et al. Although inter-trial dependencies are known to exist, psychophysical responses are typically modeled as independent Bernoulli observations. One way to account for the effect of previous trial outcomes is to use logistic regression with terms that represent previous stimuli or responses (Busse et al). Fründ and colleagues extend this approach by including a lapse rate which captures responses where the subject was not doing the task. They propose a latent variable on each trial, $l_t$ that is in one of three possible states: 0,1,2 (“guesses left”, “guesses right”, “does task”). The probability of a response is dependent on the stimulus on that trial, the outcome of previous trials and the state of the latent variable:

where $x_t$ is the concatenation of stimulus and history terms for that trial, $\omega$ is weights on the stimulus and history terms and $g(x)$ is $\frac{1}{1+e^{-x}}$

The authors use Expectation-Maximization to fit the weights, $\omega$, and the probabilities of the latent states, $p$. They then compare model fits that include history terms to fits without them for four different psychophysical tasks.

The main results are: a small gain in likelihood per trial for the full model with history terms, thresholds are a few percent lower if history is included, and the effect of history is strongest for low signal conditions and disappears with high signal strength. Probably the coolest figure is 4c where they demonstrate how history weights reveal different task strategies for the different psychophysical experiments.

The different colors are the four different tasks, each dot is a different subject.  Red and Blue (both visual 2AFC tasks) seem safely in ‘switch’ or ‘win stay – lose switch’. All subjects maintain a ‘stay’ strategy for Yellow (an Auditory Yes-No task).

Does it matter? The authors point out that all though the bias in estimated parameters introduced by inter-trial dependence is systematic, it is small. They suggest that it is particularly important to account for history effects in the analysis of neurophysiology and it’s correlation with choice.

References:

• Fründ I, Wichmann FA, Macke JH. (2014). Quantifying the effect of intertrial dependence on perceptual decisions. Journal of Vision. 14(7). pii: 9. doi: 10.1167/14.7.9.
• Busse, L., Ayaz, A., Dhruv, N. T., Katzner, S., Saleem, A. B., Scholvinck, M. L., et al. (2011). The Detection of Visual Contrast in the Behaving Mouse. Journal of Neuroscience, 31(31), 11351–11361. doi:10.1523/JNEUROSCI.6689-10.2011

# Partitioning Neural Variability

On July 7, we discussed Partitioning Neural Variability by Gorris et al.  In this paper, the authors seek to isolate the portion of the variability of sensory neurons that comes from non-sensory sources such as arousal or attention.  In order to partition the variability in a principled way, the authors propose a “modulated Poisson framework” for spiking neurons, in which a neuron produces spikes according to a Poisson process whose mean rate is the product of a stimulus-driven component $f(S)$ , and a stimulus-independent ‘gain’ term (G).

Applying the law of total variance, the variance for the number of spikes $N$ over a time period $\Delta t$ can then be written as $var[N | S, \Delta t] = f(S) \Delta t + \sigma^2_G (f(S)\Delta t)^2$.  Moreover, with a gamma prior over $G$, the distribution for $N | S, \Delta t$ has a negative binomial distribution, which can be fit to neural data.

The authors analyzed quite a large data set, from neurons in macaque LGN, V1, V2, and MT stimulated with drifting gratings.  Their decomposition of the variance of $N$ (above) motivates the corresponding decomposition of within-condition sum-of-squares into components arising “from the point process” (i.e. corresponding to the $f(S) \Delta t$ term above) and from the gain signal.  Using this decomposition they can estimate of the fraction of the within-condition variance that comes from the fluctuations of $G$.  They find that the modulated Poisson model fits the model better than the Poisson model, and that the gain share of variance is quite high (47.5%, compared to 47% due to stimulus drive and only 5.5% from the Poisson noise), and that it increases along the visual pathway.

The authors also write down an analogous decomposition for the covariance of two neurons into a “point-process” and gain components.  A strength of this form for the covariance is that it allows for spike-count correlations to vary dramatically with stimulus drive.  This decomposition of pairwise tuning correlation into “point-process” and gain components is then used for an analysis of correlations with respect to stimulus similarity, cortical distance, and temporal structure.

# Fast Marginal Likelihood Maximisation for Sparse Bayesian Models

In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models, by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for performing maximum marginal likelihood (or “evidence optimization”) in a sparse linear model (also known as the Relevance vector machine).

The basic setup is:

$\quad y = \phi(x)^T w + \epsilon, \quad$  $\epsilon \sim \mathcal{N}(0,\sigma^2)$

and an independent zero-mean Gaussian prior on each element of $w$:

$w_i \sim \mathcal{N}(0,1/\alpha_i)$,

where $\alpha_i$ denotes the prior precision of each weight.

The basic idea is to set the hyperparameters $\{\alpha_i\}$  by maximizing marginal likelihood $P(Y|X,\{\alpha_i\})$. The resulting estimate for $w$ under the prior $P(w | \{\alpha_i\})$ will be sparse because many of the $\alpha_i$ will go to infinity (meaning prior variance for the corresponding weights is set to zero).

The OLD way to optimize this model (“ARD”) is to update hyperparameters $\alpha_i$ and $\sigma^2$ iteratively, but the draw back is the initial $\alpha$ vector would be a huge, and every matrix operation would be time consuming with large scale data. When an $\alpha_i$ reaches infinity, it’s abandoned and $\alpha$ vector has fewer and fewer elements. Thus it’s a reduction process and any abandoned $\alpha$ would not be re-estimated.

In contrast, THIS paper proposed an adding and modifying process. It initializes with an empty model and sequentially adds basis functions (i.e., sets $\alpha_i$‘s to something finite) to increase the marginal likelihood. Within the same framework, it can increase the objective function by deleting functions which subsequently become redundant.

The key to this approach is to separate each $\alpha_i$ from all other hyperparameters, labelled “$\alpha_{-i}$“. The authors analyze $l(\alpha_i)$ to show that $L(\alpha)$ has a unique maximum with respect to $\alpha_i$. Then they update $\alpha_i$ given:
$\alpha_i=\frac{s_i^2}{q_i^2-s_i},\mbox{ if }q_i^2>s_i$
$\alpha_i=\infty,\mbox{ if }q_i^2\le s_i$
where $s_i$ and $q_i$ are statistics needed for computing $\alpha_i$. Let $\theta_i=q_i^2-s_i$, then
If $\theta_i>0$ and $\alpha_i<\infty$, re-estimate $\alpha_i$;
If $\theta_i>0$ and $\alpha_i=\infty$, add $\phi_i$ to the model with updated $\alpha_i$;
If $\theta_i\le 0$ and $\alpha_i<\infty$, delete $\phi_i$ from the model and set $\alpha_i=\infty$;

After updating $\alpha_i$, it recomputes $\Sigma$ and $\mu$ (covariance and mean of posterior of $w$) and all related statistics then continue until convergence.

The paper shows that this algorithm achieves similar MSE performance and finds an even sparser estimate for $w$ in less time than the old method. It’s quite clever to use the separable property of marginal likelihood to optimize the $\alpha_i$ sequentially. It could also be borrowed to solve similar problems with independent $\alpha$ vector.

# Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it’s called noise correlation because the signal (stimulus) drive component has been subtracted out). In a 2010 science paper [2], noise correlation was shown to be much smaller than previously reported; in the range of 0.01 compared to the usual 0.1-0.2 range and stirred up the field (see [3] for a list of values). In this paper, they argue that this difference in noise correlation magnitude is due to population level covariations during anesthesia (they used sufentanil).

They showed that a factor model with 1-dimensional latent process captures the excess noise correlations observed during anesthesia. They extended GPFA [4] to include a linear stimulus drive component, and fit the trials corresponding to the same stimulus. This model assumes that the square-root transformed spike counts in 100 ms bins $y(t)$ follows,

$y(t) \sim \mathcal{N}(f(s(t)) + cx(t) + d, R)$

$x(t) \sim \mathcal{N}(0, K)$

where $s(t)$ is the stimulus, $f(\cdot)$ is the tuning curve drive, $d$ is the mean, $R$ represents the individual (diagonal) covariance, $c$ is the factor loadings vector, and $x(t)$ is the 1-dimensional population wide latent process with assumed smoothness given by the kernel $K(t_1, t_2) = \exp(-(t_1 - t_2)^2 / 2 / \tau^2 )$. The square-root transformation is for the variance stabilization, making Poisson-like spike count observations more like Gaussian distributed.

They show that the most of the extra noise correlation observed during anesthesia is explained by the 1D latent process, and the remaining noise correlation is in par with awake condition, while noise correlations in the awake data didn’t get explained away by it (figure below).

A remarkable result! (red) anesthesia, (blue) awake (solid) raw noise correlation (dashed) residual noise correlation unexplained by the 1D-latent variable

Note that this nonlinear relation between the mean rate and noise correlation is not a natural consequence of the generative model they assumed (equation above), since the noise is additive while firing rate is modulated by stimulus. That’s why they had to fit different factor loadings vector $c$ for different stimulus conditions. However, they show that a multiplicative LNP model where a 1D latent process multiplicatively interacts with the stimulus drive in the log firing rate does reproduce this quality (Fig. 6). (It’s a shame that they didn’t actually fit this more elegant generative model directly to their data.)

Furthermore, they showed that local field potentials (LFP) in 0.5-2Hz range is correlated with the inferred latent variable. Hence, low-pass filtered LFP averaged over electrodes is a good proxy for L4 population correlation under anesthesia.

1D-latent process, LFP, and LNP results are all consistent with Matteo Carandini’s soloist vs chorister story except his preparations were awake rodent as far as I recall. This story also bears similarity with [5-8]. This is certainly an exciting type of model and there are many similar models being developed (e.g. [9, 10] and references therein). I hope to see more developments in these kinds of latent variable models; they have the advantage of scaling well in terms of number of parameters, but the disadvantage of being non-convex optimization. Also, I would like to see similar analysis for awake but spontaneous (non-task performing) state.

Their data and code are all available online (awesome!).

1. Ecker, A. S., Berens, P., Cotton, R. J., Subramaniyan, M., Denfield, G. H., Cadwell, C. R., Smirnakis, S. M., Bethge, M., and Tolias, A. S. (2014). State dependence of noise correlations in macaque primary visual cortexNeuron, 82(1):235-248.
2. Ecker, A. S., Berens, P., Keliris, G. A., Bethge, M., Logothetis, N. K., and Tolias, A. S. (2010). Decorrelated neuronal firing in cortical microcircuitsScience, 327(5965):584-587.
3. Cohen, M. R. and Kohn, A. (2011). Measuring and interpreting neuronal correlationsNat Neurosci, 14(7):811-819.
4. Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., and Sahani, M. (2009). Gaussian-Process factor analysis for Low-Dimensional Single-Trial analysis of neural population activityJournal of Neurophysiology, 102(1):614-635.
5. Luczak, A., Bartho, P., and Harris, K. D. (2013). Gating of sensory input by spontaneous cortical activityThe Journal of Neuroscience, 33(4):1684-1695.
6. Goris, R. L. T., Movshon, J. A., and Simoncelli, E. P. (2014). Partitioning neuronal variabilityNat Neurosci, 17(6):858-865. [last week’s blog post]
7. Tkačik, G., Marre, O., Mora, T., Amodei, D., Berry, M. J., and Bialek, W. (2013). The simplest maximum entropy model for collective behavior in a neural networkJournal of Statistical Mechanics: Theory and Experiment, 2013(03):P03011+.
8. Okun, M., Yger, P., Marguet, S. L., Gerard-Mercier, F., Benucci, A., Katzner, S., Busse, L., Carandini, M., and Harris, K. D. (2012). Population rate dynamics and multineuron firing patterns in sensory cortexThe Journal of neuroscience : the official journal of the Society for Neuroscience, 32(48):17108-17119.
9. Pfau, D., Pnevmatikakis, E. A., and Paninski, L. (2013). Robust learning of low-dimensional dynamics from large neural ensembles. NIPS 26

# Computational Vision Course 2014 @ CSHL

Yesterday marked the start of the 2014 summer course in COMPUTATIONAL NEUROSCIENCE: VISION at Cold Spring Harbor. The course was founded in 1985 by Tony Movshon and Ellen Hildreth, with the goal of inspiring new generations of students to address problems at the intersection of vision, computation, and the brain. The list of past attendees is impressive.

For the next two weeks, students will have two 3-hour lectures per day from a smattering of leading experimental and theoretical neuroscientists (full list here), punctuated with breaks for conversation and recreation (swimming, soccer, frisbee, tennis, ping pong, squirt gun battles, to name a few) on the lush grounds of the Banbury Center at CSHL. Evenings will be devoted to informal discussions, tutorials, and to course projects inspired by lectures and/or interactions with students, TAs, and lecturers.

The course is a ridiculous amount of fun, despite the number of equations. I first attended as a second-year grad student in the summer of 2000 (part of a cohort that included Alex Huk, Anne Churchland, Hannah Bayer, Stephen David, Odelia Schwartz, and Richard Murray), and I returned in 2002 as TA alongside Anne Churchland. It’s an honor to be back this year as a co-organizer (with Geoff Boynton, Greg Horwitz, and Stefan Treue), and I’m joined by Jake Yates and two other students from UT.

Tony Movshon kicked things off yesterday with characteristic Tony levels of charm and wit. Here’s a short sampling of quotes:

• “Matlab is an abomination. But I want you to learn as much as possible so that when you come to my lab, you will know how to get whatever it is you need to get done, done.”
• “Only a theorist could use the terms ‘actual’ and ‘simulated’ simultaneously when referring to data.”
• “The brain works the way it does because it’s made of meat, and meat is not deterministic.”
• “Did I mention how dumb the motor system is?”