# Fast Kronecker trick for Gaussian Process regression with “expressive” kernels

On May 11th, I presented the following paper in lab meeting:

Fast Kernel Learning for Multidimensional Pattern Extrapolation
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai and John P. Cunningham

This paper presents a method for scaling up Gaussian Process regression with structured spectral mixture (SM) kernels (from Wilson et al 2013) to multi-dimensional settings in which many (but not all) of the input points live on a grid (e.g., a 2D image in which some pixels are missing). The spectral mixture kernel is highly expressive and well suited to representation learning, but the challenge lies in applying it to large-scale data.

To address this problem, the authors developed the spectral mixture product (SMP) kernel, which is simply a product of 1D SM kernels, one per input dimension. This allows the full kernel matrix for points defined on a $P$-dimensional grid to be represented with a Kronecker product of kernel matrices along each dimension: $K = k_1 \otimes \cdots \otimes k_P$. This in turn allows the quadratic term and the log-determinant term in the GP log-likelihood to be evaluated very efficiently using eigendecompositions of these (small) component matrices, namely:

$(K + \sigma^2 I)^{-1}\mathbf{y} = Q\,(V + \sigma^2 I)^{-1} Q^\top \mathbf{y}, \qquad \log|K + \sigma^2 I| = \textstyle\sum_i \log(V_{ii} + \sigma^2),$

where $Q=Q^1\otimes Q^2\otimes \cdots\otimes Q^P$ (eigenvectors) and $V=V^1\otimes V^2\otimes \cdots\otimes V^P$ (eigenvalues), so that $K = QVQ^\top$.
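A minimal numpy sketch of the Kronecker speed-up, using small random positive-definite matrices as stand-ins for the per-dimension SM kernel matrices (at this toy scale we form the full Kronecker products explicitly just to check the identities; in practice the eigenvector product is applied via per-dimension matrix multiplications and never materialized):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    # small random symmetric positive-definite matrix
    # (a stand-in for a 1D kernel matrix evaluated on one grid dimension)
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

K1, K2 = random_psd(4), random_psd(5)   # per-dimension kernel matrices
sigma2 = 0.1                            # observation noise variance
y = rng.standard_normal(4 * 5)          # observations on the 4x5 grid

# eigendecompose each small factor instead of the full 20x20 matrix
e1, Q1 = np.linalg.eigh(K1)
e2, Q2 = np.linalg.eigh(K2)
evals = np.kron(e1, e2)                 # eigenvalues of K1 (x) K2
Q = np.kron(Q1, Q2)                     # eigenvectors of K1 (x) K2

# quadratic and log-determinant terms of the GP log-likelihood
quad = y @ (Q @ ((Q.T @ y) / (evals + sigma2)))
logdet = np.sum(np.log(evals + sigma2))

# check against the direct O(N^3) computation on the full matrix
K = np.kron(K1, K2)
assert np.allclose(quad, y @ np.linalg.solve(K + sigma2 * np.eye(20), y))
assert np.allclose(logdet, np.linalg.slogdet(K + sigma2 * np.eye(20))[1])
```

The cost savings come from only ever eigendecomposing the small per-dimension factors, whose sizes grow with the grid side lengths rather than with the total number of grid points.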

The key contribution of this paper is to show that if the inputs are not on a grid — for example, some pixels from a training image are missing — one can insert “imaginary” points to complete the grid, and place (essentially) infinite measurement noise on these observations so that they have no effect on inference. Given a dataset of $M$ observations that are not necessarily on a grid, they form a complete grid by adding $W$ imaginary observations, $y_W\sim\mathcal{N}(f_W,\epsilon^{-1}I_W)$, $\epsilon\rightarrow 0$. The total observation vector $y = [y_M, y_W ]^\top$ then has $N = M + W$ entries: $y\sim\mathcal{N} (f, D_N)$, where the noise covariance matrix is $D_N =\mbox{diag}(D_M, \epsilon^{-1}I_W)$, with $D_M = \sigma^2 I_M$. It is simple to prove that the imaginary points have no effect on inference: the moments of the resulting predictive distribution are exactly the same as those of the standard predictive distribution.
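The no-effect claim is easy to check numerically. Here is a toy sketch, with a generic squared-exponential kernel standing in for the SMP kernel, comparing the posterior mean computed from the real observations alone against the one computed from the augmented set with near-infinite noise on the imaginary points:

```python
import numpy as np

def se_kernel(a, b, ell=1.0):
    # squared-exponential kernel (a stand-in for the paper's SMP kernel)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

xM = np.array([0.0, 1.0, 3.0])          # real (observed) inputs
yM = np.array([0.5, -0.2, 0.3])
xW = np.array([2.0, 4.0])               # "imaginary" inputs completing the grid
xs = np.array([2.5])                    # test input
sig2, big = 0.1, 1e8                    # real noise; near-infinite imaginary noise

# standard posterior mean using only the M real observations
KM = se_kernel(xM, xM) + sig2 * np.eye(3)
mu_std = se_kernel(xs, xM) @ np.linalg.solve(KM, yM)

# augmented posterior: N = M + W points, huge noise on the W imaginary ones
xN = np.concatenate([xM, xW])
yN = np.concatenate([yM, np.zeros(2)])  # imaginary y-values are arbitrary
DN = np.diag([sig2, sig2, sig2, big, big])
KN = se_kernel(xN, xN) + DN
mu_aug = se_kernel(xs, xN) @ np.linalg.solve(KN, yN)

assert np.allclose(mu_std, mu_aug, atol=1e-6)
```

As the imaginary noise grows, the corresponding rows and columns of the inverse vanish, so the imaginary observations drop out of the predictive moments.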

However, the relevant matrices no longer admit the efficient Kronecker computations, because the noise covariance $D_N$ is no longer a scalar multiple of the identity. The authors get around this by using the conjugate gradient method, an iterative method for solving linear systems, to compute the quadratic term of the GP log-likelihood, and by approximating the log-determinant term.

In particular, the authors use preconditioned conjugate gradients to compute $(K_N + D_N)^{-1} y$ by solving $(K_N + D_N)x = y$ for $x$. For the log-determinant term, they use the approximation

$\log|K_M + D_M| = \sum_{i=1}^M \log(\lambda_i^M + \sigma^2) \approx \sum_{i=1}^M \log(\tilde{\lambda}_i^M + \sigma^2),$

where the eigenvalues $\lambda_i^M$ of $K_M$ are approximated by $\tilde{\lambda}_i^M = \frac{M}{N}\lambda_i^N$, with $\lambda_i^N$ the (cheaply computed) eigenvalues of the full Kronecker matrix $K_N$.
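A sketch of the quadratic-term computation with conjugate gradients, where the matrix-vector product exploits the Kronecker structure so the full $N \times N$ matrix is never formed (generic small PSD factors again, and plain rather than preconditioned CG):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(1)
A1 = rng.standard_normal((6, 6)); K1 = A1 @ A1.T + 6 * np.eye(6)
A2 = rng.standard_normal((5, 5)); K2 = A2 @ A2.T + 5 * np.eye(5)
noise = 0.1 * np.ones(30)   # diagonal of D_N (would be huge on imaginary points)

def matvec(v):
    # (K1 (x) K2 + D_N) v without forming the 30x30 Kronecker product:
    # reshape v into a 6x5 matrix and multiply by the factors on each side
    V = v.reshape(6, 5)
    return (K1 @ V @ K2.T).ravel() + noise * v

y = rng.standard_normal(30)
x, info = cg(LinearOperator((30, 30), matvec=matvec), y)
assert info == 0    # CG converged

# verify against the explicit system (only feasible at this toy scale)
assert np.allclose(np.kron(K1, K2) @ x + noise * x, y, atol=1e-3)
```

Each CG iteration costs only a few small matrix products, which is what makes the quadratic term tractable even when the grid is large.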

With these implementation tricks, they can scale up highly expressive non-parametric kernels (in some cases with hundreds of hyperparameters) to datasets exceeding $N = 10^5$ training instances, in minutes to tens of minutes. They obtain many beautiful results, including long-range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery.

The take home message for me is that smart implementation tricks can allow GPs with expressive covariances to be applied to large scale problems with non-gridded input.

# Integration dynamics and choice probabilities

Recently in lab meeting, I presented

Sensory integration dynamics in a hierarchical network explains choice probabilities in cortical area MT

Klaus Wimmer, Albert Compte, Alex Roxin, Diogo Peixoto, Alfonso Renart & Jaime de la Rocha. Nature Communications, 2015

Wimmer et al. reanalyze and reinterpret a classic dataset of neural recordings from MT while monkeys perform a motion discrimination task. The classic result shows that the firing rates of neurons in MT are correlated with the monkey’s choice, even when the stimulus is the same. This covariation of neural activity and choice, termed choice probability, could indicate sensory variability causing behavioral variability or it could result from top-down signals that reflect the monkey’s choice. To investigate the source of choice probabilities, the authors use a two-stage, hierarchical network model of integrate and fire neurons tuned to mimic the dynamics of MT and LIP neurons and compare the model to what they find in the data.

The figure above shows the two stages. The key manipulation the authors employ is to change the amount of feedback from the integration stage to the sensory stage. In their model, variability in the sensory stage can arise from three possible sources: (1) stimulus variability (the red and blue squiggles that feed the E1 and E2 populations); (2) background noise (X); and (3) feedback signals (green bFB).

They find that varying the amount of feedback in this type of network yields several predictions that can be tested with existing data. They argue that feedback signals should account for choice probability (CP) later in the trial, which means the early part should be due to feedforward fluctuations in the stimulus. This allows them to take advantage of a stimulus manipulation in the data: some of the trials are perfect repeats of the same stimulus, whereas others have the same motion strength but are generated using different random seeds. They call these two conditions replicate and non-replicate, respectively. The model predicts that CP will be smaller for replicate trials during the early part of the motion stimulus, because there are no stimulus fluctuations. The model also predicts that correlations between cell pairs will be smaller across the board for replicate stimuli (see the comparison between the black and gray traces in both subplots below).

The authors find that both of their predictions hold in the data they have access to. They analyze CP across 41 neurons with replicate stimuli and 118 neurons with non-replicate stimuli, and look at correlations across 32 pairs. They find that neurons were less variable with replicate stimuli (subpanel a below), that correlations in an example cell pair were smaller for replicate stimuli, and that choice probability was lower earlier in the trial for replicate stimuli.

They also interrogated cell pairs to see if noise correlations were indicative of a feedback signal. They sub-selected pairs based on whether they had increasing or decreasing correlations throughout the trial. The prediction is that increasing correlations indicate more feedback, so these cells should have higher CP later in the trial, which is exactly what they found.

The authors end by suggesting that the feedback could serve to strengthen the commitment to a category despite bursts of sensory evidence. Unfortunately, this makes the simulated observer less sensitive. Overall, the paper is a valiant attempt at understanding the source of choice probabilities using decades old data. The insights remain clear: the time-course of CP and noise correlations should be informative about top-down signals during decision-making. Hopefully new data will be available soon to help constrain the space of models.

# Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares

Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014

In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able to leverage came from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

I’ve done some work on time-frequency representations in the past, and oftentimes an investigator expects to find signals that are narrow-band, but possibly amplitude-modulated, and are therefore limited in resolution by the time-frequency uncertainty principle.  However, when you formulate time-frequency representation as a Bayesian estimation problem, you can get a lot more bang for your buck.

The basic problem is to estimate the Fourier coefficients $x_n$ at time step $n$ for observations $y_n$ described by

$\bold{y}_n = \bold{F}_n \bold{x}_n + \bold{v}_n$

where $\bold{F}_n$ is the real-valued DFT matrix and $\bold{v}_n\sim \mathcal{N}(0,\sigma^2 I)$.  The usual way of doing business is to let $n$ index subsamples of the observation times and estimate $x_n$ over some window centered on $n$. Smoothness in time is achieved by having a large overlap between neighboring windows.  Ba et al. take a different approach, which obviates the need for large overlap between windows, reducing the size of $\bold{y}_n$ substantially for each time step.  The authors instead model the relationship between neighboring (in time) Fourier coefficients by

$\bold{x}_n = \bold{x}_{n-1}+\bold{w}_n$

where $\bold{w}_n$ is drawn from some distribution $p_W(w)$.

Thus, the trick is to bring prior information to bear in the least-squares problem, by specifying the structure of the solution that you expect through $p_W(w)$.

The authors offered two possibilities:

$\log p_1(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}w^2_{n,k} + \epsilon^2\right)^{1/2} +c_1$,

and

$\log p_2(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}\left(w^2_{n,k} + \epsilon^2\right)^{1/2}\right)^{1/2} +c_2$.

Now, I admit that these priors look kind of funny, but there is a good reason why these were the priors specified in this paper.  These priors are “epsilon-perturbed” $\ell_\nu$ norms, given by

$L(\bold{x}) = -\frac{1}{2}\sum^M_{i=1}(\epsilon^2+x_i^2)^{\nu/2}$.

This “epsilon-perturbed” version of the $\ell_\nu$ norms has rounded-out corners in its level curves, like in the following picture from the authors’ earlier paper:

It turns out that distributions with this form come from the normal/independent family, commonly used for robust regression and which lend themselves to EM by IRLS!

Thus $\log p_1$ is approximately an $\ell_1$ prior over frequencies (indexed by $k$) and an $\ell_2$ prior over changes in $x_n$ across time (indexed by $n$).  The latter can also be interpreted as a discrete version of the total variation norm.  The result is that $\log p_1$ biases the solutions toward signals that are sparse in frequency but smooth in time.  Alternatively, $\log p_2$ is like an $\ell_1$ norm over frequencies and an $\ell_1$ norm on transitions, also called a “fused LASSO”.  This prior will bias solutions toward signals that are sparse in frequency and change abruptly in time.

The full MAP estimate of $\bold{x}_n$ can be found by IRLS, where the algorithm ends up being repeated application of the Kalman filter equations forward in time followed by the Kalman smoother equations backward in time until convergence. The results for a simulated example with a 10Hz and an 11Hz component, where one has amplitude that ramps up exponentially fast and the other pulses, are shown in the figure at the top of the page.  An example using real EEG data looks like this:

The authors didn’t really provide any validation other than the eye-ball test, and it wasn’t clear how to evaluate performance of the method for real data sets.  However, the results do look pretty good for the test cases they illustrated. I also would have liked to get more details about choosing window width and frequency resolution for data analysis.

I think that the application of the IRLS algorithm for solving this problem is really clever, and there are a lot of possibilities for generalizing to different priors that, for example, accommodate frequency modulation. I hope to see some follow-up papers where they use the method to do some fun data analysis and maybe flesh out these details a bit more.
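For reference, the inner loop of that IRLS scheme, a Kalman filter forward pass followed by an RTS smoother backward pass for the random-walk model above, can be sketched as follows (my own sketch with fixed Gaussian state noise; the actual algorithm re-weights the state-noise covariance between passes, which is what produces the robust/sparse solutions):

```python
import numpy as np

def kalman_smooth(y, F, Q, R):
    """One forward (filter) + backward (RTS smoother) pass for the model
    x_n = x_{n-1} + w_n,  y_n = F x_n + v_n,  w_n ~ N(0, Q),  v_n ~ N(0, R)."""
    N, d = y.shape[0], F.shape[1]
    m_f, P_f = np.zeros((N, d)), np.zeros((N, d, d))
    m, P = np.zeros(d), 10.0 * np.eye(d)        # vague prior on x_0
    for n in range(N):                          # forward: Kalman filter
        m_p, P_p = m, P + Q                     # predict (random-walk dynamics)
        S = F @ P_p @ F.T + R
        K = P_p @ F.T @ np.linalg.inv(S)        # Kalman gain
        m, P = m_p + K @ (y[n] - F @ m_p), P_p - K @ F @ P_p
        m_f[n], P_f[n] = m, P
    m_s = m_f.copy()
    for n in range(N - 2, -1, -1):              # backward: RTS smoother
        G = P_f[n] @ np.linalg.inv(P_f[n] + Q)
        m_s[n] = m_f[n] + G @ (m_s[n + 1] - m_f[n])
    return m_s

# toy check: a slowly drifting scalar observed directly (F = identity)
rng = np.random.default_rng(2)
x_true = np.cumsum(0.05 * rng.standard_normal(200))
y = (x_true + 0.3 * rng.standard_normal(200))[:, None]
xs = kalman_smooth(y, np.eye(1), 0.05 ** 2 * np.eye(1), 0.3 ** 2 * np.eye(1))
assert np.mean((xs[:, 0] - x_true) ** 2) < np.mean((y[:, 0] - x_true) ** 2)
```

Each IRLS iteration is just this linear-Gaussian sweep with updated weights, which is why the whole MAP problem stays fast.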

# Attractors, chaos, and network dynamics for short term memory

In a recent lab meeting, I presented the following paper from Larry Abbott’s group:

From fixed points to chaos: Three models of delayed discrimination.
Barak, Sussillo, Romo, Tsodyks, & Abbott,
Progress in Neurobiology 103:214-222, 2013.

The paper seeks to connect the statistical properties of neurons in pre-frontal cortex (PFC) during short-term memory with those exhibited by several dynamical models for neural population responses. In a sense, it can be considered a follow-up to Christian Machens’ beautiful 2005 Science paper [2], which showed how a simple attractor model could support behavior in a two-interval discrimination task. The problem with the Machens/Brody/Romo account (which relied on mutual inhibition between two competing populations) is that it predicts extremely stereotyped response profiles, with all neurons in each population exhibiting the same profile.

But real PFC neurons don’t look like that (Fig 1):

So what’s the deal?

Neural heterogeneity in cortex is a hot topic these days, so this paper fits in with a surge of recent efforts to make sense of it (e.g., [3-6], and our recent paper Park et al 2014 [7]).  But instead of the “neural coding” question (i.e., what information do these neurons carry, and how?) this paper approaches heterogeneity from a mechanistic / dynamics standpoint: what model would suffice to give rise to these kinds of responses? In particular, it compares three dynamical models for PFC activity:

1. line attractor – Machens, Romo & Brody style, with two populations.
(We already know this won’t work!)
2. randomly connected network – chaotic network with sparse, random connectivity and a trained set of linear “readout” weights. This represents the “reservoir computing” idea so beloved by Europeans (cf. echo state network [8] / liquid machine [9]). The idea is to project the input to a nonlinear, high-dimensional feature space, where everything is linearly separable, and then do linear regression. There’s no need to tune anything except the output weights.
3. “trained” randomly connected network – similar to the FORCE-trained network of Sussillo & Abbott 2009 [10] fame: you initialize a randomly connected network then train it to be not-so-chaotic, so it does what you want it to. Less structured than #1 but more structured than #2.  (Good mnemonic here: “TRAIN”).
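Model 2’s reservoir idea is simple enough to sketch: a minimal echo state network in which only the linear readout weights are trained, by ridge regression (my own toy example, not the paper’s spiking network; the sizes, scalings, and delayed-recall task are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 200, 1000
u = rng.uniform(-1, 1, T)               # scalar input stream
target = np.roll(u, 3)                  # task: recall the input from 3 steps back

# fixed random recurrent weights, scaled for stable dynamics (echo-state property)
W = rng.standard_normal((n, n)) / np.sqrt(n) * 0.9
W_in = rng.standard_normal(n)

x = np.zeros(n)
states = np.zeros((T, n))
for t in range(T):                      # nonlinear high-dimensional expansion
    x = np.tanh(W @ x + W_in * u[t])
    states[t] = x

# the only trained parameters: the linear readout, fit by ridge regression
S = states[10:]                         # discard the initial transient
w = np.linalg.solve(S.T @ S + 1e-6 * np.eye(n), S.T @ target[10:])
mse = np.mean((S @ w - target[10:]) ** 2)
assert mse < 0.05                       # readout recovers the delayed input
```

The point of the sketch is the division of labor: the recurrent weights `W` stay fixed and random, and learning touches only the readout `w`.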

All three networks are adjusted so that their performance (at categorizing the frequency of a second tactile vibration stimulus as “lower” or “higher” than the frequency stored in memory) matches that of the monkeys (around 95%).

Results

The setup leads us to expect that the TRAIN network should do best here, and that’s more or less what happens.  The TRAIN model occupies a kind of middle ground between the stereotyped responses of the attractor model and the wild-and-crazy responses of the chaotic network:

Wild-and-craziness is quantified by the similarity between the response to the (1st) stimulus and the response during the delay period; for the attractor model these are perfectly correlated, whereas for the chaotic network they’re virtually uncorrelated after a small time delay. The TRAIN network is somewhere in between (compare to Fig 1):

Unfortunately, however, it’s not a Total Victory for any model, as TRAIN falls down on some other key metrics. (For example: the percent of the total response variance devoted to encoding the frequency of the first stimulus — identified by demixed PCA — is 28% for TRAIN, vs. 5% for the chaotic network, and only 2% for the neural data!).

So we haven’t quite cracked it, although the TRAIN approach seems promising.

Conclusion:

The paper makes for a fun, engaging read, and it nicely integrates a lot of ideas that don’t normally get to hang out together (attractor dynamics, reservoir computing, Hessian-free optimization, state space pictures of neural population activity). If I have one criticism, it’s that the role for the TRAIN network doesn’t seem clearly enough defined in comparison to the other two models.  On the one hand, it’s still just a sparse, randomly connected network: if a randomly connected network can already produce the behavior, then what’s the justification for training it? On the other hand, if we’re going to go to the trouble of training the network, why not train it to reproduce the responses of neurons actually recorded during the experiment (i.e., instead of just training it to produce the behavior)? Surely if the model has a rich enough dynamical repertoire to produce the behavior and match the response dynamics of real neurons, this would be a useful thing to know (i.e., a nice existence proof). But the fact that this particular training procedure failed to produce a network with matching response types seems harder to interpret.

More broadly, it seems that we probably need a richer dataset to say anything definitive about the neural dynamics underlying short-term memory (e.g., with variable time delay between first and second stimulus, and with a greater range of task difficulties.)

Intriguing failure and compelling future direction:

The authors point out that no model accounted for an observed increase in the number of tuned responses toward the end of the delay period. They suggest that we might need a model with synaptic plasticity to account for this effect:

“These surprising finding could indicate the extraction of information about the stimulus from a synaptic form (selective patterns of short-term synaptic facilitation) into a spiking form. It remains a challenge to future work to develop echo-state random networks and TRAIN networks with short-term synaptic plasticity of recurrent connections.”

References

1. Omri Barak, David Sussillo, Ranulfo Romo, Misha Tsodyks, and L.F. Abbott. “From fixed points to chaos: Three models of delayed discrimination”. Progress in Neurobiology, 103(0):214–222, 2013. Conversion of Sensory Signals into Perceptions, Memories and Decisions.
2. Christian K Machens, Ranulfo Romo, and Carlos D Brody. “Flexible control of mutual inhibition: a neural model of two-interval discrimination.” Science, 307(5712):1121–1124, 2005.
3. Valerio Mante, David Sussillo, Krishna V Shenoy, and William T Newsome. “Context-dependent computation by recurrent dynamics in prefrontal cortex”. Nature, 503(7474):78–84, 2013.
4. Mattia Rigotti, Omri Barak, Melissa R Warden, Xiao-Jing Wang, Nathaniel D Daw, Earl K Miller, and Stefano Fusi. “The importance of mixed selectivity in complex cognitive tasks.” Nature, 497(7451):585–590, 2013.
5. Kaufman, M. T.; Churchland, M. M.; Ryu, S. I. & Shenoy, K. V. “Cortical activity in the null space: permitting preparation without movement”. Nat Neurosci 2014.
6. David Raposo, Matthew T Kaufman, and Anne K Churchland. “A category-free neural population supports evolving demands during decision-making.” Nature Neuroscience, 2014.
7. Il Memming Park, Miriam LR Meister, Alexander C Huk, and Jonathan W Pillow. “Encoding and decoding in parietal cortex during sensorimotor decision-making.” Nature Neuroscience, 17:1395–1403, 2014.
8. Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks. German National Research Center for Information Technology GMD Technical Report, 148:34, 2001.
9. Wolfgang Maass, Thomas Natschläger, and Henry Markram. “Real-time computing without stable states: A new framework for neural computation based on perturbations.” Neural Computation, 14:2531–2560, 2002.
10. David Sussillo and L. F. Abbott. “Generating coherent patterns of activity from chaotic neural networks”. Neuron, 63(4):544–557, 2009.

# Dynamic mode decomposition for large-scale neuron activity

On March 23rd, I presented the following paper in lab meeting:

The purpose of the paper is to describe dynamic mode decomposition (DMD), a method from applied mathematics for dimensionality reduction of dynamical systems, and adapt it to the analysis of large-scale neural recordings.

Here’s the basic setup.  Consider measurements of an $n$-dimensional dynamical system at $m$ time points:

$X=[x_1,x_2,...,x_{m-1}]$
$X'=[x_2,x_3,...,x_{m}]$,

where $x_t$ is the $n$-dimensional state vector at time $t$ (e.g., the spike count of $n$ neurons in a single time bin). DMD attempts to describe the map from $x_t$ to $x_{t+1}$ with a low-rank linear dynamics matrix $A$, so that $X'\approx AX$.  We may think of DMD as high-dimensional regression of $X'$ against $X$, followed by eigendecomposition of the regression weights $A$. For long time series, one may use a sliding window to model nonlinear dynamics by approximating them with a linear dynamics matrix $A$ within each time window.

Through some algebraic tricks, DMD arrives at a simple dynamic model $\hat{x}(t)=\Phi\Lambda^t z_0$, where $\Phi$ and $\Lambda$ hold the eigenvectors and eigenvalues of $A$, and $z_0$ satisfies the initial condition. The magnitude of mode $\phi_i$ represents spatial correlations between the $n$ observed locations. The eigenvalue $\lambda_i$ governs the temporal dynamics of the spatial mode $\phi_i$: specifically, the mode's rate of growth/decay and frequency of oscillation are reflected in the magnitude and phase components of $\lambda_i$, respectively. Therefore, DMD can be considered a hybrid of static mode extraction by principal components analysis (PCA) in the spatial domain and spectral transformation in the frequency domain (the Fourier transform).
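A minimal sketch of the standard “exact DMD” algorithm on synthetic data (my own toy example; the rank and oscillation frequencies are arbitrary choices):

```python
import numpy as np

def dmd(X, Xp, r):
    """Exact DMD: fit X' ~ A X with a rank-r A; return eigenvalues and modes."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]          # rank-r truncation
    A_tilde = U.T @ Xp @ Vh.T / s               # A projected onto the top-r subspace
    lam, W = np.linalg.eig(A_tilde)             # DMD eigenvalues
    Phi = Xp @ Vh.T / s @ W                     # DMD modes (exact-DMD convention)
    return lam, Phi

# toy data: four random spatial patterns oscillating at two frequencies
rng = np.random.default_rng(4)
n, m = 30, 200
t = np.arange(m)
p = rng.standard_normal((4, n))
data = (np.outer(p[0], np.cos(0.3 * t)) + np.outer(p[1], np.sin(0.3 * t))
        + np.outer(p[2], np.cos(1.1 * t)) + np.outer(p[3], np.sin(1.1 * t)))

lam, Phi = dmd(data[:, :-1], data[:, 1:], r=4)

# the phases of the eigenvalues recover the two oscillation frequencies
freqs = np.sort(np.abs(np.angle(lam)))
assert np.allclose(freqs, [0.3, 0.3, 1.1, 1.1], atol=1e-6)
```

Each oscillation appears as a complex-conjugate eigenvalue pair on the unit circle, with the associated columns of `Phi` giving the spatially correlated patterns.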

One crucial point the paper emphasizes is the DMD spectrum. Since the phases of the eigenvalues reflect the oscillation frequencies of the modes, one can plot the DMD spectrum $\left(P_i=\|\phi_i\|^2\right)$ as a function of frequency $\left(f_i=\frac{\mbox{imag}(\log\lambda_i)}{2\pi\Delta t}\right)$ and compare it to the Fourier spectrum of the raw data. The paper shows that the DMD spectrum qualitatively resembles the Fourier power spectrum, but with each point corresponding to a spatially correlated mode instead of a raw recording channel.

The authors carried out several analyses based on the DMD spectrum. First, they validated the DMD approach by deriving sensorimotor maps from a simple movement task. Next, they leveraged DMD in combination with machine learning techniques to detect and characterize spindle networks present during sleep. More specifically, they slid a window over the entire raw recording; within each window, they computed the DMD spectrum, picked the modes whose power exceeded a $1/f^\alpha$ distribution curve, and collected all such modes into a library. Finally, the modes in the library were clustered with a Gaussian mixture model, whose centroids serve as stereotypes of the spindle networks. The results look quite tidy and beautiful, but some ambiguities remain about the experimental setting and design.

In summary, this paper shows how DMD can be used to extract coherent patterns by decomposing vector time-series data into a low-dimensional representation in both space and time, and that it can be applied to large-scale neural recordings (ECoG data). However, it’s slightly unclear how novel DMD is relative to existing methods used in engineering and neuroscience. Mathematically, DMD seems to correspond to a spectral method for estimating a low-dimensional latent linear dynamical systems model (i.e., a Kalman filter model) with both the “innovations” noise and the observation noise set to zero.  DMD was originally introduced in fluid physics, where one seeks to characterize the behavior of a (deterministic) nonlinear dynamical system described by a PDE. It would therefore be interesting to flesh out the connections between DMD and the Kalman filter more explicitly, and to examine how the two approaches might be combined or extended to understand the dynamical structure in large-scale neural activity.

# Code released for “Binary Pursuit” spike sorting

At long last, I’ve finished cleaning, commenting, and packaging up code for binary pursuit spike sorting, introduced in our 2013 paper in PLoS ONE.  You can download the Matlab code here (or on github), and there’s a simple test script to illustrate how to use it on a simulated dataset.

The method relies on a generative model (of the raw electrode data) that explicitly accounts for the superposition of spike waveforms. This allows it to detect synchronous and overlapping spikes in multi-electrode recordings, which clustering-based methods (by design) fail to do.

If you’d like to know more (but don’t feel like reading the paper), I wrote a blog post describing the basic intuition (and the cross-correlation artifacts that inspired us to develop it in the first place) back when the paper came out (link).

# Subunit models for characterizing responses of sensory neurons

On July 28th, I presented the following paper in lab meeting:

This paper proposes a new method for characterizing the multi-dimensional stimulus selectivity of sensory neurons. The main idea is that, instead of thinking of neurons as projecting high-dimensional stimuli into an arbitrary low-dimensional feature space (the view underlying characterization methods like STA, STC, iSTAC, GQM, MID, and all their rosy-cheeked cousins), it might be more useful / parsimonious to think of neurons as performing a projection onto convolutional subunits. That is, rather than characterizing stimulus selectivity in terms of a bank of arbitrary linear filters, it might be better to consider a subspace defined by translated copies of a single linear filter.

Fig 1 illustrates the rationale rather beautifully:

Panel (a) shows a bank of subunit filters — translated copies of a single filter (here scaled by the weighting of each subunit). The output of each subunit filter passes through a point nonlinearity, and the resulting values are summed to form the neural output. Panel (b) gives the corresponding geometric picture: the arrows labeled $\vec k_1, \ldots, \vec k_4$ indicate vectors defined by the subunit filter shifted to positions 1, …, 4, respectively. Note that these vectors lie on a hyper-sphere (since translations do not change the length of a vector). The span of these vectors defines a multi-dimensional space of stimuli that will excite the neuron (assuming their projection is positive, so perhaps we should say half-space). But if we were to attempt to characterize this space using spike-triggered covariance analysis (panel c), we would obtain a basis defined by a collection of orthonormal vectors. (The slightly subtle point being made here is that the number of vectors we can recover grows with the size of the dataset — so in the bottom plot with $10^6$ datapoints, there are four “significant” eigenvalues corresponding to a four-dimensional subspace). The four eigenvectors $\vec e_1, \ldots, \vec e_4$ span the same subspace as $\vec k_1, \ldots \vec k_4$, but they have been orthogonalized, and their relationship to the original subunit filter shape is obscured.

If you think about it, this picture is devastating for STC analysis. The subunit model gives rise to a 4-dimensional feature space with a single filter shifted to 4 different locations (which requires $n + 4$ parameters), whereas STC analysis requires four distinct filters ($4n$ parameters). But the real calamity comes in trying to estimate the nonlinear mapping from feature space to spike rate. For STC analysis, we need to estimate an arbitrary nonlinear function in 4 dimensions. That is, we need to go into the 4-dimensional space spanned by filters $\vec e_1, \ldots, \vec e_4$ and figure out how much the neuron spikes for each location in that 4D hypercube.  In the subunit model, by contrast, the nonlinear behavior is assumed to arise from a (rectifying) scalar nonlinearity applied to the output of each subunit filter. We need only a single 1D nonlinearity and a set of 4 weights.

So clearly, if a neuron’s response is governed by a linear combination of nonlinear subunits, it’s crazy to attempt to go about characterizing it with STC analysis.

This idea is not entirely new, of course. Nicole Rust’s 2005 paper suggested that the derivative-like filters revealed by spike-triggered covariance analysis of V1 responses might be a signature of nonlinear subunits (Rust et al, Neuron 2005). Kanaka Rajan & Bill Bialek made a related observation about STC analysis (Rajan & Bialek 2013), noting that many of the simple computations we might expect a neuron to perform (e.g., motion selectivity) lead to relatively high-dimensional feature spaces (as opposed to the low-dimensional subspaces expected by STC). Tatyana Sharpee’s lab has introduced a method for characterizing translation-invariant receptive fields, inspired by the maximally-informative-dimensions estimator (Eickenberg et al 2012). And of course, the idea of nonlinear subunits is much older, going back to at least 1960s work in retina (e.g., Barlow & Levick 1965).

The Vintch et al paper describes a particular method for characterizing the subunit model, which they formulate as a linear-nonlinear-linear (LNL) model from stimulus $\vec x$ to response $r$ given by:

$r = \sum_i w_i f_{\theta}(\vec k_{(i)} \cdot \vec x) + \mbox{noise}$

where $\vec k_{(i)}$ denotes the subunit filter $\vec k$ shifted to the $i$‘th stimulus position, $f_\theta(\cdot)$ is the subunit nonlinearity, and $w_i$ is the weight on the output of the $i$‘th subunit.  I won’t belabor the details of the fitting procedure, but the results are quite beautiful, showing that one can obtain substantially more accurate prediction of V1 responses (with far fewer parameters!) using a model with separate populations of excitatory and inhibitory subunits.
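A minimal sketch of this forward model (the rectify-and-square nonlinearity and all the sizes here are my assumptions for illustration, not the fitted model from the paper):

```python
import numpy as np

def subunit_response(x, k, w, f=lambda z: np.maximum(z, 0) ** 2):
    """LNL subunit model: r = sum_i w_i f(k_(i) . x), where k_(i) is the shared
    subunit filter k shifted to position i (a valid-mode correlation computes
    the dot product with every shifted copy at once)."""
    drive = np.correlate(x, k, mode="valid")    # k_(i) . x for every shift i
    return w @ f(drive)

rng = np.random.default_rng(5)
k = np.array([-1.0, 2.0, -1.0])     # one shared subunit filter
x = rng.standard_normal(10)         # stimulus
w = np.ones(8)                      # 10 - 3 + 1 = 8 subunit positions

r = subunit_response(x, k, w)
assert r >= 0                       # rectified-squared subunits are non-negative
```

Note the parameter economy the paper highlights: this model needs only `len(k) + len(w)` numbers (plus the nonlinearity), versus eight distinct length-10 filters for an unconstrained feature-space model of the same dimensionality.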

We finished lab meeting with a brief discussion of some new ideas I’ve been working on with Anqi and Memming, aimed at developing fast, scalable methods for fitting high-dimensional subunit models. So watch this space, and hopefully we’ll have something new to report in a few months!

References

1. Efficient and direct estimation of a neural subunit model for sensory coding. Vintch, Zaharia, Movshon, & Simoncelli, NIPS 2012.
2. Spatiotemporal elements of macaque V1 receptive fields.  Rust, Schwartz, Movshon & Simoncelli, Neuron 2005.
3. Maximally Informative “Stimulus Energies” in the Analysis of Neural Responses to Natural Signals.  Rajan & Bialek, PLoS ONE 2013.
4. Characterizing Responses of Translation-Invariant Neurons to Natural Stimuli: Maximally Informative Invariant Dimensions. Eickenberg,Rowekamp, Kouh, & Sharpee.  Neural Comp., 2012.
5. The mechanism of directionally selective units in rabbit’s retina. Barlow & Levick, J Physiol 178(3):1965.