How can single neurons predict behavior?

Pitkow et al., Neuron, 87, 411-423, 2015

A couple of weeks ago I presented Xaq Pitkow et al.’s paper examining the convoluted relationship between choice probabilities (CP), information-limiting correlations (ILC), and suboptimal coding.

The paper is prefaced by two observations:

  1. Optimal linear decoding theory suggests that the CPs should decrease as the number of neurons in a pool increase.  This can be true even if noise correlations are non-zero. This would imply that, given the enormous number of neurons that are in the brain, that CPs should be extraordinarily tiny. Yet, single neurons often exhibit significant CPs.
  2. Single neurons have detection thresholds that are not much greater (or can actually exceed) psychophysical thresholds. Therefore, if the animal has access to individual neurons with this sensitivity to the stimulus, then why wouldn’t pooling across many of them make behavior reflective of even lower detection thresholds?

The claim is that optimal decoding strategies can conspire with ILC to make these observations plausible under conditions observed in the brain.

In order to understand these claims, let’s define a few quantities. The linear estimate of the stimulus with respect to some reference value of the stimulus s_0 is given by

\hat{s} = w^T(r-f(s_0)) + s_0

where w is the vector of decoding weights, r is the vector of neuronal responses to stimulus s, and f(s) is the vector of tuning curves evaluated at s.

If the noise covariance matrix is given by \Sigma , then the optimal linear decoder is given by

w = \frac{\Sigma^{-1}f'}{f'\Sigma^{-1}f'},

where f' is the derivative of f with respect to s. The variance (or rather, the Cramer-Rao bound) of this estimator is \sigma^2_s = w^T\Sigma w = (f'^T\Sigma^{-1}f')^{-1}. This expression is important because it tells us the minimum achievable estimator variance for any linear estimator, no matter what the correlation structure is.

The key ingredient here is information limiting correlations (to find out what these are I would read this paper by Moreno-Bote et al. 2014). If the noise correlations display ILC then the noise covariance can be written in the form \Sigma = \Sigma_0+\epsilon f'f'^T, allowing us to write the effective optimal decoding error \sigma^2 = w^T(\Sigma_0+\epsilon f'f'^T)w=\sigma^2_0+\epsilon, where \sigma^2_0 is the part that decays with the number of neurons, and \epsilon is the part that does not decay with number of neurons. Interestingly, the optimal choice correlation C_k, a quantity related to CPs introduced by the authors, is given by

C_k = \frac{(\sigma^2_0 + \epsilon)\sigma^2_k}{f'_k}

where \sigma^2_k is the kthdiagonal element of \Sigma and f'is the kth element of f'.  This expression indicates that ILCs actually increase choice correlations (and by extension, CPs) to an extent that is not correctable by optimal decoding (even though there are noise correlations that are correctable) and that there will be non-zero choice correlations even in the limit of infinite population size.  This is a profound observation because it not only makes an important connection between the form of of correlations between neurons and the ability to of the animal to decode the stimulus, but it simultaneously predicts that the activity of individual neurons will be predictive of the animals choice, regardless of the number of neurons involved in making the choice!

The stronger prediction of the paper is also a bit harder to follow, and is a bit less intuitive, so I’ll just outline the main points.  Suppose there are two populations (x and y) of neurons that both receive information about the stimulus.  If only x is used by the animal for decision making, but only y is recorded, then y will have choice correlations that are proportional to, but large than x. No, I didn’t get that wrong, the population that is not used for behavior has larger choice correlations. The authors also make claim that they can use choice correlation to distinguish between this scenario and one in which a population that is used for behavior is decoded sub-optimally.

It seems to me that almost any case in which the entire population is not recorded, but that has ILCs, would display this kind of behavior, so I’m not entirely sure how to interpret the claim that there is actually a non-decoded and a decoded population and that we can distinguish between them.  But maybe I missed something.

The paper was a very interesting unification of ILCs, choice probabilities, and optimal linear decoding. I’m sure we’ll be seeing follow-ups from this paper.


Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

The paper we discussed this week was

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, James L. McClelland and Surya Ganguli

This work aims to start analyzing a gnawing question in machine learning: How do deep neural networks actually work? In particular, for a given set of input-output pairs \{x^\mu,y^\mu \}, how do the network weights evolve, and how fast can the network converge? To this end, the authors take a close look at the simplest form of deep networks, deep linear networks, and describe the dynamics on learning the network weights over time. While a simple case, the intuition they build for the linear case gives the authors a way to discuss intelligent ways of initializing deep networks. The ensuing discussion about network initialization focuses on a kind of variance-invariance over network layers, a concept which the authors also discuss in the concept of nonlinear networks.

The paper is roughly divided into three sections. The first third of the paper discusses an exact analysis of the learning process for a linear neural network with one hidden layer. The second third of this paper extends the ideas and intuition for a single hidden layer to networks with multiple hidden layers, and further develop the idea of greedy network initialization based on the network dynamics. In the final section of the paper the authors look at the properties each of greedy and random network initialization and discuss the implications for learning rates in both linear and nonlinear networks.

For the first third of the paper, the network being analyzed consists of an input layer x\in\mathbb{R}^{N_1} , output layer y\in\mathbb{R}^{N_3} , and hidden layer h\in\mathbb{R}^{N_2} . The weights connecting layer one to layer two are W^{21} \in\mathbb{R}^{N_2,N_1} and the weights connecting layer two to layer three are W^{32} \in\mathbb{R}^{N_3,N_2}. In a linear network the back-propagation learning rule simply projects the error of each training sample down the network. Taking the limit of small step sizes on the update rule and the expectation of the step with respect to the training set, the authors are able to write the learning rule as the set of differential equations:

\tau\frac{dW^{21}}{dt}={W^{32}}^T\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right),\qquad\tau\frac{dW^{32}}{dt}=\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right){W^{21}}^T ,

where \tau is a constant, \Sigma^{11} = E[xx^T] is the input covariance matrix and \Sigma^{31} = E[yx^T] is the input-output cross-covariance matrix. To simplify the analysis, the authors assume appropriate whitening to at the input to set \Sigma^{11} = I and replace \Sigma^{31} with its SVD \Sigma^{31} = USV^T. This replacement allows the authors to analyze a simpler set of equations for an equivalent network that operates in the principal directions of \Sigma^{31}. This equivalent set of network weights can be written in terms of \overline{W}^{21} = W^{21}V = [a^1,a^2... a^\alpha, ... a^{N_2}] and \overline{W}^{32} = UW^{32}= [b^1,b^2... b^\alpha, ... b^{N_2}]^T. By writing an equivalent set of differential equations in terms of the a‘s and b‘s, we see an interesting phenomenon: the resulting equations essentially solve the cost function

\frac{1}{2\tau}\left[ \sum_{\alpha} (s_\alpha - {a^\alpha}^T b^{\alpha})^2 + \sum_{\alpha\neq\beta}{a^\alpha}^T b^{\beta} \right]

This cost function is essentially saying that the \alpha^{th} row-column couple \{a^\alpha,b^\alpha\} is trying to represent the \alpha^{th} mode of the cross covariance matrix $s_\alpha$, while simultaneously trying to be as orthogonal as possible from all other sets of row-column pairs. For a typical matrix, this is essentially trying to find the best rank N_2 approximation to the cross covariance matrix (i.e. a dynamical system whose solution should be the SVD of \Sigma^{31}). The authors continue on to find closed form solutions for very simple cases of N_2 < \min(N_1,N_2) and a^{\alpha} = b^{\alpha} \propto r^{\alpha} . While exact solutions for the values of proportionality are derived over time (i.e. the learning rate), the entire argument only holds for when r^{\alpha} are pre-defined. Since these vectors cannot change naturally once initialized, I feel that a lot of the analysis would be difficult to move to a general case. Additionally, the analysis depends heavily on initializing the network to have orthogonal pairs of \{a,b\}‘s: something that is impossible when the hidden layer is overcomplete: a common property of the oft-used convolutional neural networks. As such, the intuition built via this analysis allows the authors to continue onto deeper neural networks.

For deep neural networks, the mathematical details for reducing and analyzing the network learning dynamics becomes more complicated. To reduce the network analysis into an analysis of the rows and columns of a rotated set of weight matrices requires another set of rather stringent assumptions: that the right singular vectors of each weight matrix and match the left singular vectors of the weight matrix from the previous layer. Mathematically, if W^l = U^lS^l{V^l}^T, then V^l = U^{l-1}. The first and last set of weights are again modified via the left and right singular vectors of the input-output cross-covariance matrix. This additional assumption then allows them to again reduce the learning rules to an update on a set of proportionality variables (i.e. every layer has modes proportional to r^\alpha). The authors then derive similar learning curves as in the single hidden layer case.

The authors also use the intuition from these analyses to discuss the virtues of greedy and random initialization of the network weights. For greedy methods they discuss a greedy method from the literature where the network is pre-trained in a two-step process. First, an auto-encoder is used to train the network to predict its own input. Second, the network is fine tuned to adapt to the desired output. The authors note that this procedure produces an input weight matrix proportional to the right singular vectors of the \Sigma^{31}, meaning that the subspace chosen for the problem at hand is correct prior to learning. This implies that the subsequent learning will be very fast since the network does not need to adapt the principal directions, but rather only the strength of each layer.

As an alternative, the authors also discuss smarter random network initialization: using random orthogonal matrices over random Gaussian matrices. The intuition here comes from the fact that preservation of statistics across layers can imply faster learning. Alternatively, if a network layer does not preserve the total variance of the inputs from the previous layer, vital information could be lost either in terms of the lack of representation power at the output or by the lack of ability for gradients to effectively modify lower levels of the network. Gaussian matrices are particularly terrible in this regard since they are almost guaranteed to have many small singular values. This implies that many vectors, either coming up or down the network, will be severely attenuated, hindering learning. Random orthogonal matrices are guaranteed to preserve the norms of vectors, and consequently are invaluable in this regard. While interesting in its own right, this observation allows the authors to begin to talk about nonlinear networks. This is because the authors state that the greedy methods for initialization might not carry over well to the nonlinear case. In particular the greedy initialization observations depended on the relationship between the SVD of \Sigma^{31}, which in turned was intuition gleaned from the update equations for linear networks. Nonlinear networks have a much more complex set of dynamics.

Using the ideas of energy preservation and random orthogonal matrices, the authors discuss what it means for a nonlinear network to preserve information across network layers. To accomplish this, the authors present the concept of isometry called the dynamical isometry. Isometries in general discuss the preservation of distances across a function, and the dynamical isometry is particular to discussing the preservation of distances of the back-propagation functions in a deep network. Specifically, the authors describe DI as an isometry condition on the “product of Jacobians associated with error signal back-propagation”.

For linear networks, random orthogonal matrices automatically preserve the norms of vectors, so dynamical isometry follows directly. For nonlinear cases, the authors analyze networks with random orthogonal weights, tanh nonlinearities and a scaling factor g at each layer. The authors plot empirical distributions for the singular values of the Jacobian matrix as a function of the gain at each layer g and the strength of the input layer q = \frac{1}{N_1}\sum_i^{N_1} {x^1_i}^2. For g<1, the singular values for the Jacobian are very small (less than 1% of the input variance) and therefore the network does not act as an isometry and learning will be slow. Similarly, when g\geq 1.1, there are some large singular values, which look troubling to me. The authors mention that this condition is still better than random Gaussian weights. Right around g = 1 (what the authors term the “edge of chaos”) the singular values mostly cluster around one, indicating good learning properties.

Overall, I thought that this paper begins to ask the right questions about what properties of neural networks helps them learn. I’m a little skeptical that the analysis of the linear case will yield much for more general classes of networks actually in use, but I think the concepts that the authors are discussing are going to be very important. In particular, I’d be excited to see what comes of this dynamical isometry.

Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.

This paper describes a class of latent variable models inspired by deep neural net and hierarchical generative model, called Deep Exponential Families (DEFs). DEFs stack multiple layers of exponential families and connects them with certain link functions to capture the hierarchy of dependencies.

Exponential families have a general form p(x|\eta)=h(x)\mbox{exp}(\eta^T T(x)-a(\eta)), where h(x) is a base measure, T(x) is a vector of sufficient statistics, \eta is the natural parameter and a(\eta) is the log-normalizer. An attractive property of exponential family distributions is that \mathbb{E}[T(x)]=\nabla_{\eta}a(\eta), which will be exploited in DEFs later.

Imitating deep neural net, these models stack multiple exponential family distributions in a hierarchical structure:

Screen Shot 2015-09-24 at 1.27.43 PM

Here, z_l is the latent variable vector at layer l. K_l is the number of variables in the layer l. W_l is the weight matrix connecting layer l+1 and l. The entire generative direction is from top to bottom. The top layer’s latent variables are generated from a prior distribution p(z_L)=\mbox{EXPFAM}_L(z_L,\eta). Then the conditional distribution between layer l+1 and l is p(z_l|z_{l+1},W_l)=\mbox{EXPFAM}_l(z_l,g_l(z_{l+1}^T W_l)). The bottom layer is for observation likelihood which is Poisson in this paper, p(x|z_1,W_0)=\mbox{Poisson}(z_1^T W_0). The link function g_l maps the inner product z_{l+1}^T W_l to the natural parameter to generate z_l. Because the expected sufficient statistics are equal to the gradient of the log normalizer \mathbb{E}[T(z_l)]=\nabla_{\eta_l}a(g_l(z_{l+1}^T W_l)), where \eta_l=g_l(z_{l+1}^T W_l), if we consider g_l(x)=x and T(z_l)=z_l, then \mathbb{E}[z_l] is just the linear function of W_l transformed by \nabla_{\eta_l}a(\cdot), which is one source of non-linearity. The inference method for DEFs is called black box variational inference (Ranganath, et al 2014), which we didn’t have time to discuss.

The paper also provided three concrete examples specified from DEFs which are sparse gamma DEFs, sigmoid belief network and Poisson DEFs. For different exponential families, there are different natural link functions g_l(\cdot) and prior distributions for p(W) summarized as follows,

Screen Shot 2015-09-24 at 1.52.03 PM

They finally evaluated various DEFs on text and combined multiple DEFs into a model for pairwise recommendation data. In an extensive study, they showed that going beyond one layer improves predictions for DEFs. They demonstrated that DEFs find interesting exploratory structure in large data sets, and give better predictive performance than state-of-the-art models.

This paper is interesting because it proposed a general deep hierarchical structure for all kinds of exponential families that avoids complicated inference implementations for various hierarchical generative models, as well as uncovering insight of “dual” relationship between neural net nonlinearity functions and the link functions in Bayesian framework.

Every Neuron is Special

A couple of weeks ago I presented

A category-free neural population supports evolving demands during decision-making

by David Raposo, Matthew Kaufman and Anne Churchland.  By “categories” they are referring to some population of cells whose responses during an experiment seem to be dominated by one or two of the experimental variables. The authors refer to these types of categories as functional categories.

The data come from experiments in which rats were trained to discriminate between high and low rate pulses of both visual and auditory stimuli (see the figure below).  In some trials the rats were presented with only visual stimuli, some only auditory, and some both.  The rats were presented a stimulus and then had to make a choice to make a nose poke to either the left or the right to indicate whether they thought the stimulus was high-rate or low-rate (with “high” and “low” determined by a threshold set by the experimenter during training).

The paper can be fairly clearly delineated into two parts. In part I, the authors use pharmacological techniques to demonstrate a causal role of posterior parietal cortex (PPC) in the task. The experiments can be summed up by this Figure:


The rats, predictably, were able to do a better job of discriminating high-rate from low-rate pulses when both sensory modalities were present. Furthermore, they showed rather convincingly that PPC was involved almost exclusively in the processing of the visual stimulus.

The second part of the paper, in my view, is the real meat of the paper. This part is comprised of a suite of statistical analyses that pick apart the population data to find evidence for functional catagories.

The authors did a great job illustrating this part of the paper with figures.  For example, here is a figure of PSTH’s that points out the way that many people still think about neurons involved in this kind of task:


Solid lines are PSTHs for trials where the stimulus rate was high, dashed lines were for low-rate trials.  Blue are PSTHs for visual-only trials, green are for auditory-only trials. The neurons here demonstrate selectivity for a) choice, b) modality, c) mixtures, d) something dynamic. In the past, the way that one might study these cells is to identify those cells in a) and b), using some selectivity metric, and then declare them a special class of cells that are selective for the choice of the animal and the stimulus, respectively.

However, check out the other half of the same figure:


In (f-h) the authors are showing us histograms of their measure of choice preference (f,g) or modality preference (h) over all cells.  The shaded regions are histograms of the cells for which this metric is significantly > 0. We can easily see from this figure that these “selective” cells (in the shaded regions) do not belong to special populations, but are simply in the tails of some distribution that is centered on 0. Their identity as “selective” is no more than an artifact of our classifying them based on a p-value.  Thus, most cells are more like the cells in panels c) and d) above, displaying some mixture of selectivity, or even some dynamic behavior which is perhaps not well summarized by the PSTH.

I know what you’re thinking.  “Well, maybe there is no special population of cells that are selective for one thing, so what about groups of cells that are selective for particular combination of things?”  In other words, are there cells that cluster with respect to their sensitivity to either the sensory modality, or the choice of the animal?  The answer is mostly given by the following figure:

Panel a) shows a plot of the choice and modality preference metric for each cell.  It is easy to see that there are no clusters.  Of course, the authors go much further in their analysis of clustering, using the top 8 singular vectors of the population PSTHs, and examining distributions of nearest-neighbor angles, and found no evidence of clustering, but I won’t get into that here.

Next, the authors fit separate SVMs to decode choice and modality.  The decoders could accurately decode choice and modality from the same trial, but the choice decoder was independent of the modality and vice versa. This indicates that the information about choice and modality are encoded in the same neuronal population, at the same time, but that these signals do not interfere with each other.


Finally, they did PCA on the population responses during two different phases of the experiment, during the decision phase, when the animal is collecting information about the stimulus, and during the movement phase, when the animal is executing its choice by reorienting its body to the chosen port.

They found that the movement space and the decision space were poorly aligned (see figure below). This observation is similar in spirit to an interesting paper (also by Matt Kauffman), describing why it may be useful for population trajectories to explore mis-aligned vector spaces while performing different functions.


In summary:

  1. PPC neurons are involved in vision-dependent decision making.
  2. These neurons display non-specific sensitivity to both the choice of the rats and the stimulus.
  3. Information about both choice and modality are independently encoded by the population and are available for linear read-out.
  4. The population of neurons explore different parts of state space at different times.

Together these findings reject the notion that each neuron in PPC is encoding something specific about the experiment, and that this mixed selectivity is a feature of the network.  The authors provide a well-articulated normative explanation for mixed selectivity:

Theoretical work suggests a major advantage for category-free populations: when parameters are distributed randomly across neurons, an arbitrary group of them can be linearly combined to estimate the parameter needed at a given moment[12–14]. This obviates the need for precisely prepatterned connections between neurons and their downstream targets and also means that all information is transmitted. This latter property could allow the same network to participate in multiple behaviors simply by using different readouts of the neurons.

which seems to be the situation born out by their data.

My only complaint about this paper is that, although I am totally on-board with this hypothesis, it seems to me that in some sense this paper is trying to prove a negative. The fact that they found no evidence for categories is not evidence for absence of categories. To be fair, they allude to other types of categories (cell type, connectivity) in the discussion. Here they are specifically referring to functional categories, which are defined a priori by the experimenter, which is the problem with functional categories to begin with.

Fast Kronecker trick for Gaussian Process regression with “expressive” kernels

On May 11th, I presented the following paper in lab meeting:

Fast Kernel Learning for Multidimensional Pattern Extrapolation
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai and John P. Cunningham

This paper presents a method for scaling up structured spectral mixture (SM) kernels (from Wilson et al 2013) for Gaussian Process regression to multi-dimensional settings in which many (but not all) of the input points live on a grid (e.g., a 2D image in which some pixels are missing).  The spectral mixture is a very expressive kernel that enhances representation learning, but the problem comes in applying it to large scale data.

To address this problem, the authors developed the spectral mixture product (SMP) kernel, which is just a multiplication of 1D kernels for each dimension. This allows the full kernel matrix for points defined on a P-dimensional grid to be represented with a Kronecker product of kernel matrices along each dimension:  K = k_1 \otimes \cdots \otimes k_P. This allows the quadratic term and the log-determinant term in the GP log-likelihood to be evaluated very efficiently using singular value decompositions of these component matrices, namely:  Screen Shot 2015-05-12 at 6.30.53 PM
where Q=Q^1\otimes Q^2\otimes ...\otimes Q^P (eigenvectors), and V=V^1\otimes V^2\otimes ...\otimes V^P (eigenvalues).

The key contribution of this paper is to show that if inputs are not on a grid — for example, some pixels from a training image are missing — one can insert “imaginary” points to complete the grid, and place infinite measurement noise on these observations so they have no effect on inference. Assuming there’s a dataset of M observations which are not necessarily on a grid, they form a complete grid using W imaginary observations, y_W\sim\mathcal{N}(f_W,\epsilon^{-1}I_W), \epsilon\rightarrow 0. The total observation vector y = [y_M, y_W ]^\top has N = M + W entries: y\sim\mathcal{N} (f, D_N), where the noise covariance matrix D_N =\mbox{diag}(D_M, \epsilon^{-1}I_W), D_M = \sigma^2 I_M. It’s simple to prove that the imaginary points have no effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution.

However, the relevant matrices are no longer have Kronecker structure due to the fact that the observations do not form a complete grid. The authors get around this using the conjugate gradient method, an iterative method for solving linear equations, to compute the quadratic term of the GP log-likelihood. They compute the log-determinant term using an approximation.

In particular, the authors use preconditioned conjugate gradients to compute (K_N + D_N)^{ -1} y by solving(K_N + D_N)x = y for x. For the log-determinant term, they use the approximation: Screen Shot 2015-05-12 at 6.41.33 PM

With these implementation tricks, they can scale up highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, to datasets exceeding N = 10^5 training instances, in minutes to 10s of minutes. They obtain many beautiful results, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery.

The take home message for me is that smart implementation tricks can allow GPs with expressive covariances to be applied to large scale problems with non-gridded input.

Integration dynamics and choice probabilities

Recently in lab meeting, I presented

Sensory integration dynamics in a hierarchical network explains choice probabilities in cortical area MT

Klaus Wimmer, Albert Compte, Alex Roxin, Diogo Peixoto, Alfonso Renart & Jaime de la Rocha. Nature Communications, 2015

Wimmer et al. reanalyze and reinterpret a classic dataset of neural recordings from MT while monkeys perform a motion discrimination task. The classic result shows that the firing rates of neurons in MT are correlated with the monkey’s choice, even when the stimulus is the same. This covariation of neural activity and choice, termed choice probability, could indicate sensory variability causing behavioral variability or it could result from top-down signals that reflect the monkey’s choice. To investigate the source of choice probabilities, the authors use a two-stage, hierarchical network model of integrate and fire neurons tuned to mimic the dynamics of MT and LIP neurons and compare the model to what they find in the data.


The figure above shows the two stages. They key manipulation the authors employ is to change the amount of feedback from the integration stage to the sensory stage. In their model, variability in the sensory stage can arise from 3 possible sources: 1, stimulus variability (red and blue squiggles that lead to E1 and E2 populations); 2, background noise (X); and 3, feedback signals (green bFB).

They find that changing the amount of feedback in this type of network makes several predictions that can be tested using existing data. They argue that feedback signals should account for choice probability (CP) later in the trial, which means the early part should be due to feedforward fluctuations in the stimulus. This allows them to take advantage of a stimulus manipulation in the data: some of the trials are perfect repeats of the same stimulus where as others have the same motion strength but are generated using different random seeds. They call these two conditions replicate and non-replicate, respectively. The model predicts that CP will be smaller for replicate trials during the early part of the motion stimulus because there are no stimulus fluctuations. The model also predicts that correlations between cell pairs will be smaller across the board for replicate stimuli (see the comparison between the black and gray traces in both subplots below).


The authors find that both of their predictions are true in the data they have access to. They analyze CP across 41 neurons with replicate stimuli an 118 neurons with non-replicate stimuli and look at correlations across 32 pairs. They find that neurons were less  variable with replicate stimuli (subpanel a below), that correlations in an example cell pair were smaller for replicate stimuli, and choice probability was lower earlier in the trial for replicate stimuli.


They also interrogated cell pairs to see if noise correlations were indicative of a feedback signal. They sub-selected pairs based on whether they had increasing or decreasing correlations through out the trial. The prediction is that increasing correlations means more feedback, so these cells should have higher CP later in the trial, which is exactly what they found.


The authors end by suggesting that the feedback could serve to strengthen the commitment to a category despite bursts of sensory evidence. Unfortunately, this makes the simulated observer less sensitive. Overall, the paper is a valiant attempt at understanding the source of choice probabilities using decades old data. The insights remain clear: the time-course of CP and noise correlations should be informative about top-down signals during decision-making. Hopefully new data will be available soon to help constrain the space of models.

Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares

Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014


In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

I’ve done some work on time-frequency representations in the past and often times an investigator expects to find signals that are narrow-band, but possibly amplitude-modulated, and are therefore limited in resolution by the time-frequency uncertainty principle.  However, when you formulate the problem of time-frequency representation as a Bayesian estimation problem, you can get a lot more bang for you buck.

The basic problem is to estimate the Fourier coefficients x_n at time step n for observations y_n described by

\bold{y}_n = \bold{F}_n \bold{x}_n + \bold{v}_n

where \bold{F}_n is the real-valued DFT matrix and \bold{v}\sim \mathcal{N}(0,\sigma^2).  The usual way of doing business is to let n be subsamples of the observation times and estimate x_n over some window centered on n. Smoothness in time is achieved by having a large overlap between neighboring windows.  Ba et. al take a different approach, which obviates the need for large overlap between windows, reducing the size of \bold{y}_n substantially for each time step.  The authors instead model the relationship between neighboring (in time) Fourier coefficients by

\bold{x}_n = \bold{x}_{n-1}+\bold{w}_n

where \bold{w}_n is drawn from some distribution p_W(w).

Thus, the trick is to bring prior information to bare in the least-squares problem, by specifying the structure of the solution that you expect through p_W(w).

The authors offered two possibilities:

 \log p_1(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}w^2_{n,k} + \epsilon^2\right)^{1/2} +c_1,


\log p_2(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}\left(w^2_{n,k} + \epsilon^2\right)^{1/2}\right)^{1/2} +c_2.

Now, I admit that these priors look kind of funny, but there is a good reason why these were the priors specified in this paper.  These priors are “epsilon-perturbed” \ell_\nu norms, given by

L(\bold{x}) = -\frac{1}{2}\sum^M_{i=1}(\epsilon^2+x_i^2)^{\nu/2}.

This”epsilon-perturbed” version of the \ell_\nu norms have rounded out corners in their level curves, like in the following picture from the authors’ earlier paper:


It turns out that distributions with this form come from the normal/independent family, commonly used for robust regression and which lend themselves to EM by IRLS!

Thus \log p_1 is approximately an \ell_1 prior over frequencies (indexed by k), and a \ell_2 prior over changes in x_n across time (indexed by n).  The latter can also be interpreted as a discrete version of the total variation norm.  The result is that \log p_1 biases the solutions toward things that are sparse in frequency but smooth in time.  Alternatively, \log p_2 is like a \ell_1 norm over frequencies, and a \ell_1 norm on transitions, also called a “fused LASSO”.  This prior will bias solutions toward things that are sparse infrequency and change abruptly in time.

The full MAP estimate of \bold{x}_n can be found by IRLS, where the algorithm ends up being repeated application of the Kalman filtering equations forward in time followed by the Kalman smoother equations backward in time until convergence. The results for a simulated example of a 10Hz and 11Hz component, where one has amplitude that is ramping up exponentially fast, and the other is pulsing, is shown in the Figure at the top of the page.  An example using real EEG data looks like this:


The authors didn’t really provide any validation other than the eye-ball test, and it wasn’t clear how to evaluate performance of the method for real data sets.  However, the results do look pretty good for the test cases they illustrated. I also would have liked to get more details about choosing window width and frequency resolution for data analysis.

I think that the application of the IRLS algorithm for solving this problem is really clever and there are a lot of possibilities for generalizing to different priors that, for example, accommodate frequency modulation. I hope to see some follow-up papers were they use the method to do some fun data analysis and maybe flesh out these details a bit more.