# Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

The paper we discussed this week was

# Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Andrew M. Saxe, James L. McClelland and Surya Ganguli

http://arxiv.org/pdf/1312.6120v3.pdf

This work aims to start analyzing a gnawing question in machine learning: How do deep neural networks actually work? In particular, for a given set of input-output pairs $\{x^\mu,y^\mu \}$, how do the network weights evolve, and how fast can the network converge? To this end, the authors take a close look at the simplest form of deep networks, deep linear networks, and describe the dynamics on learning the network weights over time. While a simple case, the intuition they build for the linear case gives the authors a way to discuss intelligent ways of initializing deep networks. The ensuing discussion about network initialization focuses on a kind of variance-invariance over network layers, a concept which the authors also discuss in the concept of nonlinear networks.

The paper is roughly divided into three sections. The first third of the paper discusses an exact analysis of the learning process for a linear neural network with one hidden layer. The second third of this paper extends the ideas and intuition for a single hidden layer to networks with multiple hidden layers, and further develop the idea of greedy network initialization based on the network dynamics. In the final section of the paper the authors look at the properties each of greedy and random network initialization and discuss the implications for learning rates in both linear and nonlinear networks.

For the first third of the paper, the network being analyzed consists of an input layer $x\in\mathbb{R}^{N_1}$, output layer $y\in\mathbb{R}^{N_3}$, and hidden layer $h\in\mathbb{R}^{N_2}$. The weights connecting layer one to layer two are $W^{21} \in\mathbb{R}^{N_2,N_1}$ and the weights connecting layer two to layer three are $W^{32} \in\mathbb{R}^{N_3,N_2}$. In a linear network the back-propagation learning rule simply projects the error of each training sample down the network. Taking the limit of small step sizes on the update rule and the expectation of the step with respect to the training set, the authors are able to write the learning rule as the set of differential equations:

$\tau\frac{dW^{21}}{dt}={W^{32}}^T\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right),\qquad\tau\frac{dW^{32}}{dt}=\left(\Sigma^{31}-W^{32}W^{21}\Sigma^{11}\right){W^{21}}^T$,

where $\tau$ is a constant, $\Sigma^{11} = E[xx^T]$ is the input covariance matrix and $\Sigma^{31} = E[yx^T]$ is the input-output cross-covariance matrix. To simplify the analysis, the authors assume appropriate whitening to at the input to set $\Sigma^{11} = I$ and replace $\Sigma^{31}$ with its SVD $\Sigma^{31} = USV^T$. This replacement allows the authors to analyze a simpler set of equations for an equivalent network that operates in the principal directions of $\Sigma^{31}$. This equivalent set of network weights can be written in terms of $\overline{W}^{21} = W^{21}V = [a^1,a^2... a^\alpha, ... a^{N_2}]$ and $\overline{W}^{32} = UW^{32}= [b^1,b^2... b^\alpha, ... b^{N_2}]^T$. By writing an equivalent set of differential equations in terms of the $a$‘s and $b$‘s, we see an interesting phenomenon: the resulting equations essentially solve the cost function

$\frac{1}{2\tau}\left[ \sum_{\alpha} (s_\alpha - {a^\alpha}^T b^{alpha})^2 + \sum_{\alpha\neq\beta}{a^\alpha}^T b^{beta} \right]$

This cost function is essentially saying that the $\alpha^{th}$ row-column couple $\{a^\alpha,b^\alpha\}$ is trying to represent the $\alpha^{th}$ mode of the cross covariance matrix $s_\alpha$, while simultaneously trying to be as orthogonal as possible from all other sets of row-column pairs. For a typical matrix, this is essentially trying to find the best $rank N_2$ approximation to the cross covariance matrix (i.e. a dynamical system whose solution should be the SVD of $\Sigma^{31}$). The authors continue on to find closed form solutions for very simple cases of $N_2 < \min(N_1,N_2)$ and $a^{\alpha} = b^{\alpha} \propto r^{\alpha}$. While exact solutions for the values of proportionality are derived over time (i.e. the learning rate), the entire argument only holds for when $r^{alpha}$ are pre-defined. Since these vectors cannot change naturally once initialized, I feel that a lot of the analysis would be difficult to move to a general case. Additionally, the analysis depends heavily on initializing the network to have orthogonal pairs of $\{a,b\}$‘s: something that is impossible when the hidden layer is overcomplete: a common property of the oft-used convolutional neural networks. As such, the intuition built via this analysis allows the authors to continue onto deeper neural networks.

For deep neural networks, the mathematical details for reducing and analyzing the network learning dynamics becomes more complicated. To reduce the network analysis into an analysis of the rows and columns of a rotated set of weight matrices requires another set of rather stringent assumptions: that the right singular vectors of each weight matrix and match the left singular vectors of the weight matrix from the previous layer. Mathematically, if $W^l = U^lS^l{V^l}^T$, then $V^l = U^{l-1}$. The first and last set of weights are again modified via the left and right singular vectors of the input-output cross-covariance matrix. This additional assumption then allows them to again reduce the learning rules to an update on a set of proportionality variables (i.e. every layer has modes proportional to $r^\alpha$). The authors then derive similar learning curves as in the single hidden layer case.

The authors also use the intuition from these analyses to discuss the virtues of greedy and random initialization of the network weights. For greedy methods they discuss a greedy method from the literature where the network is pre-trained in a two-step process. First, an auto-encoder is used to train the network to predict its own input. Second, the network is fine tuned to adapt to the desired output. The authors note that this procedure produces an input weight matrix proportional to the right singular vectors of the $\Sigma^{31}$, meaning that the subspace chosen for the problem at hand is correct prior to learning. This implies that the subsequent learning will be very fast since the network does not need to adapt the principal directions, but rather only the strength of each layer.

As an alternative, the authors also discuss smarter random network initialization: using random orthogonal matrices over random Gaussian matrices. The intuition here comes from the fact that preservation of statistics across layers can imply faster learning. Alternatively, if a network layer does not preserve the total variance of the inputs from the previous layer, vital information could be lost either in terms of the lack of representation power at the output or by the lack of ability for gradients to effectively modify lower levels of the network. Gaussian matrices are particularly terrible in this regard since they are almost guaranteed to have many small singular values. This implies that many vectors, either coming up or down the network, will be severely attenuated, hindering learning. Random orthogonal matrices are guaranteed to preserve the norms of vectors, and consequently are invaluable in this regard. While interesting in its own right, this observation allows the authors to begin to talk about nonlinear networks. This is because the authors state that the greedy methods for initialization might not carry over well to the nonlinear case. In particular the greedy initialization observations depended on the relationship between the SVD of $\Sigma^{31}$, which in turned was intuition gleaned from the update equations for linear networks. Nonlinear networks have a much more complex set of dynamics.

Using the ideas of energy preservation and random orthogonal matrices, the authors discuss what it means for a nonlinear network to preserve information across network layers. To accomplish this, the authors present the concept of isometry called the dynamical isometry. Isometries in general discuss the preservation of distances across a function, and the dynamical isometry is particular to discussing the preservation of distances of the back-propagation functions in a deep network. Specifically, the authors describe DI as an isometry condition on the “product of Jacobians associated with error signal back-propagation”.

For linear networks, random orthogonal matrices automatically preserve the norms of vectors, so dynamical isometry follows directly. For nonlinear cases, the authors analyze networks with random orthogonal weights, tanh nonlinearities and a scaling factor $g$ at each layer. The authors plot empirical distributions for the singular values of the Jacobian matrix as a function of the gain at each layer $g$ and the strength of the input layer $q = \frac{1}{N_1}\sum_i^{N_1} {x^1_i}^2$. For $g<1$, the singular values for the Jacobian are very small (less than 1% of the input variance) and therefore the network does not act as an isometry and learning will be slow. Similarly, when $g\geq 1.1$, there are some large singular values, which look troubling to me. The authors mention that this condition is still better than random Gaussian weights. Right around $g = 1$ (what the authors term the “edge of chaos”) the singular values mostly cluster around one, indicating good learning properties.

Overall, I thought that this paper begins to ask the right questions about what properties of neural networks helps them learn. I’m a little skeptical that the analysis of the linear case will yield much for more general classes of networks actually in use, but I think the concepts that the authors are discussing are going to be very important. In particular, I’d be excited to see what comes of this dynamical isometry.

# Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.
http://arxiv.org/abs/1411.2581

This paper describes a class of latent variable models inspired by deep neural net and hierarchical generative model, called Deep Exponential Families (DEFs). DEFs stack multiple layers of exponential families and connects them with certain link functions to capture the hierarchy of dependencies.

Exponential families have a general form $p(x|\eta)=h(x)\mbox{exp}(\eta^T T(x)-a(\eta))$, where $h(x)$ is a base measure, $T(x)$ is a vector of sufficient statistics, $\eta$ is the natural parameter and $a(\eta)$ is the log-normalizer. An attractive property of exponential family distributions is that $\mathbb{E}[T(x)]=\nabla_{\eta}a(\eta)$, which will be exploited in DEFs later.

Imitating deep neural net, these models stack multiple exponential family distributions in a hierarchical structure:

Here, $z_l$ is the latent variable vector at layer $l$. $K_l$ is the number of variables in the layer $l$. $W_l$ is the weight matrix connecting layer $l+1$ and $l$. The entire generative direction is from top to bottom. The top layer’s latent variables are generated from a prior distribution $p(z_L)=\mbox{EXPFAM}_L(z_L,\eta)$. Then the conditional distribution between layer $l+1$ and $l$ is $p(z_l|z_{l+1},W_l)=\mbox{EXPFAM}_l(z_l,g_l(z_{l+1}^T W_l))$. The bottom layer is for observation likelihood which is Poisson in this paper, $p(x|z_1,W_0)=\mbox{Poisson}(z_1^T W_0)$. The link function $g_l$ maps the inner product $z_{l+1}^T W_l$ to the natural parameter to generate $z_l$. Because the expected sufficient statistics are equal to the gradient of the log normalizer $\mathbb{E}[T(z_l)]=\nabla_{\eta_l}a(g_l(z_{l+1}^T W_l))$, where $\eta_l=g_l(z_{l+1}^T W_l)$, if we consider $g_l(x)=x$ and $T(z_l)=z_l$, then $\mathbb{E}[z_l]$ is just the linear function of $W_l$ transformed by $\nabla_{\eta_l}a(\cdot)$, which is one source of non-linearity. The inference method for DEFs is called black box variational inference (Ranganath, et al 2014), which we didn’t have time to discuss.

The paper also provided three concrete examples specified from DEFs which are sparse gamma DEFs, sigmoid belief network and Poisson DEFs. For different exponential families, there are different natural link functions $g_l(\cdot)$ and prior distributions for $p(W)$ summarized as follows,

They finally evaluated various DEFs on text and combined multiple DEFs into a model for pairwise recommendation data. In an extensive study, they showed that going beyond one layer improves predictions for DEFs. They demonstrated that DEFs find interesting exploratory structure in large data sets, and give better predictive performance than state-of-the-art models.

This paper is interesting because it proposed a general deep hierarchical structure for all kinds of exponential families that avoids complicated inference implementations for various hierarchical generative models, as well as uncovering insight of “dual” relationship between neural net nonlinearity functions and the link functions in Bayesian framework.

# Every Neuron is Special

A couple of weeks ago I presented

A category-free neural population supports evolving demands during decision-making

by David Raposo, Matthew Kaufman and Anne Churchland.  By “categories” they are referring to some population of cells whose responses during an experiment seem to be dominated by one or two of the experimental variables. The authors refer to these types of categories as functional categories.

The data come from experiments in which rats were trained to discriminate between high and low rate pulses of both visual and auditory stimuli (see the figure below).  In some trials the rats were presented with only visual stimuli, some only auditory, and some both.  The rats were presented a stimulus and then had to make a choice to make a nose poke to either the left or the right to indicate whether they thought the stimulus was high-rate or low-rate (with “high” and “low” determined by a threshold set by the experimenter during training).

The paper can be fairly clearly delineated into two parts. In part I, the authors use pharmacological techniques to demonstrate a causal role of posterior parietal cortex (PPC) in the task. The experiments can be summed up by this Figure:

The rats, predictably, were able to do a better job of discriminating high-rate from low-rate pulses when both sensory modalities were present. Furthermore, they showed rather convincingly that PPC was involved almost exclusively in the processing of the visual stimulus.

The second part of the paper, in my view, is the real meat of the paper. This part is comprised of a suite of statistical analyses that pick apart the population data to find evidence for functional catagories.

The authors did a great job illustrating this part of the paper with figures.  For example, here is a figure of PSTH’s that points out the way that many people still think about neurons involved in this kind of task:

Solid lines are PSTHs for trials where the stimulus rate was high, dashed lines were for low-rate trials.  Blue are PSTHs for visual-only trials, green are for auditory-only trials. The neurons here demonstrate selectivity for a) choice, b) modality, c) mixtures, d) something dynamic. In the past, the way that one might study these cells is to identify those cells in a) and b), using some selectivity metric, and then declare them a special class of cells that are selective for the choice of the animal and the stimulus, respectively.

However, check out the other half of the same figure:

In (f-h) the authors are showing us histograms of their measure of choice preference (f,g) or modality preference (h) over all cells.  The shaded regions are histograms of the cells for which this metric is significantly > 0. We can easily see from this figure that these “selective” cells (in the shaded regions) do not belong to special populations, but are simply in the tails of some distribution that is centered on 0. Their identity as “selective” is no more than an artifact of our classifying them based on a p-value.  Thus, most cells are more like the cells in panels c) and d) above, displaying some mixture of selectivity, or even some dynamic behavior which is perhaps not well summarized by the PSTH.

I know what you’re thinking.  “Well, maybe there is no special population of cells that are selective for one thing, so what about groups of cells that are selective for particular combination of things?”  In other words, are there cells that cluster with respect to their sensitivity to either the sensory modality, or the choice of the animal?  The answer is mostly given by the following figure:

Panel a) shows a plot of the choice and modality preference metric for each cell.  It is easy to see that there are no clusters.  Of course, the authors go much further in their analysis of clustering, using the top 8 singular vectors of the population PSTHs, and examining distributions of nearest-neighbor angles, and found no evidence of clustering, but I won’t get into that here.

Next, the authors fit separate SVMs to decode choice and modality.  The decoders could accurately decode choice and modality from the same trial, but the choice decoder was independent of the modality and vice versa. This indicates that the information about choice and modality are encoded in the same neuronal population, at the same time, but that these signals do not interfere with each other.

Finally, they did PCA on the population responses during two different phases of the experiment, during the decision phase, when the animal is collecting information about the stimulus, and during the movement phase, when the animal is executing its choice by reorienting its body to the chosen port.

They found that the movement space and the decision space were poorly aligned (see figure below). This observation is similar in spirit to an interesting paper (also by Matt Kauffman), describing why it may be useful for population trajectories to explore mis-aligned vector spaces while performing different functions.

In summary:

1. PPC neurons are involved in vision-dependent decision making.
2. These neurons display non-specific sensitivity to both the choice of the rats and the stimulus.
3. Information about both choice and modality are independently encoded by the population and are available for linear read-out.
4. The population of neurons explore different parts of state space at different times.

Together these findings reject the notion that each neuron in PPC is encoding something specific about the experiment, and that this mixed selectivity is a feature of the network.  The authors provide a well-articulated normative explanation for mixed selectivity:

Theoretical work suggests a major advantage for category-free populations: when parameters are distributed randomly across neurons, an arbitrary group of them can be linearly combined to estimate the parameter needed at a given moment[12–14]. This obviates the need for precisely prepatterned connections between neurons and their downstream targets and also means that all information is transmitted. This latter property could allow the same network to participate in multiple behaviors simply by using different readouts of the neurons.

which seems to be the situation born out by their data.

My only complaint about this paper is that, although I am totally on-board with this hypothesis, it seems to me that in some sense this paper is trying to prove a negative. The fact that they found no evidence for categories is not evidence for absence of categories. To be fair, they allude to other types of categories (cell type, connectivity) in the discussion. Here they are specifically referring to functional categories, which are defined a priori by the experimenter, which is the problem with functional categories to begin with.

# Fast Kronecker trick for Gaussian Process regression with “expressive” kernels

On May 11th, I presented the following paper in lab meeting:

Fast Kernel Learning for Multidimensional Pattern Extrapolation
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai and John P. Cunningham

This paper presents a method for scaling up structured spectral mixture (SM) kernels (from Wilson et al 2013) for Gaussian Process regression to multi-dimensional settings in which many (but not all) of the input points live on a grid (e.g., a 2D image in which some pixels are missing).  The spectral mixture is a very expressive kernel that enhances representation learning, but the problem comes in applying it to large scale data.

To address this problem, the authors developed the spectral mixture product (SMP) kernel, which is just a multiplication of 1D kernels for each dimension. This allows the full kernel matrix for points defined on a $P$-dimensional grid to be represented with a Kronecker product of kernel matrices along each dimension:  $K = k_1 \otimes \cdots \otimes k_P$. This allows the quadratic term and the log-determinant term in the GP log-likelihood to be evaluated very efficiently using singular value decompositions of these component matrices, namely:
where $Q=Q^1\otimes Q^2\otimes ...\otimes Q^P$ (eigenvectors), and $V=V^1\otimes V^2\otimes ...\otimes V^P$ (eigenvalues).

The key contribution of this paper is to show that if inputs are not on a grid — for example, some pixels from a training image are missing — one can insert “imaginary” points to complete the grid, and place infinite measurement noise on these observations so they have no effect on inference. Assuming there’s a dataset of M observations which are not necessarily on a grid, they form a complete grid using W imaginary observations, $y_W\sim\mathcal{N}(f_W,\epsilon^{-1}I_W)$, $\epsilon\rightarrow 0$. The total observation vector $y = [y_M, y_W ]^\top$ has $N = M + W$ entries: $y\sim\mathcal{N} (f, D_N)$, where the noise covariance matrix $D_N =\mbox{diag}(D_M, \epsilon^{-1}I_W), D_M = \sigma^2 I_M$. It’s simple to prove that the imaginary points have no effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution.

However, the relevant matrices are no longer have Kronecker structure due to the fact that the observations do not form a complete grid. The authors get around this using the conjugate gradient method, an iterative method for solving linear equations, to compute the quadratic term of the GP log-likelihood. They compute the log-determinant term using an approximation.

In particular, the authors use preconditioned conjugate gradients to compute $(K_N + D_N)^{ -1} y$ by solving$(K_N + D_N)x = y$ for $x$. For the log-determinant term, they use the approximation:

With these implementation tricks, they can scale up highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, to datasets exceeding $N = 10^5$ training instances, in minutes to 10s of minutes. They obtain many beautiful results, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery.

The take home message for me is that smart implementation tricks can allow GPs with expressive covariances to be applied to large scale problems with non-gridded input.

# Integration dynamics and choice probabilities

Recently in lab meeting, I presented

Sensory integration dynamics in a hierarchical network explains choice probabilities in cortical area MT

Klaus Wimmer, Albert Compte, Alex Roxin, Diogo Peixoto, Alfonso Renart & Jaime de la Rocha. Nature Communications, 2015

Wimmer et al. reanalyze and reinterpret a classic dataset of neural recordings from MT while monkeys perform a motion discrimination task. The classic result shows that the firing rates of neurons in MT are correlated with the monkey’s choice, even when the stimulus is the same. This covariation of neural activity and choice, termed choice probability, could indicate sensory variability causing behavioral variability or it could result from top-down signals that reflect the monkey’s choice. To investigate the source of choice probabilities, the authors use a two-stage, hierarchical network model of integrate and fire neurons tuned to mimic the dynamics of MT and LIP neurons and compare the model to what they find in the data.

The figure above shows the two stages. They key manipulation the authors employ is to change the amount of feedback from the integration stage to the sensory stage. In their model, variability in the sensory stage can arise from 3 possible sources: 1, stimulus variability (red and blue squiggles that lead to E1 and E2 populations); 2, background noise (X); and 3, feedback signals (green bFB).

They find that changing the amount of feedback in this type of network makes several predictions that can be tested using existing data. They argue that feedback signals should account for choice probability (CP) later in the trial, which means the early part should be due to feedforward fluctuations in the stimulus. This allows them to take advantage of a stimulus manipulation in the data: some of the trials are perfect repeats of the same stimulus where as others have the same motion strength but are generated using different random seeds. They call these two conditions replicate and non-replicate, respectively. The model predicts that CP will be smaller for replicate trials during the early part of the motion stimulus because there are no stimulus fluctuations. The model also predicts that correlations between cell pairs will be smaller across the board for replicate stimuli (see the comparison between the black and gray traces in both subplots below).

The authors find that both of their predictions are true in the data they have access to. They analyze CP across 41 neurons with replicate stimuli an 118 neurons with non-replicate stimuli and look at correlations across 32 pairs. They find that neurons were less  variable with replicate stimuli (subpanel a below), that correlations in an example cell pair were smaller for replicate stimuli, and choice probability was lower earlier in the trial for replicate stimuli.

They also interrogated cell pairs to see if noise correlations were indicative of a feedback signal. They sub-selected pairs based on whether they had increasing or decreasing correlations through out the trial. The prediction is that increasing correlations means more feedback, so these cells should have higher CP later in the trial, which is exactly what they found.

The authors end by suggesting that the feedback could serve to strengthen the commitment to a category despite bursts of sensory evidence. Unfortunately, this makes the simulated observer less sensitive. Overall, the paper is a valiant attempt at understanding the source of choice probabilities using decades old data. The insights remain clear: the time-course of CP and noise correlations should be informative about top-down signals during decision-making. Hopefully new data will be available soon to help constrain the space of models.

# Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares

Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014

In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

I’ve done some work on time-frequency representations in the past and often times an investigator expects to find signals that are narrow-band, but possibly amplitude-modulated, and are therefore limited in resolution by the time-frequency uncertainty principle.  However, when you formulate the problem of time-frequency representation as a Bayesian estimation problem, you can get a lot more bang for you buck.

The basic problem is to estimate the Fourier coefficients $x_n$ at time step $n$ for observations $y_n$ described by

$\bold{y}_n = \bold{F}_n \bold{x}_n + \bold{v}_n$

where $\bold{F}_n$ is the real-valued DFT matrix and $\bold{v}\sim \mathcal{N}(0,\sigma^2)$.  The usual way of doing business is to let $n$ be subsamples of the observation times and estimate $x_n$ over some window centered on $n$. Smoothness in time is achieved by having a large overlap between neighboring windows.  Ba et. al take a different approach, which obviates the need for large overlap between windows, reducing the size of $\bold{y}_n$ substantially for each time step.  The authors instead model the relationship between neighboring (in time) Fourier coefficients by

$\bold{x}_n = \bold{x}_{n-1}+\bold{w}_n$

where $\bold{w}_n$ is drawn from some distribution $p_W(w)$.

Thus, the trick is to bring prior information to bare in the least-squares problem, by specifying the structure of the solution that you expect through $p_W(w)$.

The authors offered two possibilities:

$\log p_1(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}w^2_{n,k} + \epsilon^2\right)^{1/2} +c_1$,

and

$\log p_2(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}\left(w^2_{n,k} + \epsilon^2\right)^{1/2}\right)^{1/2} +c_2$.

Now, I admit that these priors look kind of funny, but there is a good reason why these were the priors specified in this paper.  These priors are “epsilon-perturbed” $\ell_\nu$ norms, given by

$L(\bold{x}) = -\frac{1}{2}\sum^M_{i=1}(\epsilon^2+x_i^2)^{\nu/2}$.

This”epsilon-perturbed” version of the $\ell_\nu$ norms have rounded out corners in their level curves, like in the following picture from the authors’ earlier paper:

It turns out that distributions with this form come from the normal/independent family, commonly used for robust regression and which lend themselves to EM by IRLS!

Thus $\log p_1$ is approximately an $\ell_1$ prior over frequencies (indexed by $k$), and a $\ell_2$ prior over changes in $x_n$ across time (indexed by $n$).  The latter can also be interpreted as a discrete version of the total variation norm.  The result is that $\log p_1$ biases the solutions toward things that are sparse in frequency but smooth in time.  Alternatively, $\log p_2$ is like a $\ell_1$ norm over frequencies, and a $\ell_1$ norm on transitions, also called a “fused LASSO”.  This prior will bias solutions toward things that are sparse infrequency and change abruptly in time.

The full MAP estimate of $\bold{x}_n$ can be found by IRLS, where the algorithm ends up being repeated application of the Kalman filtering equations forward in time followed by the Kalman smoother equations backward in time until convergence. The results for a simulated example of a 10Hz and 11Hz component, where one has amplitude that is ramping up exponentially fast, and the other is pulsing, is shown in the Figure at the top of the page.  An example using real EEG data looks like this:

The authors didn’t really provide any validation other than the eye-ball test, and it wasn’t clear how to evaluate performance of the method for real data sets.  However, the results do look pretty good for the test cases they illustrated. I also would have liked to get more details about choosing window width and frequency resolution for data analysis.

I think that the application of the IRLS algorithm for solving this problem is really clever and there are a lot of possibilities for generalizing to different priors that, for example, accommodate frequency modulation. I hope to see some follow-up papers were they use the method to do some fun data analysis and maybe flesh out these details a bit more.

# Attractors, chaos, and network dynamics for short term memory

In a recent lab meeting, I presented the following paper from Larry Abbott’s group:

From fixed points to chaos: Three models of delayed discrimination.
Barak, Sussillo, Romo, Tsodyks, & Abbott,
Progress. in Neurobiology 103:214-22 2013.

The paper seeks to connect the statistical properties of neurons in pre-frontal cortex (PFC) during short-term memory with those exhibited by several dynamical models for neural population responses. In a sense, it can be considered a follow-up to Christian Machens’ beautiful 2005 Science paper [2], which showed how a simple attractor model could support behavior in a two-interval discrimination task. The problem with the Machens/Brody/Romo account (which relied on mutual inhibition between two competing populations) is that it predicts extremely stereotyped response profiles, with all neurons in each population exhibiting the same profile.

But real PFC neurons don’t look like that (Fig 1):

So what’s the deal?

Neural heterogeneity in cortex is a hot topic these days, so this paper fits in with a surge of recent efforts to make sense of it (e.g., [3-6], and our recent paper Park et al 2014 [7]).  But instead of the “neural coding” question (i.e., what information do these neurons carry, and how?) this paper approaches heterogeneity from a mechanistic / dynamics standpoint: what model would suffice to give rise to these kinds of responses? In particular, it compares three dynamical models for PFC activity:

1. line attractor – Machens, Romo & Brody style, with two populations.
(We already know this won’t work!)
2. randomly connected network – chaotic network with sparse, random connectivity and a trained set of linear “readout” weights. This represents the “reservoir computing” idea so beloved by Europeans (cf. echo state network [8] / liquid machine [9]). The idea is to project the input to a nonlinear, high-dimensional feature space, where everything is linearly separable, and then do linear regression. There’s no need to tune anything except the output weights.
3. “trained” randomly connected network – similar to the FORCE-trained network of Sussillo & Abbott 2009 [10] fame: you initialize a randomly connected network then train it to be not-so-chaotic, so it does what you want it to. Less structured than #1 but more structured than #2.  (Good mnemonic here: “TRAIN”).

All three networks are adjusted so that their performance (at categorizing the frequency of a second tactile vibration stimulus as “lower” or “higher” than the frequency stored in memory) matches that of the monkeys (around 95%).

Results

The setup leads us to expect that the TRAIN network should do best here, and that’s more or less what happens.  The TRAIN model occupies a kind of middle ground between the stereotyped responses of the attractor model and the wild-and-crazy responses of the
chaotic network:

Wild-and-crazinessis is quantified by similarity between response to the (1st) stimulus and response during the delay period; for the attractor model these are perfectly correlated, whereas for the chaotic network they’re virtually uncorrelated after a small time delay. The TRAIN network is somewhere in between (compare to Fig 1):

Unfortunately, however, it’s not a Total Victory for any model, as TRAIN falls down on some other key metrics. (For example: the percent of the total response variance devoted to encoding the frequency of the first stimulus — identified by demixed PCA — is 28% for TRAIN, vs. 5% for the chaotic network, and only 2% for the neural data!).

So we haven’t quite cracked it, although the TRAIN approach seems promising.

Conclusion:

The paper makes for a fun, engaging read, and it nicely integrates a lot of ideas that don’t normally get to hang out together (attractor dynamics, reservoir computing, hessian-free optimization, state space pictures of neural population activity). If I have one criticism, it’s that the role for the TRAIN network doesn’t seem clearly defined enough in comparison to the other two models.  On the one hand, it’s still just a sparse, randomly connected network: if a randomly connected network can already produce the behavior, then what’s the justification for training it? On the other hand, if we’re going to go to the trouble of training the network, why not train it to reproduce the responses of neurons actually recorded during the experiment (i.e., instead of just training it to produce the behavior?) Surely if the model has a rich enough dynamical repertoire to produce the behavior and match the response dynamics of real neurons, this would be a useful thing to know (i.e., a nice existence proof). But the fact that this particular training procedure failed to produce a network with matching response types seems harder to interpret.

More broadly, it seems that we probably need a richer dataset to say anything definitive about the neural dynamics underlying short-term memory (e.g., with variable time delay between first and second stimulus, and with a greater range of task difficulties.)

Intriguing failure and compelling future direction:

The authors point out that no model accounted for an observed increase in the number of tuned responses toward the end of the delay period. They suggest that we might need a model with synaptic plasticity to account for this effect:

“These surprising finding could indicate the extraction of information about the stimulus from a synaptic form (selective patterns of short-term synaptic facilitation) into a spiking form. It remains a challenge to future work to develop echo-state random networks and TRAIN networks with short-term synaptic plasticity of recurrent connections.”

References

1. Omri Barak, David Sussillo, Ranulfo Romo, Misha Tsodyks, and L.F. Abbott. “From fixed points to chaos: Three models of delayed discrimination”. Progress in Neurobiology, 103(0):214–222, 2013. Conversion of Sensory Signals into Perceptions, Memories and Decisions.
2. Christian K Machens, Ranulfo Romo, and Carlos D Brody. “Flexible control of mutual inhibition: a neural model of two-interval discrimination.” Science, 307(5712):1121–1124, 2005.
3. Valerio Mante, David Sussillo, Krishna V Shenoy, and William T Newsome. “Context-dependent computation by recurrent dynamics in prefrontal cortex”. Nature, 503(7474):78–84, 2013.
4. Mattia Rigotti, Omri Barak, Melissa R Warden, Xiao-Jing Wang, Nathaniel D Daw, Earl K Miller, and Stefano Fusi. “The importance of mixed selectivity in complex cognitive tasks.” Nature, 497(7451):585–590, 2013.
5. Kaufman, M. T.; Churchland, M. M.; Ryu, S. I. & Shenoy, K. V. “Cortical activity in the null space: permitting preparation without movement”. Nat Neurosci 2014.
6. David Raposo, Matthew T Kaufman, and Anne K Churchland. A category-free neural population sup- ports evolving demands during decision-making. Nature neuroscience, 2014.
7. [6]  Il Memming Park, Miriam LR Meister, Alexander C Huk, and Jonathan W Pillow. Encoding and decoding in parietal cortex during sensorimotor decision-making. Nature neuroscience, 17:1395–1403, 2014.
8. Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks. German National Research Center for Information Technology GMD Technical Report, 148:34, 2001.
9. Wolfgang Maass, Thomas Natschläger, and Henry Markram. “Real-time computing without stable states: A new framework for neural computation based on perturbations.” Neural Computation, 14:2531–2560, 2002.
10. David Sussillo and L. F. Abbott. “Generating coherent patterns of activity from chaotic neural networks”. Neuron, 63(4):544–557, 2009.