# Automated identification of mouse behavioral modules (using nonparametric Bayesian AR-HMM)

A few weeks ago, Lea and Anqi co-presented a paper from Robert Datta’s group for a joint lab meeting with 3 other labs:

Mapping Sub-Second Structure in Mouse Behavior
Wiltschko, Johnson, Iurilli, Peterson, Katon, Pashkovski, Abraira, Adams, & Datta. Neuron (2015).

It might not rival Newton’s apple, which led to his formulating the law of gravity, but the collapse of a lighting scaffold played a key role in the discovery that mice, like humans, have body language.

Behaviors are considered to evolve according to stereotyped forms, or modules, that are arranged in sequence to enable animals to accomplish particular goals. There are some recent technical advances that facilitate more comprehensive characterization of the components of behavior, but so far these have mostly only been applied to invertebrates. For mammals, current approaches tend to rely on human observers to specify what constitutes a meaningful behavioral module. The performance of these methods is therefore limited by human perception and intuition. Given the need for automated methods for identifying (and classifying) behavior in mammals, Wiltschko et al proposed an unsupervised learning method for automatically clustering patterns of mouse behavior without human supervision. They used state-of-the-art machine-learning algorithms to systematically describe short time-scale structure of behavior in mice and to understand how the brain might alter this structure to facilitate adaptive behaviors in response to environmental changes.

In order to investigate this, Wiltschko et al collected three-dimensional depth imaging data of freely behaving mice. They first developed a number of model-free approaches to see whether there exists block-wise structure in the data, hinting at a potential organization of behavior into stereotyped modules, while also giving insight into the time-scales at which these modules are organized. Surprisingly, block-wise structure was already apparent by visual inspection alone. A change-point analysis revealed that the mean block duration was about 350 ms, in accordance with a temporal autocorrelation analysis and also roughly matching the timescale of the blocks apparent upon visual inspection. The authors also carried out a PCA analysis to show that these segmented components are sub-second blocks of behaviors that encode recognizable actions, and thus serve as stereotyped and reused behavioral modules.

After pre-processing and dimensionality reduction of the imaging data, the authors investigated a series of models aiming to discover modular structure in  mouse behavior, using the model-free analysis to inform certain choices of hyper-parameters. The pipeline for the data preprocessing, dimensional compression, PCA component extraction and model fitting is shown below.

The final model that the authors used in the paper is the AR-HMM. As the name suggests, the AR-HMM combines an autoregressive (AR) model with a Hidden Markov Model (HMM). The basic idea is to model behavioral modules as a discrete hidden state, with Markovian state transitions between modules at each time point. Depending on the current behavior (i.e the value of the hidden state) the mouse’s pose evolves according to module-specific autoregressive dynamics. Once the behavioral module switches (e.g. the mouse stops walking and pauses) the dynamics that govern the evolution of the mouse’s pose also switch. Thus, each behavioral module is associated with module-specific AR dynamics and the model can be viewed as a dynamic mixture model. This can hence capture the smooth evolution of the mouse’s pose through PC space during the same behavioral epoch, but can also account for abrupt changes in pose dynamics due to changes in behavior via the latent switching state.

The final AR-HMM model that the authors use is non-parametric, allowing them to remain agnostic about the number of behavioral modules that can be identified. This HDP AR-HMM model has previously been developed in [EEMA08, EMEA09, EEMA09]. The generative model is of the form:where the Matrix Normal Inverse-Wishart (MNIW) prior is shared across the different sets of model parameters. A graphical model of the AR-HMM is shown in Figure 1 (not including parameters and priors for simplicity).

The HDP AR-HMM still requires specifying certain hyperparameters that will likely affect important characteristics of the fitted model. Most notably, the stickiness parameter $\kappa$ dictates how long the model will remain in a given state by affecting the probability of self-transitioning. Furthermore, the concentration parameter $\alpha$ determines the spread of the DP and thus affects the number of motifs the model will identify. While a non-informative prior is put over $\alpha$, $\kappa$ is set using the previous change-point analysis, essentially tuning the model to the time-scale of interest. In order to determine the relevant number of time-lags in the AR-component of the model, the authors placed a block ARD prior over $K_0$. Finally, inference in the HDP AR-HMM was carried out via Gibbs Sampling.

The HDP AR-HMM is a powerful and flexible modeling framework for identifying behavioral modules in mice and relates closely to a larger body of work on switching time-series models with Markovian structure (Jump-Markov models, Switching Linear Dynamical Systems). These models have been applied to problems ranging from speech recognition [BD07] to neural data analysis [BMJGSKM11], illustrating their flexibility and expressive power.

To demonstrate the advantages of using the AR-HMM compared to previous supervised approaches, the authors first characterized baseline patterns of behavior when the mouse is freely moving in a circular open field. They found that the model manages to group stereotyped behaviors (like e.g. a low rear or a left turn) in the same cluster and can identify meaningful modules that are representative of normal mouse exploratory behaviors in the laboratory — all in an entirely unsupervised fashion.

Next, the authors utilized the AR-HMM model to gain insight into the nature of behavioral changes that may arise due to changes in the environment. First, they changed the circular field to a square arena and discovered that a small number of behavioral modules are uniquely expressed in only the circular or the square arena, while other modules are expressed more or less frequently in one or the other arena. Second, the authors delivered an aversive fox odor to a quadrant of the square arena, which is known to lead to profound changes in mouse behavior characterized by odor investigation, escape, and freezing. The AR-HMM analysis revealed that the seemingly new behaviors were best described by up- or down-regulating modules that were identified previously during normal behavior, together with an altered transition structure between the individual modules. Thus, the behavioral changes associated with the aversive odor are not due to the introduction of new behavioral modules but are rather regulated via a re-arrangement of the components of normal behavior.

Finally, the authors investigated how genetic mutations and manipulations of neural activity may influence the sub-second structure of mouse behavior. They first characterized the phenotype of mice carrying a homozygous or heterozygous genetic mutation using the AR-HMM model framework. While the homozygous mutation results in an abnormal waddling walk, no phenotype has previously been reported for the heterozygous mutation, in which mice only have one copy of the mutated gene. The AR-HMM model correctly identifies a behavioral module unique to the homozygous mutants representing the waddling gate, but also shows that other behavioral modules are up-regulated in these mutants. Furthermore, the analysis revealed that the heterozygous mice do have a phenotype that distinguished them from wild type mice: they over-express the same set of modules that were up-regulated in the homozygous mutants, while failing to express the module for waddling gait. Thus, the AR-HMM allows one to gain insight into subtle behavioral differences that were not previously identifiable by human observers. The authors next investigate the effects of manipulating neural activity by optical stimulation using unilaterally expressed Channelrhodopsin-2 in a subset of layer 5 corticostriatal neurons in motor cortex. AR-HMM identifies and characterizes both obvious and subtle optogenetically induced phenotypes, distinguishing “new” optogenetically induced behaviors from up-regulated expression of “old” behaviors.

Overall, we think that unsupervised probabilistic machine learning is an interesting approach to better understanding the architecture of mouse behavior. Using a generative modeling approach provides a principled framework for investigating questions relating not only to the organization of behavior at fine time scales, but also to the influence of environmental factors, individual genes or neural circuits on mouse behavior. All in all, this work may facilitate a better fundamental understanding of how the brain can build and adapt patterns of behavior and may also have interesting implications for diagnosing disease based on subtle behavioral phenotypes.

References:

[EEMA08] Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. An hdp-hmm for systems with state persistence. In Proceedings of the 25th international conference on Machine learning, pages 312–319. ACM, 2008.

[EMEA09] Emily Fox, Michael I Jordan, Erik B Sudderth, and Alan S Willsky. Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing Systems, pages 549–557, 2009.

[EEMA09] Emily Fox, Erik Sudderth, Michael Jordan, and A Willsky. Nonparametric bayesian identification of jump systems with sparse dependencies. In Proc. 15th IFAC Sympo- sium on System Identification, pages 1886–1898, 2009.

[BD07] Bertrand Mesot and David Barber. Switching linear dynamical systems for noise ro- bust speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 15(6):1850–1858, 2007.

[BMJGSKM11] Biljana Petreska, M Yu Byron, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy, and Maneesh Sahani. Dynamical segmentation of single trials from population neural data. In Advances in neural information processing systems, pages 756–764, 2011.

# Discussion of Fox & Friends, “Bayesian Structure Learning for Stationary Time Series”

[latexpage]

Last week in lab meeting I discussed a paper from Emily Fox’s group [TFF15], Bayesian Structure Learning for Stationary Time Series.

If you’ve read Carvalho and Scott’s paper on Bayesian inference for graphical models [CS09], then you’ll find [TFF15] a natural extension to [CS09] and reading the latter is surely the best way to get your head wrapped around the former.

From my view [TFF15] makes 2 main contributions:
1) Formulating the graph inference problem in terms of time series by borrowing from [CS09]
2) Formulating the problem in the Fourier domain, making the problem tractable.

On the way to making the problem properly specified, [TFF15] introduced a new conjugate prior on complex normal distributions defined on graphs, the hyper- complex inverse Wishart distribution.

To give you a little background, [CS09] used an algorithm (FINCS, originally developed in [CS08]) for inferring a undirected graph that defines conditional dependence between random variables when

1. variables are Gaussian distributed
2. observations are independent
3. the admissible graphs are “decomposable,” which mean that the graph can be unambiguously described by a bunch of cliques and their intersecting nodes (called “separators”).

Solving the problem in these terms provides some convenient simplifications.

The first simplification, owing to the variables being Gaussian, is a well-known result by Dempster [Demp72],

Suppose $X = (X_1,\dots,X_p)^T\sim \mathcal{N}(0,\Sigma)$, and $Z_{ij}\equiv \{\text{all } X_k, k\in (1,\dots,p)| k\neq i,j\}$. Then $[\Sigma]_{ij}^{-1} = 0$ implies $P(X_i,X_j|Z_{ij})=P(X_i|Z_{ij})P(X_j|Z_{ij})$.

In other words, $X_i$ and $X_j$ are independent, conditionally on all other variables, when the inverse covariance matrix has a 0 in the $i,j^\text{th}$ element. Thus, the problem of evaluating the likelihood of a graph becomes the problem of evaluating the likelihood of a covariance that satisfies the corresponding pattern of 0’s in $\Sigma^{-1}$.

The simplification relies upon the restriction of searching over decomposable graphs on more way than one. Not only does FINCS relyupon this assumption but decomposability provides for a convenient factorization of the prior. Conditioned on the graph $G$, the prior can be decomposed in terms of cliques $\mathcal{C}$ and separators $\mathcal{S}$:

$p(\Sigma|G) = \frac{\prod_{C\in\mathcal{C}}p(\Sigma_C|\delta,W_C)}{\prod_{S\in\mathcal{S}}p(\Sigma_S|delta,W_S)},$

where $(b, D)$ are the parameters of the hyper- inverse Wishart distribution; the conjugate prior for the covariance parameters of a Gaussian distribution that obeys the conditional independence laws defined by the graph. This dramatically reduces the complexity since operations need only be performed on covariance matrices that are the size of the cliques and separators, and not on the full dimensionality of the observations.

The snag that kept [TFF15] from using [CS09] more or less out of the box is that time series do not have independent observations, except in the trivial case (negating any simplification afforded by assumption (2), in the above list).  Thus, the likelihood over times series does not factorize over observations, thereby requiring us to contend with potentially very large covariance matrices. The key insight that [TFF15] made is that moving to the Fourier domain allowed them to make use of the Whittle approximation, which states that Fourier coefficients of stationary times series are asymptotically independent. This is a clever trick that allowed [TFF15] to operate as if the observations $d$ (now indexed by frequency rather than time) were independent. The one wrinkle is that the observations were now complex.

The complex likelihood required [TFF15] to derive a new hyper-Markov conjugate prior over the covariance matrices (or rather, spectral density matrics $S$), following the work of [DaLa93]. The result was a complex generalization of the hyper- inverse Wishart distribution, aptly named, the “hyper- complex inverse Wishart” (the “hyper-” part denotes that the prior obeys conditional independence laws that are specified by the graph).

After specifying the prior over the spectral densities, [TFF15] specify a hyperprior over the graph, which determines the number of edges $k$.  They also go further in placing a Beta $(a,b)$ prior over the probability of an edge.  The full graphical model is schematized in the left panel below.

It turns out you can marginalize out both the probability of edges $r$ and the spectral density of cliques and separators so that the final model is greatly simplified (right panel above) while still having the flexibility of a proper Bayesian formulation.

You can check out the analysis results for yourselves but I think that they look pretty awesome and make a great argument in favor of their approach.

Now, I’m not an expert on graph theory but I think that the most dubious of the assumptions that [TFF15] inherited from [CS09] is the requirement that all admissible graphs be decomposable. Its not clear to me how (or if) this is a very strong restriction and I’m curious how the algorithm adapts when the model is misspecified. When does it matter? How could this kind of bias influence scientific conclusions drawn from the inferred graphs? I’m also interested in how one might leverage these techniques to infer directed graphs.

If others have intuition about these points then I’m curious to know.

References:

[TFF15] Tank A, Foti NJ, and Fox E. Bayesian Structure Learning for Stationary Time Series. Uncertainty in Artificial Intelligence (UAI), 2015.

[CS09] C. M. Carvalho and J. G. Scott. Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96(3):497?512, 2009

[CS08] C. M. Carvalho and J. G. Scott. Feature-inclusion stochastic search for Gaussian graphical models. J. Computational and Graphical Statistics, 17(4):790-808, 2008

[Demp72] Dempster, A. Covariance selection. Bometics, 28, 157-175

[DaLa93] A. P. Dawid and S. L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist., 21(3):1272–1317, 1993.

# Estimating within-session changes in firing rate

In the lab meeting on March 23, I presented the following paper:

Learning In Spike Trains: Estimating Within-Session Changes In Firing Rate Using Weighted Interpolation
Greg Jensen, Fabian Munoz, Vincent P Ferrera
bioRxiv (2016).

This paper seeks to develop a method that can track non-stationary dynamics of firing rates. Specifically, it deals with two problems:
1. How to estimate the firing rate accurately from single (or a small number of ) trials?
2. Given a (possibly inhomogeneous) session of consecutive trials, how to subdivide the session into ensembles of trials of similar features?

1. Single-trial rate estimation “ARRIS”

Their method, named ARRIS (Adaptive Rate Regression Involving Steps), uses reversible jump MCMC (RJMCMC) to detect step-like transitions of firing rates.

The RJMCMC (Green 1995) method is a variant of MCMC algorithm which dynamically adjusts the number of parameters (dimensionality of posterior), by adding/removing them as parts of its simulations. RJMCMC was used in a previous method called BARS (Bayesian Adaptive Regression Splines, DiMatteo et al 2001), where they dynamically add/remove knots (spline points).

In this paper’s ARRIS method, they use RJMCMC to determine step functions. The authors emphasize that unlike the previous rate estimation methods which almost always shows attenuation at the transition boundary (and more strongly with smaller samples), the ARRIS method captures sharp transitions successfully even with small samples. The method also works when the transition is smooth, because the MCMC generates an ensemble of steps.

2. Weighted interpolation

On the other hand, the firing rate dynamics varies in a non-random way in many natural applications. It’s nice that one could estimate the firing rates from single trials, but one should also be able to benefit from looking at the next trial in time, which presumably has a similar firing pattern. Based on this idea, the authors consider a weighted ensemble of trials centered on the time of interest, using a tri-cube weight function with a single parameter (the bandwidth). Afterwards, ARRIS could be applied to the weighted ensemble of firing data. Now the problem is: What is the optimal bandwidth of the weight function? In other words, about how many trials should one average over?

Optimal bandwidth determination:
The authors use generalized cross-validation (GCV) to find the optimal bandwidth across trials. Importantly, the across-trial bandwidth is optimized by GCV at each trial $r$ separately. The resulting bandwidth $h(r)$ is itself a good representation of data variability.
More specifically, in order to find the optimal bandwidth $h(r)$ at a given trial $r$, one needs to iterate the following. Specify a fixed bandwidth $h$; Construct a weighted ensemble of spikes with across-trial width $h$ and centered at trial $r$; Calculate a generalized cross-validation score $\textrm{GCV}(h,r)$ by running ARRIS on this weighted ensemble while leaving one time-point out at a time.

Rapid evaluation of the bandwidth:
However, there is a practical difficulty in executing the optimization procedure described above. The most obvious reason is that MCMC is already computationally intensive, and that iterating over many candidate values of $h$ may be too expensive. But more fundamentally, the single-trial estimate of the firing rate would fluctuate, because it depends on a stochastic algorithm (RJMCMC). It means the optimization of $h(r)$ will not be robust, because the GCV score will turn out differently every time it is calculated.

To get around this problem the authors suggest using a kernel density approximation, where the ARRIS estimate is replaced by a box kernel with a fixed bandwidth (*note* this bandwidth is within-trial, not across-trial). This within-trial bandwidth is determined by a simple estimator that depends on spike interval median (rather than being optimized). The approximation is only used to evaluate the optimal bandwidth, then ARRIS can be used to obtain the final estimate at the optimal bandwidth.

Overall, this is a nice paper that provides a readily applicable method for learning non-stationary firing dynamics from small samples. It has two contributions: (1) the step-function-based ARRIS method for single-trial firing rate estimate, and (2) the weighted interpolation framework that involves the determination the optimal bandwidth $h(r)$ which captures data variability.

• For better representation, the authors suggest that the optimal bandwidth $h(r)$ can be smoothed once again with a “smoothing bandwidth $h_{s}$” (which is optimized by another round of GCV).
• There are three different uses of the term “bandwidth” in the paper, which may be confusing: the across-trial bandwidth $h(r)$ for the weighted ensemble construction which is at the core of the paper, the smoothing bandwidth that goes on top of $h(r)$, and finally the within-trial bandwidth that is introduced for the kernel density approximation.

# Neuronal Circuits Underlying Persistent Representations Despite Time Varying Activity

This week in computational neuroscience journal club we discussed the paper

Neuronal Circuits Underlying Persistent Representations Despite Time Varying Activity
Shaul Druckmann and Dmitri B. Chklovskii
Current Biology, 22(22):2095–2103, 2012.

The main idea of this work was to demonstrate that the fact that non-trivial temporal dynamics occur in neural circuits, this activity does not prevent such circuits from persistently representing a constant stimulus. The main example used (although other neural regions, such as V1 are discussed) is working memory in pre-frontal cortex. The authors cite previous literature that has shown that even when a stimulus is stored’ in memory, there is still significant time-varying neural activity. While the authors make no indication that the represented stimulus must be preserved (i.e. temporal dynamics are due to temporally changing stimulus representations), they do present a viable architecture that describes how the temporal dynamics do not effect the stimulus stored for the working memory task.

At a high level, the authors use the high redundancy of neural circuits to introduce a network that can retain a stimulus in the space of allowable features while still permitting neural activity to continue changing the null-space of the feature matrix. This paper ties together some very nice concepts from linear algebra and network dynamics to present a nice and tidy proof-by-construction’ of the hypothesis that evolving networks can retain constant stimuli information. To tie this paper in a broader setting, the network dynamics here are similar to those used in network encoding models (e.g. the locally competitive algorithm for sparse coding) and the idea of non-trivial activity in the null space of a neural representation ties in nicely to later work discussing preparatory activity in motion tasks.

The formal mathematics of the paper are concerned with neural representations based on the linear generative model. This model asserts that a stimulus $s\in\mathbb{R}^M$ can be composed of a linear sum of $K$ dictionary elements $d_i \in\mathbb{R}^K$

$s = \sum_{i=1}^N d_i a_i = Da$

where the coefficients $a_i$ represent amount of each $d_i$ needed to construct $s$. In the network, $a_i$ is also the activity of neuron $i$, meaning that activity in neuron $i$ directly relates to the stimulus encoded by the network.  If the number of dictionary elements $K$ is the same as the dimension of the stimulus $M$, then there is no possible way for the neural activity $a$ to deviate from the unique solution $a = D^{-1}s$ without changing the stimulus encoded by the system. The authors note, however, that in many neural systems, the neural code is highly redundant (i.e. $K >> M$), indicating that the matrix $D$ has a large null-space and thus there are an infinite number of ways that $a$ can change while still faithfully encoding the stimulus.

With the idea of allowing neural activity to vary within the nullspace, the authors turn to an analysis of a specific neural network evolution equation where the change in neural activity decays is defined by a leaky-integrator-type system

$\tau \frac{d}{dt} a = -a + La$

where $\tau$ is the network time constant, and $L$ is the network connectivity matrix, i.e. $L_{i,j}$ indicates how activity in neuron $j$ influences neuron $i$. The authors do not specify a particular input method, however it seems that it is assumed that at $t=0$ the neural activity encodes the stimulus. To keep the stimulus constant, the authors observe that setting

$\frac{d}{dt}s = D\frac{d}{dt}a = 0$

results in the feature vector recombination (FEVER) rule

$D = DL$

This rule is the primary focus of attention of the paper. The FEVER rule is a mathematically simple rule that lays out the necessary condition for a linear network (and some classes of non-linear networks) to have time-varying activity while encoding the same stimulus. Note that this concept is in stark contrast to much of the encoding literature (particularly for vision) which utilizes over-complete neural codes to encode stimuli more efficiently. Instead this work does not try to constrain the encoding any more than the encoding must represent the stimulus. The remainder of this paper is concerned with how such networks can be found given a feature dictionary $D$ and the resulting properties of the connectivity matrix $L$. Specifically the problem of finding $L$ given $D$ is highly under-determined. In fact there are $K^2$ unknowns and only $M$ equations. As we already have $K>M$, this problem requires fairly strong regularization in order to yield a good solution. Two methods are proposed to solve for $L$. The first method seeks a sparse connectivity matrix by solving the $\ell_1$-regularized least squares (LASSO) optimization

$\widehat{L} = \arg\min_{L} \|D - DL\|_F^2 + \lambda\sum_{i,j} |L_{i,j}|$

where $\lambda$ trades off between the sparsity of $L$ and how strictly we adhere to the FEVER rule. This optimization, however, could result in neurons that both excite and inhibit other neurons. By Dale’s law, this should not be possible, so an alternate optimization is proposed. Setting $L = E-N$ where $E$ is an excitatory matrix and $N$ is an inhibition matrix, the authors make the assumption that there are dispersed, yet sparse excitation connections ($E$ is sparse) and that there are few, widely connected, inhibitory neurons ($N$ is low-rank). Thus these two matrices can be solved using a “sparse + low-rank” optimization program

$\{\widehat{E},\widehat{N}\} = \arg\min_{E,N} \|D - D(E-N)\|_F^2 + \lambda_1\sum_{i,j} |E_{i,j}| + \lambda_2\|N\|_{\ast}$

where $\lambda_1,\lambda_2$ trade off between the sparsity of $E$, rank of $N$ and adherence to the FEVER rule. This optimization is surprisingly well defined and allows for the determination of $L$ from $D$. Before implementing this learning, however, the authors note two more aspects of the model. First, by using a modified FEVER rule $\alpha D = DL$, a memory element can be built into the network. Second, certain classes of nonlinear networks can use the FEVER rule to preserve stimulus encoding. Specifically, networks that evolve as

$\tau \frac{d}{dt} a = -f(a) + Lf(a)$

with a point-wise nonlinearity $f(\cdot)$ will have $ds/dt = 0$ for a linear generative model. This property only holds, however, since the nonlinearity comes before the linear summation $L$. If the nonlinearity came afterwards, the derivative of $f(\cdot)$ would have to be accounted for, complicating the calculations. The authors use these two properties to introduce realistic properties, such as fading memory and spiking activity, into the FEVER network.

With the network architecture set and an inference method for $L$, the authors then analyze the resulting properties of the connectivity $L$. Specifically, the authors make note of the eigenvalues and engenvectors $L$ as well as the occurrence of different motifs (connectivity patters). The authors also make indirect observations by looking at the resulting time-traces resulting from the dynamics under the FEVER network.

In terms of the eigenvalues – arguably the most important aspect of a linear dynamical system – the first major observation is that $L$ must have at least $M$ eigenvalues at one. This necessitates from the fact that in the FEVER condition, $L$ must preserve all the stimulus dimensions. As all other eigenvalues lie in the null-space of $D$, the authors do not prove stability directly, but instead demonstrate empirically that the FEVER networks do not result in large eigenvalues. It most likely helps that sparsity constraints attempt to reduce the overall magnitude of the network connections, reducing the chance that eigenvalues not necessary for maintaining the stimulus will be bias towards smaller values (more explicitly so in the low-rank segment of the Dale’s FEVER inference). As a side note, the authors do present, however, in the supplementary information a proof that inhibitory connections are required for stability.

For the eigenvectors, the authors mostly aim to show that there are mostly sparse connections between neurons (i.e. FEVER networks are not fully connected). Branching from this measure, the total number of single connections, pairs of connected neurons, and motifs of triads of connected neurons are compared to the similar counts in rat V1, layer V. The authors show that the above-chance occurrence of these connectivity patters match the above-chance occurrences in rat V1.

For the indirect relation to biological systems, the authors also show how the temporal evolution in time can match the behavior of pre-frontal cortex in working memory tasks. Specifically, a spiking FEVER network is driven to a given stimulus, and then allowed to evolve naturally. The authors observe that the PSTH of subsets of neurons match the qualitative behaviors observed in biological studies: ramp up, ramp down, and time-invariant.

The authors use this accumulation of evidence to as a steps to the conclusion that time-varying activity and persistent stimulus encoding are not mutually exclusive. All in all I believe the following quote from the paper summarizes the paper quite nicely:

“Our study does not refute the idea of explicit coding of time in working memory but rather shows that time-varying activity does not necessarily imply that the underlying network stimulus representations explicitly encodes time-varying properties”

# Fast approximate inference for directed graphical model: a Bayesian auto-encoder

In this week’s lab meting, I presented the following paper from Max Welling’s group:

Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
arXiv, 2013.

The paper proposed an efficient inference and learning method for directed probabilistic
models with continuous latent variables (with intractable posterior distributions), for use with large datasets. The directed graphical model under consideration is as follows,

The dataset is $\mathbf{X}=\{\mathbf{x}^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples of some continuous or discrete variable $\mathbf{x}$. $\mathbf{z}$ is an unobserved continuous random variable generating the data (solid lines: $p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})$), where $\mathbf{\theta}$ is the parameter set involved in the generative model. The ultimate task is to learn both $\mathbf{\theta}$ and $\mathbf{z}$. A general method to solve such a problem is to marginalize out $\mathbf{z}$ to get the marginal likelihood $p_\mathbf{\theta}(\mathbf{x})=\int p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})d\mathbf{z}$, and maximize this likelihood  to learn $\mathbf{\theta}$. However, in many application cases, e.g. a neural network with a nonlinear hidden layer, the integral is intractable. In order to overcome this intractability, sampling-based methods, e.g. Monte Carlo EM, are introduced. But when the dataset is large, batch optimization is too costly and sampling loop per datapoint is very expensive. Therefore, the paper introduced a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.

First, they defined a recognition model $q_\Phi(\mathbf{z}|\mathbf{x})$: an approximation to the intractable true posterior $p_\mathbf{\theta}(\mathbf{z}|\mathbf{x})$, which is interpreted as a probabilistic encoder (dash line in the directed graph), and correspondingly, $p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})$ is the probabilistic decoder. Given the recognition model, the variational lower bound $\mathcal{L}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)})$ is defined as

$\mbox{log}p_\mathbf{\theta}(\mathbf{x}^{(i)})\ge\mathcal{L}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)})=-D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})||p_\mathbf{\theta}(\mathbf{z}))+\mathbb{E}_{q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\mbox{log}p_\mathbf{\theta}(\mathbf{x}^{(i)}|\mathbf{z})\right]$

In the paper’s setting,

$p_\mathbf{\theta}(\mathbf{z})=\mathcal{N}(\mathbf{0},\mathbf{I})$$q_\Phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{\mu},\mathbf{\sigma}^2)$

Therefore, $D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{i})||p_\mathbf{\theta}(\mathbf{z}))$ has an analytical form. The major tricky term is the expectation which usually doesn’t have any closed solution. The usual Monte Carlo estimator for this type of problem exhibits very high variance and is not capable to take derivatives w.r.t. $\Phi$. Given such a problem, the paper proposed a reparameterization trick of the expectation term yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.

The key reparameterization trick constructs samples $\mathbf{z}\sim q_\Phi(\mathbf{z}|\mathbf{x})$ in two steps:

1. $\mathbf{\epsilon} \sim p(\mathbf{\epsilon})$ (random seed independent of $\Phi$)
2.  $\mathbf{z}=g(\Phi,\mathbf{\epsilon},\mathbf{x})$ (differentiable perturbation)

such that $\mathbf{z}\sim q_\Phi(\mathbf{z}|\mathbf{x})$ (the correct distribution). This yields an estimator which typically has less variance than the generic estimator:

$\mathcal{L}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)})=-D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})||p_\mathbf{\theta}(\mathbf{z}))+\frac{1}{L}\sum_{l=1}^L\mbox{log}p_\mathbf{\theta}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)})$

where $\mathbf{z}^{(i,l)}=g(\Phi,\mathbf{\epsilon}^{(i,l)},\mathbf{x}^{(i)})$ and $\mathbf{\epsilon}^{(l)}\sim p(\mathbf{\epsilon})$

A connection with auto-encoders becomes clear when looking at the objective function. The first term is the KL divergence of the approximate posterior from the prior acts as a regularizer, while the second term is a an expected negative reconstruction error.

In the experiment, they set $p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})$ to be a Bernoulli or Gaussian MLP, depending on the type of data they are modeling. They presented the comparisons of their method to the wake-sleep algorithm and Monte Carlo EM on MNIST and Frey Face datasets.

Overall, I think their contributions are two-fold. First, the reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, they showed that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. The stochastic gradient method helps to parallelize the algorithm so as to improve the efficiency in largescale dataset.

# Binary Neurons can have Short Temporal Memory

This (belated) post is about the paper:

Randomly connected networks have short temporal memory,
Wallace, Hamid, & Latham, Neural Computation  (2013),

which I presented a few weeks ago at the Pillow lab group meeting. This paper analyzes the abilities of randomly connected networks  of binary neurons to store memories of network inputs. Network memory is  valuable quantity to bound; long memory indicates that a network is more likely to be able to perform complex operations on streaming inputs. Thus, the ability to recall past inputs provides a proxy for being able to operate on those inputs.  The overall result seems to stand in contrast to much of the literature because it predicts a very short memory (on the order of the logarithm of the number of nodes). The authors mention that this difference in the result is due to their use of more densely connected networks. There seem to be additional differences, though, between this papers’s network construction and those analyzed in related work.

# Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.
http://arxiv.org/abs/1411.2581

This paper describes a class of latent variable models inspired by deep neural net and hierarchical generative model, called Deep Exponential Families (DEFs). DEFs stack multiple layers of exponential families and connects them with certain link functions to capture the hierarchy of dependencies.

# Every Neuron is Special

A couple of weeks ago I presented

A category-free neural population supports evolving demands during decision-making

by David Raposo, Matthew Kaufman and Anne Churchland.  By “categories” they are referring to some population of cells whose responses during an experiment seem to be dominated by one or two of the experimental variables. The authors refer to these types of categories as functional categories.

# Fast Kronecker trick for Gaussian Process regression with “expressive” kernels

On May 11th, I presented the following paper in lab meeting:

Fast Kernel Learning for Multidimensional Pattern Extrapolation
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai and John P. Cunningham

This paper presents a method for scaling up structured spectral mixture (SM) kernels (from Wilson et al 2013) for Gaussian Process regression to multi-dimensional settings in which many (but not all) of the input points live on a grid (e.g., a 2D image in which some pixels are missing).  The spectral mixture is a very expressive kernel that enhances representation learning, but the problem comes in applying it to large scale data.

# Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares
Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014

In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

# Attractors, chaos, and network dynamics for short term memory

In a recent lab meeting, I presented the following paper from Larry Abbott’s group:

From fixed points to chaos: Three models of delayed discrimination.
Barak, Sussillo, Romo, Tsodyks, & Abbott,
Progress. in Neurobiology 103:214-22 2013.

The paper seeks to connect the statistical properties of neurons in pre-frontal cortex (PFC) during short-term memory with those exhibited by several dynamical models for neural population responses. In a sense, it can be considered a follow-up to Christian Machens’ beautiful 2005 Science paper [2], which showed how a simple attractor model could support behavior in a two-interval discrimination task. The problem with the Machens/Brody/Romo account (which relied on mutual inhibition between two competing populations) is that it predicts extremely stereotyped response profiles, with all neurons in each population exhibiting the same profile. Continue reading

# Dynamic mode decomposition for large-scale neuron activity

On March 23th, I presented the following paper in lab meeting:

The purpose of the paper is to describe dynamic mode decomposition (DMD), a method from applied mathematics for dimensionality reduction of dynamical systems, and adapt it to the analysis of large-scale neural recordings.

# Subunit models for characterizing responses of sensory neurons

On July 28th, I presented the following paper in lab meeting:

This paper proposes a new method for characterizing the multi-dimensional stimulus selectivity of sensory neurons. The main idea is that, instead of thinking of neurons as projecting high-dimensional stimuli into an arbitrary low-dimensional feature space (the view underlying characterization methods like STA, STC, iSTAC, GQM, MID, and all their rosy-cheeked cousins), it might be more useful / parsimonious to think of neurons as performing a projection onto convolutional subunits. That is, rather than characterizing stimulus selectivity in terms of a bank of arbitrary linear filters, it might be better to consider a subspace defined by translated copies of a single linear filter.

# Quantifying the effect of intertrial dependence on perceptual decisions

Today we discussed a paper on sequential effects in psychophysics by Fründ et al. Although inter-trial dependencies are known to exist, psychophysical responses are typically modeled as independent Bernoulli observations. One way to account for the effect of previous trial outcomes is to use logistic regression with terms that represent previous stimuli or responses (Busse et al). Fründ and colleagues extend this approach by including a lapse rate which captures responses where the subject was not doing the task.

# Partitioning Neural Variability

On July 7, we discussed Partitioning Neural Variability by Gorris et al.  In this paper, the authors seek to isolate the portion of the variability of sensory neurons that comes from non-sensory sources such as arousal or attention.  In order to partition the variability in a principled way, the authors propose a “modulated Poisson framework” for spiking neurons, in which a neuron produces spikes according to a Poisson process whose mean rate is the product of a stimulus-driven component $f(S)$ , and a stimulus-independent ‘gain’ term (G).

# Noise correlation in V1: 1D-dynamics explains differences between anesthetized and awake

We discussed state dependence of noise correlations in macaque primary visual cortex [1] today. Noise correlation quantifies the covariability in spike counts between neurons (it’s called noise correlation because the signal (stimulus) drive component has been subtracted out). In a 2010 science paper [2], noise correlation was shown to be much smaller than previously reported; in the range of 0.01 compared to the usual 0.1-0.2 range and stirred up the field (see [3] for a list of values). In this paper, they argue that this difference in noise correlation magnitude is due to population level covariations during anesthesia (they used sufentanil).

# Multiple Object Tracking

On June 23rd, we discussed “Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model” by Edward Vul, Michael C Frank, George Alvarez, and Joshua B Tenenbaum.

This paper proposes a dynamic probabilistic model as an ideal observer for multiple object tracking.  The ideal observer model uses a set of n Kalman filters to track n objects.  Observations at each time step are composed of n pairs of position and velocity values.  However, inference must be performed in order to associate each observation pair with the appropriate object identity (Kalman filter).

# Evaluating point-process likelihoods

We recently discussed two recent papers proposing improvements to the commonly used discrete approximation of the log-likelihood for a point process (Paninski, 2004). The likelihood is written as
$ll(T|t_{1,..,N(T)}) = \sum_{i = 1}^{N(T)} \log( \lambda(t_i|\mathcal{H}_{t_i})) - \int_0^T \lambda(t|\mathcal{H}_{t}) dt$
where $t_i$ are the spike times and $\lambda(t|\mathcal{H}_t)$ is the conditional intensity function (CIF) of the process at time $t$ given the preceding spikes. Typically, the integral in this equation cannot be evaluated in closed form. The standard approximation computes the function by binning along a regular lattice with bins size $\delta$
$ll(T|t_{1,..,N(T)}) \approx \sum_{i = 1}^{l} \Delta N_i \log(\lambda_i \delta) - \lambda_i\delta$
where $\Delta N_i$ is the number of spikes in the $i$th bin. Both papers demonstrate that smarter approximations to the integral are better for point-process statistics than naïvely binning spike train data.