Automatic Differentiation in Machine Learning: a Survey


In lab meeting this week we discussed Automatic Differentiation in Machine Learning: a Survey, which addresses the general technique of automatic differentiation (autodiff) for functions expressed in computer programs.

The paper initially outlines three main approaches to the computation of derivatives in computer programs: (1) manually working out derivatives by hand and coding the result; (2) numerically approximating derivatives by evaluating the function at small discrete steps; and (3) symbolic differentiation, in which software manipulates mathematical expressions to generate derivative expressions via standard calculus rules. The limitation of approach 1 is that manual derivation is time-consuming and lends itself to easy algebraic mistakes. The disadvantage of approach 2 is that a finite difference approach to numerical differentiation inevitably involves round-off and truncation errors and is sensitive to the choice of step size. The disadvantage of approach 3 is that symbolic derivatives can blow up into expressions with a very large number of terms (expression swell), making the approach computationally cumbersome, and the resulting expressions often still need to be manually simplified.

In contrast to these methods, the paper introduces autodiff as a set of techniques that operates at the level of elementary computations via step-wise application of the chain rule. The evaluation of a function in a computer program is an execution of elementary procedures that can be represented as a computational graph, where each node in the graph is an intermediate variable. Computers execute this trace of elementary operations through intermediate variables, and the same machinery can be exploited to concurrently and efficiently calculate a derivative.

The paper outlines two general approaches to autodiff: forward mode and reverse mode (also called backward mode). In forward mode, a derivative trace is evaluated alongside the computer's execution of the function trace, with intermediate values from the function evaluation used in concert with the derivative calculation. This execution can be conveniently realized by extending the representation of the real numbers to include a dual component. Analogous to imaginary numbers, this dual component is denoted with an \epsilon, where \epsilon^2 = 0. That is, each number in the execution of the function is extended to a tuple that carries this additional dual component, which is evaluated using the intermediate variables alongside the evaluation of the function. Using the simple rule of the dual number (\epsilon^2 = 0), these evaluations implement the chain rule and automatically calculate derivatives at every computational step along the evaluation trace.
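
To make this concrete, here is a minimal dual-number sketch in Python (my own illustration, not code from the paper): each value carries a `dot` slot for its derivative, and the product rule falls out of \epsilon^2 = 0.

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 = 0; `dot` carries the derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 = 0
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(x):
    # each elementary operation propagates the derivative of its argument
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

x = Dual(2.0, 1.0)          # seed the derivative trace with dx/dx = 1
y = x * sin(x) + x
print(y.val, y.dot)         # f(2) and f'(2) = sin(2) + 2 cos(2) + 1
```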

In executing forward-mode autodiff, the derivative trace is initially 'seeded' with an input derivative vector as the function is evaluated at an initial input. Whereas symbolic differentiation can be thought of as the mapping of f(x) to J_{f}(x), forward-mode autodiff computes the Jacobian-vector product J_{f}(c) \cdot \vec{r} for a seed vector \vec{r}. As such, for a function mapping from dimension n to m, forward mode requires n trace evaluations to calculate the entire Jacobian, each time seeding the derivative vector with the standard basis vector for a particular dimension of the n inputs (extracting one column of the Jacobian per pass). Thus, the full Jacobian can be computed in approximately n times the cost of one evaluation of f(c).

Reverse mode, by contrast, computes the vector-Jacobian product \vec{y}^{\,T} J_{f}(c). That is, one evaluation of the derivative trace in reverse mode extracts a row of the Jacobian (as opposed to a column as in forward mode). In this case, the full Jacobian can be computed in approximately m times the cost of one evaluation of f(c). A disadvantage of reverse-mode autodiff is that each intermediate variable along the function evaluation trace must be stored in memory for use during the backward execution of the derivative trace.
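
For comparison, here is a minimal reverse-mode sketch (again my own illustration): the forward pass records every intermediate variable together with its local derivatives, which is exactly the memory cost noted above, and a reverse sweep accumulates adjoints via the chain rule.

```python
class Var:
    """Node in the evaluation trace; parents hold (node, local derivative)."""
    def __init__(self, val, parents=()):
        self.val, self.parents, self.grad = val, parents, 0.0

    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.val * other.val, [(self, other.val), (other, self.val)])

def backward(out):
    # topologically order the recorded trace, then sweep it in reverse,
    # accumulating adjoints via the chain rule
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(out)
    out.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Var(2.0), Var(3.0)
z = x * y + x               # z = xy + x
backward(z)
print(x.grad, y.grad)       # dz/dx = y + 1 = 4, dz/dy = x = 2
```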


Stochastic variational learning in recurrent spiking networks

This week we discussed Stochastic variational learning in recurrent spiking networks by Danilo Rezende and Wolfram Gerstner.

Introduction

This paper brings together variational inference (VI) and biophysical networks of spiking neurons. The authors show:

  1. variational learning can be implemented by networks of spiking neurons to learn generative models of data,
  2. learning takes the form of a biologically plausible learning rule, where local synaptic learning signals are augmented with a global “novelty” signal.

One potential application the authors mention is to use this method to identify functional networks from experimental data. Through the course of the paper, some bedrock calculations relevant to computational neuroscience and variational inference are performed. These include computing the log likelihood of a population of spiking neurons with Poisson noise (including deriving the continuum limit from discrete time) and derivation of the score function estimator. I’ve filled in some of the gaps in these derivations in this blog post (plus some helpful references I consulted) for anyone seeing this stuff for the first time.

Neuron model and data log likelihood

The neuron model used in this paper is the spike response model, which the authors note (and we discussed at length) is basically a GLM. The membrane potential of each unit in the network is described by the following equation:

\mathbf{u}(t) = \mathbf{w}\mathbf{\phi}(t) + \mathbf{\eta}(t)

where \mathbf{u} is an N-dimensional vector of membrane potentials, \mathbf{\phi}(t) contains the exponentially filtered synaptic currents from the other neurons, \mathbf{w} is an N \times N matrix of connections, and \mathbf{\eta}(t) is an adaptation potential that mediates the voltage reset when a neuron spikes (it can be thought of as an autapse).

[Figure 1 from the paper]

Spikes are generated by defining an instantaneous firing rate \rho(t) = \rho_0 \text{exp}[\frac{\mathbf{u} - \theta}{\Delta u}], where \theta, \Delta u and \rho_0 are physical constants. The history of all spikes from all neurons is denoted by \mathbf{X}. We can define the probability of the i^{th} neuron producing a spike in the interval [t,t+\Delta t], conditioned on the past activity of the entire network \mathbf{X}(0...t), as P_i(t_i^f \in [t,t+\Delta t] | \mathbf{X}(0...t)) \approx \rho_i(t) \Delta t, and the probability of not producing a spike as P_i(t_i^f \notin [t,t+\Delta t] | \mathbf{X}(0...t)) \approx 1 - \rho_i(t) \Delta t.

Aside: in future sections, the activity of some neurons will be observed (visible) and denoted by a super- or subscript \mathcal{V} and the activity of other neurons will be hidden and similarly denoted by \mathcal{H}. 

We can define the joint probability of the entire set of spikes as: 

P(X(0...T)) \approx \Pi_{i \in \mathcal{V} \cup \mathcal{H}} \Pi_{k_i^s} [\rho_i(t^f_{k_i^s})\Delta t] \Pi_{k_i^{ns}} [1 - \rho_i(t^f_{k_i^{ns}})\Delta t]

The authors re-express this in the continuum limit. A detailed explanation of how to do this can be found in Dayan and Abbott, Chapter 1, Appendix C. The key is to expand the log of the "no spike" term in a Taylor series, truncate at the first term, and then exponentiate it:

\text{exp} \hspace{1mm} \text{log} (\Pi_{k_i^{ns}} [1 - \rho_i(t^f_{k_i^{ns}})\Delta t]) = \text{exp} \sum_{k_i^{ns}} \text{log} [1 - \Delta t \rho_i(t_{k_i^{ns}}^f)] \approx \text{exp} \sum_{k_i^{ns}} - \Delta t \rho_i(t_{k_i^{ns}}^f).

As \Delta t \rightarrow 0, this approximation becomes exact and the sum across bins with no spikes becomes an integral across time, 

P(\mathbf{X}(0...T)) = \Pi_{i \in \mathcal{V} \cup \mathcal{H}} [\Pi_{t_i^f}\rho_i(t_i^f) \Delta t] \text{exp}(- \int_0^T dt \rho_i(t))

from which we can compute the log likelihood,

\text{log} P(\mathbf{X}(0...T)) = \sum_{i \in \mathcal{V} \cup \mathcal{H}} \int_0^T d\tau [\text{log} \rho_i(\tau) \mathbf{X}_i(\tau) - \rho_i(\tau)]

where we use \mathbf{X}_i(\tau) to denote the spike train of neuron i (a sum of delta functions at its spike times).

They note that this equation is not a sum of independent terms, since \rho depends on the entire past activity of all the other neurons.
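
Still, given the rates, the discretized log likelihood is straightforward to compute. Here is a toy numpy sanity check (my own sketch, with a constant made-up rate): each spike contributes \text{log} \rho_i at its bin, and the integral of the rate becomes a sum over bins.

```python
import numpy as np

def spike_train_loglik(X, rho, dt):
    """log P ~= sum_i sum_t [ X[i,t] * log rho[i,t] - rho[i,t] * dt ].
    X: (N, T) binary spike indicators; rho: (N, T) rates in Hz; dt: bin (s)."""
    return np.sum(X * np.log(rho) - rho * dt)

# toy check: 2 neurons, constant (made-up) 5 Hz rate, 1 s of 1 ms bins
rng = np.random.default_rng(0)
dt, rho = 1e-3, np.full((2, 1000), 5.0)
X = (rng.random((2, 1000)) < rho * dt).astype(float)
print(spike_train_loglik(X, rho, dt))
```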

Figure2

Figure 2 from the paper shows the relevant network structures we will focus on. Panel C shows the intra- and inter-network connectivity between and among the hidden and visible units. Panel D illustrates the connectivity for the "inference" \mathcal{Q} network and the "generative" \mathcal{M} network. This structure is similar to the Helmholtz machine of Dayan (2000), and the learning algorithm will be very close to the wake-sleep algorithm used there.

Variational Inference with stochastic gradients

From here, they follow a pretty straightforward application of VI. I will pepper my post with terms we've used/seen in the past to make these connections as clear as possible. They construct a recurrent network of spiking neurons where the spiking data of a subset of the neurons (the visible neurons, or "the observed data") can be explained by the activity of a disjoint subset of unobserved neurons (a "latent variable", à la the VAE). Like standard VI, they want to approximate the posterior distribution over the spiking patterns of the hidden neurons (as one would approximate the posterior over a latent variable in a VAE) by minimizing the KL divergence between the true posterior and an approximate posterior q:

KL(q;p) = \int \mathcal{D} \mathcal{X_H} q(\mathcal{X_H} | \mathcal{X_V}) \text{log} \frac{q(\mathcal{X_H} | \mathcal{X_V})}{p(\mathcal{X_H} | \mathcal{X_V})}

= \langle \text{log} q(\mathcal{X_H} | \mathcal{X_V}) - \text{log} p(\mathcal{X_H,X_V}) \rangle_{q(\mathcal{X_H} | \mathcal{X_V})} + \text{log} p(\mathcal{X_V})

= \langle \mathcal{L^Q} - \mathcal{L^M} \rangle_{q(\mathcal{X_H} | \mathcal{X_V})} + \text{log} p(\mathcal{X_V}).

The second term is the data log likelihood. The first term, \mathcal{F}, is the Helmholtz free energy, and as always in VI it represents an upper bound on the negative log likelihood. We can therefore change our optimization problem to minimizing this bound with respect to the parameters of q (the approximate posterior) and p (the generative model). We do this by computing the gradients of \mathcal{F} with respect to the \mathcal{Q} (inference) network and the \mathcal{M} (generative) network. First, the \mathcal{M} network, since it's easier:

\dot{w_{ij}^\mathcal{M}} = -\mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \mathcal{F} = -\mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \langle \mathcal{L^Q} - \mathcal{L^M} \rangle_q = \mu^\mathcal{M} \langle \nabla_{w_{ij}^\mathcal{M}} \mathcal{L^M} \rangle_q \approx \mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \hat{\mathcal{L}}^\mathcal{M}

where \hat{\mathcal{L}}^\mathcal{M} is a point estimate of the complete data log likelihood of the generative model. They will compute this with a Monte Carlo estimate. The gradient of the complete data log likelihood with respect to the connections is:

\nabla_{w_{ij}^\mathcal{M}} \hat{\mathcal{L}}^\mathcal{M} = \nabla_{w_{ij}^\mathcal{M}} \text{log} p(\mathcal{X_H, X_V}) = \sum_{k \in \mathcal{V} \cup \mathcal{H}} \int_0^T d\tau \frac{\partial \text{log} \rho_k(\tau)}{\partial w_{ij}^\mathcal{M}} [\mathbf{X}_k(\tau) - \rho_k(\tau)].

Here they used the handy identity: \frac{\partial f(x)}{\partial x} = \frac{\partial[\text{log} f(x)]}{\partial x} f(x).

The derivative of the firing rate function can be computed with the chain rule, \frac{\partial [\text{log} \rho_k(\tau)]}{\partial w_{ij}^\mathcal{M}} = \delta_{ki} \frac{g^\prime(u_k(\tau))}{g(u_k(\tau))} \phi_j(\tau).

This equation for updating the weights using gradient ascent is purely local, taking the form of a product between a presynaptic component, \phi_j(\tau), and a postsynaptic term \frac{g^\prime(u_k(\tau))}{g(u_k(\tau))} [\mathbf{X}_k(\tau) - \rho_k(\tau)].
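
For the exponential nonlinearity above, g^\prime(u)/g(u) = 1/\Delta u, so the update is especially simple. A toy numpy illustration of the local rule (my own sketch, with made-up constants):

```python
import numpy as np

# With g(u) = rho_0 exp((u - theta)/du), we have g'(u)/g(u) = 1/du, so the
# update is (presynaptic input) x (spikes minus rate)/du, integrated over time.
rng = np.random.default_rng(0)
T, dt, du = 1000, 1e-3, 0.5
phi_j = rng.normal(size=T)                 # presynaptic filtered input
u_k = 0.3 * phi_j                          # toy membrane potential of neuron k
rho_k = 5.0 * np.exp(u_k / du)             # firing rate (theta folded into rho_0)
X_k = (rng.random(T) < rho_k * dt) / dt    # spikes as delta-like samples
dw_ij = np.sum((X_k - rho_k) / du * phi_j) * dt   # local gradient-ascent update
print(dw_ij)   # near zero on average when the spikes match the model
```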

They also compute the gradient of \mathcal{F} with respect to the \mathcal{Q} network, -\mu^\mathcal{Q} \nabla_{w_{ij}^\mathcal{Q}} \mathcal{F}. To do this, I revisited the 2014 Kingma and Welling paper, where I think they were particularly clear about how to compute gradients of expectations (i.e. the score function estimator). In section 2.2 they note that:

\nabla_\phi \langle f(z) \rangle_{q_\phi (z)}= \langle f(z) \nabla_\phi \text{log} q_\phi (z) \rangle .

A cute proof of this can be found here. This comes in handy when computing the gradient of \mathcal{F} with respect to the connections of the \mathcal{Q} network:

\nabla_{w_{ij}^\mathcal{Q}} \mathcal{F} = \langle \mathcal{F} \nabla_{w_{ij}^\mathcal{Q}} \text{log} q(\mathcal{X_H} | \mathcal{X_V}) \rangle = \langle \mathcal{F} \nabla_{w_{ij}^\mathcal{Q}} \mathcal{L^Q} \rangle \approx \hat{\mathcal{F}} \nabla_{w_{ij}^\mathcal{Q}} \hat{\mathcal{L}}^\mathcal{Q}.

Here again we compute Monte Carlo estimators of \mathcal{F} and \mathcal{L^Q}. \nabla_{w_{ij}^\mathcal{Q}} \hat{\mathcal{L}}^\mathcal{Q} takes exactly the same form as for the \mathcal{M} network, but the neat thing is that \nabla_{w_{ij}^\mathcal{Q}} \mathcal{F} contains a term in front of the gradient of the log-likelihood estimate: \hat{\mathcal{F}}. This is a global signal (as opposed to the local signals present in \hat{\mathcal{L}}^\mathcal{Q}) that they interpret as a novelty or surprise signal.
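
A quick numerical check of the identity above (my own sketch): for q_\phi = \mathcal{N}(\mu, 1) and f(z) = z^2, the exact gradient of \langle f \rangle with respect to \mu is 2\mu, and the score function estimator recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 200_000

# E_q[f(z)] with q = N(mu, 1) and f(z) = z^2 has gradient d/dmu = 2*mu
z = rng.normal(mu, 1.0, size=n)
f = z**2
score = z - mu                 # d/dmu log N(z; mu, 1)
grad_est = np.mean(f * score)  # score function (REINFORCE) estimator
print(grad_est, 2 * mu)        # ~3.0 vs the exact value 3.0
```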

Reducing gradient estimation variance

The authors note that the stochastic gradient they introduced has been used extensively in reinforcement learning, and that its variance is prohibitively high. To deal with this (presumably following baseline-removal approaches developed in RL, or vice versa), they adopt a simple baseline-subtraction scheme. They subtract from the current value \hat{\mathcal{F}}(T) the mean \bar{\mathcal{F}} of the free energy estimate, calculated as a moving average across several previous batches of length T. That is, they replace the free energy in the gradient for the \mathcal{Q} network with a free energy error signal, \hat{\mathcal{F}}(T) - \bar{\mathcal{F}}. Below, the log likelihood of the generated data when this procedure is used is plotted against that of a naively trained network, showing that this procedure works better than the naive rule.

[Figure 5a from the paper]
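
As a toy illustration of why the baseline helps (my own sketch, repeating the setup above and unrelated to the figure): subtracting a baseline leaves the score function estimator essentially unbiased, since \langle b \nabla \text{log} q \rangle = 0 for constant b, but can substantially shrink its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 1.5
z = rng.normal(mu, 1.0, size=200_000)
f, score = z**2, z - mu

# b is a batch mean, mimicking the moving average F-bar (strictly, an
# estimated baseline adds a tiny bias; a constant one adds none)
b = f.mean()
print(np.var(f * score), np.var((f - b) * score))  # raw vs baseline-subtracted
```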

Numerical results

Details of their numerical simulations:

  • Training data is binary arrays of spike data.
  • Training data comes in batches of 200 ms with 500 batches sequentially shown to the network (100 s of data).
  • During learning, visible neurons are forced to spike like the training data.
  • Log likelihood of test data was estimated with importance sampling. Given a generative model with density p(x_v,x_h), importance sampling allows us to estimate the density p(x_v):

p(x_v) = \langle p(x_v | x_h) \rangle_{p(x_h)} = \langle p(x_v | x_h) \frac{p(x_h)}{q(x_h|x_v)} \rangle_{q(x_h|x_v)}

= \langle \text{exp}[\text{log}p(x_v,x_h) - \text{log} q(x_h|x_v)] \rangle_q = \langle \text{exp}[-\hat{\mathcal{F}}(x_v,x_h)]\rangle_{q(x_h|x_v)}

Using this equation, they estimate the log likelihood of the observed spike trains by sampling several times from the \mathcal{Q} network and averaging the exponentiated negative free energy. They use 500 samples of duration 100 from the \mathcal{Q} network to compute this estimate.
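
Numerically, this estimate is just a log-mean-exp of negative free energies over samples from the \mathcal{Q} network. A small sketch (mine, with made-up free-energy values):

```python
import numpy as np
from scipy.special import logsumexp

def loglik_importance(F_hat):
    """log p(x_v) ~= log (1/S) sum_s exp(-F_hat_s), where F_hat_s is the free
    energy evaluated at a sample x_h^s ~ q(x_h | x_v). logsumexp keeps the
    average numerically stable when the free energies are large."""
    F_hat = np.asarray(F_hat)
    return logsumexp(-F_hat) - np.log(len(F_hat))

# e.g., 500 (made-up) free-energy values, one per sample from the Q network
F_samples = np.random.default_rng(0).normal(50.0, 2.0, size=500)
print(loglik_importance(F_samples))
```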

Here is an example of training with this method, with 50 hidden units, on the "stairs" dataset. Panel C shows that during the "sleep phase" (running in generative mode) the network forms a latent representation of the stairs in the hidden layers. Running the network in "inference mode" (wake, in the wake-sleep parlance), when the \mathcal{Q} network synapses are being used, the model is capable of performing inference on the causes of the incoming data (the visible neurons are being driven with the data).

[Figure 3 from the paper]

Role of the novelty signal

To examine the role of the novelty signal, they train a network to perform a maze task. Each maze contains 16 rooms, where each room is a 28×28-pixel greyscale image of an MNIST digit. Each room is only accessible from a neighboring room. Pixel values were converted into firing rates from 0.01 to 9 Hz. In the test maze (or control maze), some of the rooms of the training maze were changed. The network had 28×28 visible units and 30 hidden units, all recurrent binary units. Data were generated from random trajectories of 100 time steps in the target maze. Each learning epoch consisted of 500 presentations of the data batches.

Below (bottom left), they plot the slow moving average of the free energy \bar{\mathcal{F}} as a function of the amount of observed data, for the target maze (blue) and for the same model when it was "teleported" to the control maze every 500 s. In the beginning of learning, the free energies are the same, so the model cannot distinguish between the two mazes. As learning proceeds, the model identifies the control maze as unfamiliar (higher free energy).

The bottom right shows the free energy error signal for the sample trajectory in panel A. It fluctuates near zero for the learned maze but deviates strongly for the test maze. At room (3,3) the free energy error signal jumps up, meaning the model identifies this room as different from the target maze.

To conclude, the authors speculate that a neural correlate of this free energy error signal should look like an activity burst when an animal encounters unexpected situations. They also expect a substantial increase in the variance of the changes in synaptic weights when moving from a learned to an unfamiliar maze, due to the change in the baseline level of surprise.

[Figure 6 from the paper]

Deep kernel with recurrent structure

This week in lab meeting, we discussed the following work:

Learning Scalable Deep Kernels with Recurrent Structure
Al-Shedivat, Wilson, Saatchi, Hu, & Xing, JMLR 2017

This paper addressed the problem of learning a regression function that maps sequences to real-valued target vectors. Formally, the input sequences are vectors of measurements \overline{\mathbf{x}_1}=[\mathbf{x}^1], \overline{\mathbf{x}_2}=[\mathbf{x}^1,\mathbf{x}^2], …, \overline{\mathbf{x}_n}=[\mathbf{x}^1,\mathbf{x}^2,\dots,\mathbf{x}^n], where \mathbf{x}^i\in\mathcal{X}; the sequences are indexed by time and are of growing length. Let \mathbf{y}=\{\mathbf{y}_i\}_{i=1}^n, \mathbf{y}_i\in\mathbb{R}^d, be a collection of the corresponding real-valued target vectors. Assuming that only the most recent L steps of a sequence are predictive of the targets, the sequences can be written as \overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i], i=1,...,n. The goal is to learn a function f:\mathcal{X}^L\rightarrow\mathbb{R}^d based on the available data, and the approach taken here is to use a Gaussian process.

The Gaussian process (GP) is a Bayesian nonparametric model that generalizes the Gaussian distribution to functions. The authors assume the mapping function f is sampled from a GP; a standard GP regression is then performed to learn f as well as the posterior predictive distribution over \mathbf{y}. The inputs to the GP are \{\overline{\mathbf{x}_i}\}_{i=1}^n and the output function values are \{{\mathbf{y}_i}\}_{i=1}^n. However, standard kernels, e.g. the RBF, Matern, or periodic kernel, cannot take input variables of varying length, nor do they fully exploit the structure in an input sequence. Therefore, this paper proposed to use a recurrent model to convert an input sequence \overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1}, \mathbf{x}^{i-L+2},\dots,\mathbf{x}^i] to a latent representation \mathbf{h}^L_i\in\mathcal{H}, and then map \mathbf{h}_i^L to \mathbf{y}_i. Formally, a recurrent model expresses the mapping f:\mathcal{X}^L\rightarrow \mathbb{R}^d as:

\mathbf{y}_i=\psi(\mathbf{h}_i^L)+\epsilon^t, \quad \mathbf{h}_i^t=\phi(\mathbf{h}^{t-1}_i,\mathbf{x}^{i-L+t})+\delta^t, \quad t=1,...,L


Combining the recurrent model with the GP, the idea of a deep kernel with recurrent structure is to let \phi:\mathcal{X}^L\rightarrow\mathcal{H} be a deterministic transformation (the recurrent model) mapping \{\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]\}_{i=1}^n to \{\mathbf{h}^L_i\}_{i=1}^n. The sequential structure in \{\overline{\mathbf{x}_i}\}_{i=1}^n is thereby integrated into the corresponding latent representations. Then a standard GP regression can be performed with inputs \{\mathbf{h}^L_i\}_{i=1}^n and outputs \{\mathbf{y}_i\}_{i=1}^n.

If we let \tilde{k}:(\mathcal{X}^L)^2\rightarrow \mathbb{R} denote the kernel over the sequence space, and k:\mathcal{H}^2\rightarrow \mathbb{R} the kernel defined on the latent space \mathcal{H}, then

\tilde{k}(\overline{\mathbf{x}},\overline{\mathbf{x}}')=k(\phi(\overline{\mathbf{x}}),\phi(\overline{\mathbf{x}}'))=k(\mathbf{h}^L,{\mathbf{h}^L}')

where k(\mathbf{h}^L,{\mathbf{h}^L}') denotes the covariance between \mathbf{y} and \mathbf{y}' in the GP.
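
This composition is simple to sketch in code. Below is a toy version (mine, not the paper's implementation: the paper uses an LSTM encoder with learned parameters, while this uses a random vanilla tanh RNN and an RBF base kernel):

```python
import numpy as np

def rnn_encode(x_seq, W_h, W_x):
    """Toy deterministic recurrent map phi: sequence -> last hidden state h^L."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:                     # x_seq: (L, input_dim)
        h = np.tanh(W_h @ h + W_x @ x_t)
    return h

def rbf(h1, h2, lengthscale=1.0):
    return np.exp(-np.sum((h1 - h2) ** 2) / (2 * lengthscale ** 2))

def recurrent_kernel(seq1, seq2, W_h, W_x):
    """k_tilde(x, x') = k(phi(x), phi(x')): base kernel on latent codes."""
    return rbf(rnn_encode(seq1, W_h, W_x), rnn_encode(seq2, W_h, W_x))

rng = np.random.default_rng(0)
W_h, W_x = rng.normal(0, 0.5, (8, 8)), rng.normal(0, 0.5, (8, 3))
s1, s2 = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
print(recurrent_kernel(s1, s2, W_h, W_x))
```

Swapping the toy RNN for an LSTM and learning its weights jointly with the kernel hyperparameters recovers the GP-LSTM idea.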

[Figure 2 from the paper]

By now, the deep kernel with general recurrent structure has been constructed. As for the recurrent model, there are multiple choices. Recurrent neural networks (RNNs) model recurrent processes by using linear parametric maps followed by nonlinear activations. A major disadvantage of vanilla RNNs is that their training is nontrivial due to the so-called vanishing gradient problem: the error back-propagated through t time steps diminishes exponentially, which makes learning long-term relationships nearly impossible. Thus, in this paper, the authors chose the long short-term memory (LSTM) mechanism as the recurrent model. The LSTM places a memory cell into each hidden unit and uses differentiable gating variables. The gating mechanism not only improves the flow of errors through time, but also allows the network to decide whether to keep, erase, or overwrite certain memorized information based on the forward flow of inputs and the backward flow of errors. This mechanism adds stability to the network's memory. Their final model is therefore a combination of GP and LSTM, named GP-LSTM.

To fit the model, they need to infer two sets of parameters: the base kernel hyperparameters, \theta, and the parameters of the recurrent neural transformation, denoted W. In order to update \theta, the algorithm needs to invert the entire kernel matrix K, which requires the full dataset. However, once the kernel matrix K is fixed, the update of W turns into a weighted sum of independent functions of each data point. One can therefore compute a stochastic update for W on a mini-batch of training points by using only the corresponding sub-matrix of K. Hence, they propose to optimize GPs with recurrent kernels in a semi-stochastic fashion, alternating between updating the kernel hyperparameters, \theta, on the full data, and updating the weights of the recurrent network, W, using stochastic steps. Moreover, they observe that when the stochastic updates of W are small enough, K does not change much between mini-batches; hence they can perform multiple stochastic steps for W before re-computing the kernel matrix, K, and still converge. The final version of the optimization algorithm is given as follows:

[Algorithm figure from the paper: semi-stochastic alternating optimization]

To evaluate the model, they applied it to a number of tasks, including system identification, energy forecasting, and self-driving applications. Quantitatively, the model was assessed on data ranging in size from hundreds of points to almost a million, with various signal-to-noise ratios, demonstrating state-of-the-art performance and linear scaling of their approach. Qualitatively, the model was tested on consequential self-driving applications: lane estimation and lead vehicle position prediction. Overall, I think this paper achieves state-of-the-art performance on consequential applications involving sequential data, following a straightforward and scalable approach to building highly flexible Gaussian processes.


Completing the Covariance: Deterministic matrix completion for symmetric positive semi-definite matrices

This week in J. Pillz Happy Hour we discussed the following work:

Deterministic Symmetric Positive Semi-definite Matrix Completion
William E. Bishop and Byron M. Yu, NIPS, 2014

This paper took on the interesting problem of low-rank matrix completion with the twists that (1) the observed entries of the matrix A are chosen in a deterministic manner and (2) those entries consist of the union of principal sub-matrices A_l for l = 1,2,\dots,K. The large body of prior work in this field mostly relied on random sampling, i.e. we get to observe some set of M entries Y_{i,j} = A_{i,j} + \epsilon of A chosen uniformly at random from the set of all entries (see, e.g., Candes & Recht, Candes & Plan, etc.). These initial guarantees had the nice property that recovery occurred with high probability, meaning that despite the probabilistic nature of the observed entry locations, solving a convex optimization program was virtually guaranteed to recover the original matrix up to a bounded recovery error. Additionally, this recovery error bound correctly deduced the linear dependence on the error in the observations, \epsilon. The framework was also extendable to cases where M linear combinations of the entries of A were observed, significantly expanding the set of problems the theory applies to.

For the case considered by Bishop & Yu, the random sampling framework considered previously was not useful, as there was much more structure in the problem being considered. Taking the motivating example of recovering a covariance matrix, the authors assume that instead of covariances between pairs of variables being observed individually, they instead assume that the covariance matrix between subsets of variables are, in turn, fully observed. They assume that there is no randomness in which variables are observed and instead seek to devise (1) conditions on when the full covariance is recoverable from the observed subset covariances, and (2) an algorithm that take the subset covariances and recover the full covariance. More generally, the authors phrase this as the recovery of symmetric positive semi-definite matrices from principal sub-blocks.

While the conditions are presented first in the paper, the algorithm actually motivates why those specific conditions are needed, so I will describe it first. The algorithm devised is simple and elegant, and relies on the fact that an SPSD matrix can be decomposed as A = CC^T. Similarly, any principal sub-matrix A_l of the rank-r matrix A can be decomposed as A_l = C_lC_l^T. Additionally, each C_l matrix can, under a suitable rotation, be made to match the matrix C over the rows spanned by the sub-matrix A_l. These observations lead to the following algorithm: for each block A_l, compute the eigenvalue decomposition A_l = \Sigma_l\Lambda_l\Sigma^T_l and set \widehat{C}_l = \Sigma_l\Lambda_l^{1/2}. These estimates of the C_l sub-matrices will not match on their overlaps, so they must be rotated into agreement. One sub-matrix is chosen as the seed, and all other matrices are, in turn, rotated to match the current set values of \widehat{C} on the overlap. The ordering of which matrix to rotate next is chosen to maximize, at each step, the overlap with all sub-matrices that came before it. The final estimate of A is then just the outer product of the estimate of C: \widehat{A} = \widehat{C}\widehat{C}^T.
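
Here is a compact numpy sketch of the noiseless version of this procedure (my own reconstruction, not the authors' code): it assumes the processing order of the blocks is given (the paper picks the order to maximize overlap) and aligns overlaps with an orthogonal Procrustes rotation, which also handles sign flips and reflections.

```python
import numpy as np

def complete_spsd(blocks, idx_sets, n, r):
    """Complete a rank-r SPSD matrix from principal sub-matrices.
    blocks: list of principal sub-matrices; idx_sets: their row indices.
    Assumes each block's overlap with previously filled rows has rank r."""
    C = np.zeros((n, r))
    filled = set()
    for A_l, idx in zip(blocks, idx_sets):
        lam, V = np.linalg.eigh(A_l)
        C_l = V[:, -r:] * np.sqrt(np.clip(lam[-r:], 0, None))  # A_l = C_l C_l^T
        overlap = [i for i, j in enumerate(idx) if j in filled]
        if overlap:
            rows = [idx[i] for i in overlap]
            # orthogonal Procrustes: rotate C_l to match C on the shared rows
            U, _, Vt = np.linalg.svd(C_l[overlap].T @ C[rows])
            C_l = C_l @ (U @ Vt)
        C[idx] = C_l
        filled.update(idx)
    return C @ C.T

# toy example: rank-2, 6x6, two overlapping 4x4 principal blocks
rng = np.random.default_rng(0)
Ct = rng.normal(size=(6, 2)); A = Ct @ Ct.T
i1, i2 = [0, 1, 2, 3], [2, 3, 4, 5]
A_hat = complete_spsd([A[np.ix_(i1, i1)], A[np.ix_(i2, i2)]], [i1, i2], 6, 2)
print(np.max(np.abs(A - A_hat)))   # ~1e-12: exact recovery without noise
```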

The conditions for this algorithm to work boil down to requiring that (1) the observed sub-matrices are, in fact, principal sub-matrices whose observed entries span all the rows of the matrix, and (2) there is "enough" overlap between the principal sub-matrices for the alignment step in the algorithm to work. While (1) is fairly obvious, (2) boils down to needing the rows shared between any sub-matrix and those that came before it (in the ordering discussed above) to have rank r: the same rank as A. When these conditions are satisfied, the authors can prove that with no noise their algorithm recovers the true matrix A exactly, and that with noise the algorithm recovers the matrix A up to an error bounded by a function ||A -\widehat{A}||_F^2 \leq K_1\sqrt{r\epsilon} + K_2\epsilon, where K_1, K_2 are fairly involved functions of the eigenvalues of A and the A_l. Interestingly, they show that without their assumptions, no algorithm can ever recover the matrix A from its principal sub-matrices.

While these results mimic the style of results in the prior literature, the assumptions in this work can be easily checked given the set of observed indices. The authors consider this a benefit when compared to the assumptions in previous matrix completion theory, which cannot be checked for a given set of measurements. Additionally, the authors point out that their assumptions guarantee recovery for all rank-r matrices, meaning there cannot be an adversarial matrix where the method fails. What the authors sacrifice for these benefits is two-fold. For one, the rank restriction on the overlap is quite prohibitive, and basically requires that the full complexity of the entire matrix be represented in each overlap. Thus, while there do exist sets of principal sub-matrices that span very few of the entries of A and satisfy the assumptions, such sets are probably very unlikely. Unfortunately, the authors' Theorem 1 specifies that this is an unavoidable aspect of sampling principal sub-matrices. The second detriment is that the proof techniques used for the noisy recovery guarantee result in an error bound that is sensitive to the specific eigen-spectrum of A, opaque in its behavior, and does not capture the true dependence on the perturbation \epsilon. The eigenvalue dependence is reminiscent of non-uniform recovery guarantees in the compressive sensing literature (e.g., block-diagonal sensing matrices; Eftekhari, et al.). It would have been nice to see some discussion of whether this dependency is a fundamental aspect of the problem or a proof artifact. Overall, however, the simplicity of the algorithm combined with the guarantees makes this a very nice paper.

Automated identification of mouse behavioral modules (using nonparametric Bayesian AR-HMM)

A few weeks ago, Lea and Anqi co-presented a paper from Robert Datta’s group for a joint lab meeting with 3 other labs:

Mapping Sub-Second Structure in Mouse Behavior
Wiltschko, Johnson, Iurilli, Peterson, Katon, Pashkovski, Abraira, Adams, & Datta. Neuron (2015).

It might not rival Newton’s apple, which led to his formulating the law of gravity, but the collapse of a lighting scaffold played a key role in the discovery that mice, like humans, have body language.

Behaviors are considered to evolve according to stereotyped forms, or modules, that are arranged in sequence to enable animals to accomplish particular goals. Recent technical advances facilitate more comprehensive characterization of the components of behavior, but so far these have mostly been applied to invertebrates. For mammals, current approaches tend to rely on human observers to specify what constitutes a meaningful behavioral module, so their performance is limited by human perception and intuition. Given the need for automated methods for identifying (and classifying) behavior in mammals, Wiltschko et al. proposed an unsupervised learning method for automatically clustering patterns of mouse behavior without human supervision. They used state-of-the-art machine-learning algorithms to systematically describe the short time-scale structure of behavior in mice and to understand how the brain might alter this structure to facilitate adaptive behaviors in response to environmental changes.

In order to investigate this, Wiltschko et al collected three-dimensional depth imaging data of freely behaving mice. They first developed a number of model-free approaches to see whether there exists block-wise structure in the data, hinting at a potential organization of behavior into stereotyped modules, while also giving insight into the time-scales at which these modules are organized. Surprisingly, block-wise structure was already apparent by visual inspection alone. A change-point analysis revealed that the mean block duration was about 350 ms, in accordance with a temporal autocorrelation analysis and also roughly matching the timescale of the blocks apparent upon visual inspection. The authors also carried out a PCA analysis to show that these segmented components are sub-second blocks of behaviors that encode recognizable actions, and thus serve as stereotyped and reused behavioral modules.

After pre-processing and dimensionality reduction of the imaging data, the authors investigated a series of models aiming to discover modular structure in mouse behavior, using the model-free analysis to inform certain choices of hyper-parameters. The pipeline for data preprocessing, dimensionality reduction, PCA component extraction, and model fitting is shown below.

[Figure: data preprocessing and model-fitting pipeline]

The final model that the authors used in the paper is the AR-HMM. As the name suggests, the AR-HMM combines an autoregressive (AR) model with a Hidden Markov Model (HMM). The basic idea is to model behavioral modules as a discrete hidden state, with Markovian state transitions between modules at each time point. Depending on the current behavior (i.e. the value of the hidden state), the mouse's pose evolves according to module-specific autoregressive dynamics. Once the behavioral module switches (e.g. the mouse stops walking and pauses), the dynamics that govern the evolution of the mouse's pose also switch. Thus, each behavioral module is associated with module-specific AR dynamics, and the model can be viewed as a dynamic mixture model. This can hence capture the smooth evolution of the mouse's pose through PC space during the same behavioral epoch, but can also account for abrupt changes in pose dynamics due to changes in behavior via the latent switching state.
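
A toy generative simulation makes this structure clear (my own sketch, with a fixed number of modules and AR order 1, whereas the paper learns both nonparametrically):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, T = 3, 2, 500                     # modules, pose PCs, time bins

# sticky transition matrix: strong self-transitions between modules
P = np.full((K, K), 0.01 / (K - 1))
np.fill_diagonal(P, 0.99)

# module-specific AR(1) dynamics: damped rotations at different speeds
def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
A = [0.99 * rot(th) for th in (0.05, 0.2, -0.1)]

z = np.zeros(T, dtype=int)              # discrete behavioral module sequence
y = np.zeros((T, D))                    # pose trajectory in PC space
y[0] = rng.normal(size=D)
for t in range(1, T):
    z[t] = rng.choice(K, p=P[z[t - 1]])                     # Markov switch
    y[t] = A[z[t]] @ y[t - 1] + 0.05 * rng.normal(size=D)   # module-specific AR
```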

The final AR-HMM model that the authors use is non-parametric, allowing them to remain agnostic about the number of behavioral modules that can be identified. This HDP AR-HMM model was previously developed in [EEMA08, EMEA09, EEMA09]. The generative model is of the form given in the paper (the equation image is omitted here), where the Matrix Normal Inverse-Wishart (MNIW) prior is shared across the different sets of model parameters. A graphical model of the AR-HMM is shown in Figure 1 (not including parameters and priors, for simplicity).
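
For reference, the sticky HDP AR-HMM from that line of work is usually written roughly as follows (my reconstruction of the standard form in [EEMA08, EMEA09], not a verbatim copy of the paper's equation):

\beta \sim \text{GEM}(\gamma), \quad \pi_k | \beta \sim \text{DP}\big(\alpha + \kappa, \tfrac{\alpha\beta + \kappa\delta_k}{\alpha+\kappa}\big), \quad z_t | z_{t-1} \sim \pi_{z_{t-1}}

\mathbf{y}_t = \sum_{l=1}^{K_0} A^{(z_t)}_l \mathbf{y}_{t-l} + \mathbf{e}_t, \quad \mathbf{e}_t \sim \mathcal{N}(0, \Sigma^{(z_t)})

with the MNIW prior placed on each module's dynamics parameters (\{A^{(k)}_l\}, \Sigma^{(k)}).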

The HDP AR-HMM still requires specifying certain hyperparameters that will likely affect important characteristics of the fitted model. Most notably, the stickiness parameter \kappa dictates how long the model will remain in a given state by affecting the probability of self-transitioning. Furthermore, the concentration parameter \alpha determines the spread of the DP and thus affects the number of motifs the model will identify. While a non-informative prior is placed over \alpha, \kappa is set using the previous change-point analysis, essentially tuning the model to the time-scale of interest. In order to determine the relevant number of time-lags in the AR component of the model, the authors placed a block ARD prior over K_0. Finally, inference in the HDP AR-HMM was carried out via Gibbs sampling.

The HDP AR-HMM is a powerful and flexible modeling framework for identifying behavioral modules in mice and relates closely to a larger body of work on switching time-series models with Markovian structure (Jump-Markov models, Switching Linear Dynamical Systems). These models have been applied to problems ranging from speech recognition [BD07] to neural data analysis [BMJGSKM11], illustrating their flexibility and expressive power.

To demonstrate the advantages of using the AR-HMM compared to previous supervised approaches, the authors first characterized baseline patterns of behavior while the mouse moved freely in a circular open field. They found that the model manages to group stereotyped behaviors (e.g., a low rear or a left turn) into the same cluster and can identify meaningful modules that are representative of normal mouse exploratory behaviors in the laboratory — all in an entirely unsupervised fashion.

Next, the authors utilized the AR-HMM model to gain insight into the nature of behavioral changes that may arise due to changes in the environment. First, they changed the circular field to a square arena and discovered that a small number of behavioral modules are uniquely expressed in only the circular or the square arena, while other modules are expressed more or less frequently in one or the other arena. Second, the authors delivered an aversive fox odor to a quadrant of the square arena, which is known to lead to profound changes in mouse behavior characterized by odor investigation, escape, and freezing. The AR-HMM analysis revealed that the seemingly new behaviors were best described by up- or down-regulating modules that were identified previously during normal behavior, together with an altered transition structure between the individual modules. Thus, the behavioral changes associated with the aversive odor are not due to the introduction of new behavioral modules but are rather regulated via a re-arrangement of the components of normal behavior.

Finally, the authors investigated how genetic mutations and manipulations of neural activity may influence the sub-second structure of mouse behavior. They first characterized the phenotype of mice carrying a homozygous or heterozygous genetic mutation using the AR-HMM framework. While the homozygous mutation results in an abnormal waddling walk, no phenotype had previously been reported for the heterozygous mutation, in which mice have only one copy of the mutated gene. The AR-HMM correctly identifies a behavioral module unique to the homozygous mutants representing the waddling gait, but also shows that other behavioral modules are up-regulated in these mutants. Furthermore, the analysis revealed that the heterozygous mice do have a phenotype that distinguishes them from wild-type mice: they over-express the same set of modules that were up-regulated in the homozygous mutants, while failing to express the module for the waddling gait. Thus, the AR-HMM allows one to gain insight into subtle behavioral differences that were not previously identifiable by human observers. The authors next investigated the effects of manipulating neural activity via optical stimulation, using unilaterally expressed Channelrhodopsin-2 in a subset of layer 5 corticostriatal neurons in motor cortex. The AR-HMM identifies and characterizes both obvious and subtle optogenetically induced phenotypes, distinguishing "new" optogenetically induced behaviors from up-regulated expression of "old" behaviors.

Overall, we think that unsupervised probabilistic machine learning is an interesting approach to better understanding the architecture of mouse behavior. Using a generative modeling approach provides a principled framework for investigating questions relating not only to the organization of behavior at fine time scales, but also to the influence of environmental factors, individual genes or neural circuits on mouse behavior. All in all, this work may facilitate a better fundamental understanding of how the brain can build and adapt patterns of behavior and may also have interesting implications for diagnosing disease based on subtle behavioral phenotypes.

References:

[EEMA08] Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pages 312–319. ACM, 2008.

[EMEA09] Emily B. Fox, Michael I. Jordan, Erik B. Sudderth, and Alan S. Willsky. Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing Systems, pages 549–557, 2009.

[EEMA09] Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. Nonparametric Bayesian identification of jump systems with sparse dependencies. In Proc. 15th IFAC Symposium on System Identification, pages 1886–1898, 2009.

[BD07] Bertrand Mesot and David Barber. Switching linear dynamical systems for noise robust speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(6):1850–1858, 2007.

[BMJGSKM11] Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, and Maneesh Sahani. Dynamical segmentation of single trials from population neural data. In Advances in Neural Information Processing Systems, pages 756–764, 2011.


Discussion of Fox & Friends, “Bayesian Structure Learning for Stationary Time Series”


Last week in lab meeting I discussed a paper from Emily Fox’s group [TFF15], Bayesian Structure Learning for Stationary Time Series.

If you’ve read Carvalho and Scott’s paper on Bayesian inference for graphical models [CS09], then you’ll find [TFF15] a natural extension to [CS09] and reading the latter is surely the best way to get your head wrapped around the former.

From my view, [TFF15] makes two main contributions:
1) formulating the graph inference problem in terms of time series by borrowing from [CS09], and
2) formulating the problem in the Fourier domain, making it tractable.

On the way to making the problem properly specified, [TFF15] introduce a new conjugate prior on complex normal distributions defined on graphs: the hyper-complex inverse Wishart distribution.

To give you a little background, [CS09] used an algorithm (FINCS, originally developed in [CS08]) for inferring an undirected graph that defines conditional dependence between random variables when

  1. variables are Gaussian distributed
  2. observations are independent
  3. the admissible graphs are "decomposable," which means that the graph can be unambiguously described by a bunch of cliques and their intersecting nodes (called "separators").

Solving the problem in these terms provides some convenient simplifications.

The first simplification, owing to the variables being Gaussian, is a well-known result by Dempster [Demp72],

Suppose X = (X_1,\dots,X_p)^T\sim \mathcal{N}(0,\Sigma), and Z_{ij}\equiv \{\text{all } X_k, k\in (1,\dots,p) \,|\, k\neq i,j\}. Then [\Sigma^{-1}]_{ij} = 0 implies P(X_i,X_j|Z_{ij})=P(X_i|Z_{ij})P(X_j|Z_{ij}).

In other words, X_i and X_j are independent, conditioned on all other variables, when the inverse covariance matrix has a 0 in its i,j^\text{th} element. Thus, the problem of evaluating the likelihood of a graph becomes the problem of evaluating the likelihood of a covariance matrix that satisfies the corresponding pattern of 0's in \Sigma^{-1}.
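
This is easy to check numerically. A small sketch (mine): build a precision matrix with a zero in the (0,1) entry and verify that the conditional covariance of X_0 and X_1 given the remaining variable vanishes.

```python
import numpy as np

Q = np.array([[2.0, 0.0, 0.5],
              [0.0, 2.0, 0.7],
              [0.5, 0.7, 2.0]])        # Q = Sigma^{-1}, with Q[0,1] = 0
Sigma = np.linalg.inv(Q)

# conditional covariance of (X_0, X_1) given X_2:
# Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba
a, b = [0, 1], [2]
S_aa, S_ab, S_bb = Sigma[np.ix_(a, a)], Sigma[np.ix_(a, b)], Sigma[np.ix_(b, b)]
cond_cov = S_aa - S_ab @ np.linalg.inv(S_bb) @ S_ab.T
print(cond_cov[0, 1])   # ~0: conditionally independent (Gaussian case)
```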

The simplification relies upon the restriction to decomposable graphs in more ways than one. Not only does FINCS rely upon this assumption, but decomposability also provides a convenient factorization of the prior. Conditioned on the graph G, the prior decomposes in terms of cliques \mathcal{C} and separators \mathcal{S}:

p(\Sigma|G) = \frac{\prod_{C\in\mathcal{C}}p(\Sigma_C|\delta,W_C)}{\prod_{S\in\mathcal{S}}p(\Sigma_S|\delta,W_S)},

where (\delta, W) are the parameters of the hyper-inverse Wishart distribution, the conjugate prior for the covariance of a Gaussian distribution that obeys the conditional independence laws defined by the graph. This dramatically reduces the complexity, since operations need only be performed on covariance matrices the size of the cliques and separators, and not on the full dimensionality of the observations.

The snag that kept [TFF15] from using [CS09] more or less out of the box is that time series do not have independent observations, except in the trivial case (negating any simplification afforded by assumption 2 in the above list). Thus, the likelihood over time series does not factorize over observations, requiring us to contend with potentially very large covariance matrices. The key insight of [TFF15] is that moving to the Fourier domain allows them to make use of the Whittle approximation, which states that the Fourier coefficients of a stationary time series are asymptotically independent. This is a clever trick that allows [TFF15] to operate as if the observations (now indexed by frequency rather than time) were independent. The one wrinkle is that the observations are now complex-valued.
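
The Whittle intuition is also easy to check empirically. In this sketch (mine), the Fourier coefficients of independent AR(1) realizations are nearly uncorrelated across distinct frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, T, phi = 2000, 256, 0.8
x = np.zeros((n_trials, T))
for t in range(1, T):
    # many independent realizations of a stationary AR(1) process
    x[:, t] = phi * x[:, t - 1] + rng.normal(size=n_trials)

d = np.fft.rfft(x, axis=1)   # Fourier coefficients (complex), one row per trial
c = np.corrcoef(d[:, 10].real, d[:, 40].real)[0, 1]
print(c)   # ~0: coefficients at distinct frequencies are nearly uncorrelated
```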

The complex likelihood required [TFF15] to derive a new hyper-Markov conjugate prior over the covariance matrices (or rather, the spectral density matrices S), following the work of [DaLa93]. The result is a complex generalization of the hyper-inverse Wishart distribution, aptly named the "hyper-complex inverse Wishart" (the "hyper-" part denotes that the prior obeys the conditional independence laws specified by the graph).

After specifying the prior over the spectral densities, [TFF15] specify a hyperprior over the graph, which determines the number of edges k. They also go further in placing a Beta(a, b) prior over the probability of an edge. The full graphical model is schematized in the left panel below.

[Figure: graphical model before (left) and after (right) marginalization]

It turns out you can marginalize out both the edge probability r and the spectral densities of the cliques and separators, so that the final model is greatly simplified (right panel above) while still having the flexibility of a proper Bayesian formulation.

You can check out the analysis results for yourselves but I think that they look pretty awesome and make a great argument in favor of their approach.

Now, I'm not an expert on graph theory, but I think the most dubious of the assumptions that [TFF15] inherited from [CS09] is the requirement that all admissible graphs be decomposable. It's not clear to me how strong a restriction this is, and I'm curious how the algorithm behaves when the model is misspecified. When does it matter? How could this kind of bias influence scientific conclusions drawn from the inferred graphs? I'm also interested in how one might leverage these techniques to infer directed graphs.

If others have intuition about these points then I’m curious to know.

References:

[TFF15] Tank A, Foti NJ, and Fox E. Bayesian Structure Learning for Stationary Time Series. Uncertainty in Artificial Intelligence (UAI), 2015.

[CS09] C. M. Carvalho and J. G. Scott. Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96(3):497–512, 2009.

[CS08] C. M. Carvalho and J. G. Scott. Feature-inclusion stochastic search for Gaussian graphical models. J. Computational and Graphical Statistics, 17(4):790–808, 2008.

[Demp72] A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.

[DaLa93] A. P. Dawid and S. L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist., 21(3):1272–1317, 1993.

Estimating within-session changes in firing rate

In the lab meeting on March 23, I presented the following paper:

Learning In Spike Trains: Estimating Within-Session Changes In Firing Rate Using Weighted Interpolation
Greg Jensen, Fabian Munoz, Vincent P Ferrera
bioRxiv (2016).

This paper seeks to develop a method that can track non-stationary dynamics of firing rates. Specifically, it deals with two problems:
1. How to estimate the firing rate accurately from single trials (or a small number of trials)?
2. Given a (possibly inhomogeneous) session of consecutive trials, how to subdivide the session into ensembles of trials with similar features?

1. Single-trial rate estimation “ARRIS”

Their method, named ARRIS (Adaptive Rate Regression Involving Steps), uses reversible jump MCMC (RJMCMC) to detect step-like transitions of firing rates.

The RJMCMC method (Green 1995) is a variant of MCMC that dynamically adjusts the number of parameters (the dimensionality of the posterior) by adding or removing them over the course of its simulations. RJMCMC was used in a previous method called BARS (Bayesian Adaptive Regression Splines; DiMatteo et al 2001), which dynamically adds/removes knots (spline points).

In this paper's ARRIS method, RJMCMC is used to fit step functions. The authors emphasize that, unlike previous rate estimation methods, which almost always show attenuation at transition boundaries (more strongly so with smaller samples), the ARRIS method captures sharp transitions successfully even with small samples. The method also works when the transition is smooth, because the MCMC generates an ensemble of steps.

2. Weighted interpolation

On the other hand, in many natural applications the firing rate dynamics vary in a non-random way across trials. It's nice that one can estimate firing rates from single trials, but one should also be able to benefit from looking at neighboring trials, which presumably have similar firing patterns. Based on this idea, the authors consider a weighted ensemble of trials centered on the trial of interest, using a tri-cube weight function with a single parameter (the bandwidth). Afterwards, ARRIS can be applied to the weighted ensemble of spiking data. Now the problem is: what is the optimal bandwidth of the weight function? In other words, over about how many trials should one average?
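
The tri-cube weight function itself is one line; here is a small sketch (mine, with made-up trial numbers):

```python
import numpy as np

def tricube_weights(r_center, n_trials, bandwidth):
    """Tri-cube weights over trials, centered on trial r_center:
    w(u) = (1 - |u|^3)^3 for |u| <= 1, else 0, with u = (r - r_center)/bandwidth."""
    r = np.arange(n_trials)
    u = np.abs(r - r_center) / bandwidth
    return np.where(u <= 1, (1 - u**3) ** 3, 0.0)

w = tricube_weights(r_center=50, n_trials=100, bandwidth=10)
print(w[45:56].round(3))   # peaks at trial 50, falls to 0 beyond +/- 10 trials
```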

Optimal bandwidth determination:
The authors use generalized cross-validation (GCV) to find the optimal bandwidth across trials. Importantly, the across-trial bandwidth is optimized by GCV at each trial r separately. The resulting bandwidth h(r) is itself a good representation of data variability.
More specifically, to find the optimal bandwidth h(r) at a given trial r, one iterates the following: specify a fixed bandwidth h; construct a weighted ensemble of spikes with across-trial width h, centered at trial r; and calculate a generalized cross-validation score \textrm{GCV}(h,r) by running ARRIS on this weighted ensemble while leaving out one time-point at a time.

Rapid evaluation of the bandwidth:
However, there is a practical difficulty in executing the optimization procedure described above. The most obvious reason is that MCMC is already computationally intensive, and iterating over many candidate values of h may be too expensive. But more fundamentally, the single-trial estimate of the firing rate fluctuates because it depends on a stochastic algorithm (RJMCMC). This means the optimization of h(r) will not be robust, because the GCV score will turn out differently every time it is calculated.

To get around this problem, the authors suggest using a kernel density approximation, where the ARRIS estimate is replaced by a box kernel with a fixed bandwidth (*note*: this bandwidth is within-trial, not across-trial). This within-trial bandwidth is determined by a simple estimator based on the median inter-spike interval (rather than being optimized). The approximation is only used to find the optimal across-trial bandwidth; ARRIS is then used to obtain the final estimate at that bandwidth.

Overall, this is a nice paper that provides a readily applicable method for learning non-stationary firing dynamics from small samples. It makes two contributions: (1) the step-function-based ARRIS method for single-trial firing rate estimation, and (2) the weighted interpolation framework, including the determination of the optimal bandwidth h(r), which captures data variability.

Additional comments:

  • For better representation, the authors suggest that the optimal bandwidth h(r) can be smoothed once again with a "smoothing bandwidth" h_{s} (which is optimized by another round of GCV).
  • There are three different uses of the term "bandwidth" in the paper, which may be confusing: the across-trial bandwidth h(r) used for the weighted ensemble construction, which is at the core of the paper; the smoothing bandwidth that goes on top of h(r); and finally the within-trial bandwidth introduced for the kernel density approximation.