Completing the Covariance: Deterministic matrix completion for symmetric positive semi-definite matrices

This week in J. Pillz Happy Hour we discussed the following work:

Deterministic Symmetric Positive Semi-definite Matrix Completion
William E. Bishop and Byron M. Yu, NIPS, 2014

This paper took on the interesting problem of low-rank matrix completion with the twists that (1) the observed entries of the matrix A are both chosen in a deterministic manner and that (2) those entries consist of the union of principal sub-matrices A_l for = 1,2,... K. The large body of prior work in this field mostly relied on random sampling, i.e. we get to observe some set of M entries Y_{i,j} = A_{i,j} + \epsilon of A chosen uniformly at random from the set of all entries (i.e. see Candes & Recht, Candes & Plan, etc.). These initial guarantees had the nice property that the recovery guarantees occurred with high probability, meaning that despite the probabilistic nature of the observed entry locations, solving a convex optimization program was virtually guaranteed to recover the original matrix up to a bounded recovery error. Additionally, this recovery error bound correctly deduced the linear dependence on the error in the observations \epsilon.  The framework used was also expandable to cases where M linear combinations of  the entries of A were observed, significantly expanding the set of problems the theory was applicable to.

For the case considered by Bishop & Yu, the random sampling framework considered previously was not useful, as there was much more structure in the problem being considered. Taking the motivating example of recovering a covariance matrix, the authors assume that instead of covariances between pairs of variables being observed individually, they instead assume that the covariance matrix between subsets of variables are, in turn, fully observed. They assume that there is no randomness in which variables are observed and instead seek to devise (1) conditions on when the full covariance is recoverable from the observed subset covariances, and (2) an algorithm that take the subset covariances and recover the full covariance. More generally, the authors phrase this as the recovery of symmetric positive semi-definite matrices from principal sub-blocks.

While the conditions are presented first in the paper, the algorithm actually motivated the need for the specific conditions needed. The algorithm devised is simple and elegant, and relies on the fact that a SPSD matrix can be decomposed as A = CC^T. Similarly, any principal sub-matrix of the rank-$lattex r$ matrix AA_l, can be similarly decomposed as A_l = C_lC_l^T. Additionally, the C_l matrices can, under a simple rotation, match the matrix C over the indices corresponding to the rows spanned by the sub-matrix A_l. These observations lead to the following algorithm; For each block A_l, decompose A_l into its eigenvalue decomposition A_l = \Sigma_l\Lambda_l\Sigma^T_l and set \widehat{C}_l = \Sigma_l\Lambda_l^{1/2}. These estimates of the C_l sub-matrices will not match on their overlaps, so the must be rotated in order to match. One sub-matrix is chosen as the seed, and all other matrices are, in turn, rotated to match the current set values of \widehat{C} on the overlap. The ordering of which matrices to rotate when is chosen to maximize, at each step, the overlap with all sub-matrices that came before it. The final estimate of A is then just the outer product of the estimate of C: \widehat{A} = \widehat{C}\widehat{C}^T.

The conditions for this algorithm to work boil down to requiring that (1) The observed sub-matrices are, in fact, principal sub-matrices and that the set of observed entries span all the rows of the matrix and (2) There is “enough” overlap between the principal sub-matrices for the alignment step in the algorithm to work. While (1) is fairly obvious, (2) can be boiled down to needing the index set spanned by any matrix and those that came before it (in the ordering discussed above) to have rank r: the same rank as A. When these conditions are satisfied, the authors can prove that with no noise their algorithm recovers the true matrix A exactly, and that with noise the algorithm recovers the matrix A up to an error bounded by a function ||A -\widehat{A}||_F^2 \leq K_1\sqrt{r\epsilon} + K_2\epsilon where K_1, K_2 are fairy involved functions of the eigenvalues of A, and A_l. Interestingly, they show that without their assumptions, no algorithm can ever recover the matrix A from its principal sub-matrices.

While these results mimic the style of results in the prior literature, the assumptions in this work can be easily checked given the set of observed indices. The authors consider this a benefit when compared to the assumptions in previous matrix completion theory, which cannot be checked when given a set of measurements. Additionally, the authors point out that their assumptions guarantee recovery for all rank-r matrices, indicating that there cannot be an adversarial matrix where the method fails. What the authors sacrifice for these benefits are two-fold. For one, the rank restriction on the overlap  is quite prohibitive, and basically requires that the full complexity of the entire matrix is represented in each overlap. Thus, while there do exist sets of principal sub-matrices that span very few of the entries of A and satisfy the assumptions, these are probably very unlikely sets. Unfortunately, the author’s Theorem 1 specifies that this is an unavoidable aspect of sampling principal sub-matrices. The second detriment is that the proof techniques used for the noise recovery guarantee resulted in an error bond that is sensitive to the specific eigen-spectrum of A, opaque in its behavior, and does not capture the true dependence of the perturbation \epsilon. The eigenvalue dependence is reminiscent of non-uniform recovery guarantees in the compressive sensing literature (e.g., block-diagonal sensing matrices; Eftekhari, et al.). It would have been nice to have in the paper some discussion of this dependency and if it was a fundamental aspect of the problem or a proof artifact. Overall, however, the simplicity of the algorithm combined with the guarantees make this a very nice paper.

Neuronal Circuits Underlying Persistent Representations Despite Time Varying Activity

This week in computational neuroscience journal club we discussed the paper

Neuronal Circuits Underlying Persistent Representations Despite Time Varying Activity
Shaul Druckmann and Dmitri B. Chklovskii
Current Biology, 22(22):2095–2103, 2012.

The main idea of this work was to demonstrate that the fact that non-trivial temporal dynamics occur in neural circuits, this activity does not prevent such circuits from persistently representing a constant stimulus. The main example used (although other neural regions, such as V1 are discussed) is working memory in pre-frontal cortex. The authors cite previous literature that has shown that even when a stimulus is `stored’ in memory, there is still significant time-varying neural activity. While the authors make no indication that the represented stimulus must be preserved (i.e. temporal dynamics are due to temporally changing stimulus representations), they do present a viable architecture that describes how the temporal dynamics do not effect the stimulus stored for the working memory task.

At a high level, the authors use the high redundancy of neural circuits to introduce a network that can retain a stimulus in the space of allowable features while still permitting neural activity to continue changing the null-space of the feature matrix. This paper ties together some very nice concepts from linear algebra and network dynamics to present a nice and tidy `proof-by-construction’ of the hypothesis that evolving networks can retain constant stimuli information. To tie this paper in a broader setting, the network dynamics here are similar to those used in network encoding models (e.g. the locally competitive algorithm for sparse coding) and the idea of non-trivial activity in the null space of a neural representation ties in nicely to later work discussing preparatory activity in motion tasks.

The formal mathematics of the paper are concerned with neural representations based on the linear generative model. This model asserts that a stimulus s\in\mathbb{R}^M can be composed of a linear sum of $K$ dictionary elements d_i \in\mathbb{R}^K

s = \sum_{i=1}^N d_i a_i = Da

where the coefficients a_i represent amount of each d_i needed to construct s. In the network, a_i is also the activity of neuron i, meaning that activity in neuron i directly relates to the stimulus encoded by the network.  If the number of dictionary elements K is the same as the dimension of the stimulus M, then there is no possible way for the neural activity a to deviate from the unique solution a = D^{-1}s without changing the stimulus encoded by the system. The authors note, however, that in many neural systems, the neural code is highly redundant (i.e. K >> M), indicating that the matrix D has a large null-space and thus there are an infinite number of ways that a can change while still faithfully encoding the stimulus.

With the idea of allowing neural activity to vary within the nullspace, the authors turn to an analysis of a specific neural network evolution equation where the change in neural activity decays is defined by a leaky-integrator-type system

\tau \frac{d}{dt} a = -a + La

where \tau is the network time constant, and L is the network connectivity matrix, i.e. L_{i,j} indicates how activity in neuron j influences neuron i. The authors do not specify a particular input method, however it seems that it is assumed that at t=0 the neural activity encodes the stimulus. To keep the stimulus constant, the authors observe that setting

\frac{d}{dt}s = D\frac{d}{dt}a = 0

results in the feature vector recombination (FEVER) rule

D = DL

This rule is the primary focus of attention of the paper. The FEVER rule is a mathematically simple rule that lays out the necessary condition for a linear network (and some classes of non-linear networks) to have time-varying activity while encoding the same stimulus. Note that this concept is in stark contrast to much of the encoding literature (particularly for vision) which utilizes over-complete neural codes to encode stimuli more efficiently. Instead this work does not try to constrain the encoding any more than the encoding must represent the stimulus. The remainder of this paper is concerned with how such networks can be found given a feature dictionary D and the resulting properties of the connectivity matrix L. Specifically the problem of finding L given D is highly under-determined. In fact there are K^2 unknowns and only M equations. As we already have K>M, this problem requires fairly strong regularization in order to yield a good solution. Two methods are proposed to solve for L. The first method seeks a sparse connectivity matrix by solving the \ell_1-regularized least squares (LASSO) optimization

\widehat{L} = \arg\min_{L} \|D - DL\|_F^2 + \lambda\sum_{i,j} |L_{i,j}|

where \lambda trades off between the sparsity of L and how strictly we adhere to the FEVER rule. This optimization, however, could result in neurons that both excite and inhibit other neurons. By Dale’s law, this should not be possible, so an alternate optimization is proposed. Setting L = E-N where E is an excitatory matrix and N is an inhibition matrix, the authors make the assumption that there are dispersed, yet sparse excitation connections (E is sparse) and that there are few, widely connected, inhibitory neurons (N is low-rank). Thus these two matrices can be solved using a “sparse + low-rank” optimization program

\{\widehat{E},\widehat{N}\} = \arg\min_{E,N} \|D - D(E-N)\|_F^2 + \lambda_1\sum_{i,j} |E_{i,j}| + \lambda_2\|N\|_{\ast}

where \lambda_1,\lambda_2 trade off between the sparsity of E, rank of N and adherence to the FEVER rule. This optimization is surprisingly well defined and allows for the determination of L from D. Before implementing this learning, however, the authors note two more aspects of the model. First, by using a modified FEVER rule \alpha D = DL, a memory element can be built into the network. Second, certain classes of nonlinear networks can use the FEVER rule to preserve stimulus encoding. Specifically, networks that evolve as

\tau \frac{d}{dt} a = -f(a) + Lf(a)

with a point-wise nonlinearity f(\cdot) will have ds/dt = 0 for a linear generative model. This property only holds, however, since the nonlinearity comes before the linear summation L. If the nonlinearity came afterwards, the derivative of f(\cdot) would have to be accounted for, complicating the calculations. The authors use these two properties to introduce realistic properties, such as fading memory and spiking activity, into the FEVER network.

With the network architecture set and an inference method for L, the authors then analyze the resulting properties of the connectivity L. Specifically, the authors make note of the eigenvalues and engenvectors L as well as the occurrence of different motifs (connectivity patters). The authors also make indirect observations by looking at the resulting time-traces resulting from the dynamics under the FEVER network.

In terms of the eigenvalues – arguably the most important aspect of a linear dynamical system – the first major observation is that L must have at least M eigenvalues at one. This necessitates from the fact that in the FEVER condition, L must preserve all the stimulus dimensions. As all other eigenvalues lie in the null-space of D, the authors do not prove stability directly, but instead demonstrate empirically that the FEVER networks do not result in large eigenvalues. It most likely helps that sparsity constraints attempt to reduce the overall magnitude of the network connections, reducing the chance that eigenvalues not necessary for maintaining the stimulus will be bias towards smaller values (more explicitly so in the low-rank segment of the Dale’s FEVER inference). As a side note, the authors do present, however, in the supplementary information a proof that inhibitory connections are required for stability.

For the eigenvectors, the authors mostly aim to show that there are mostly sparse connections between neurons (i.e. FEVER networks are not fully connected). Branching from this measure, the total number of single connections, pairs of connected neurons, and motifs of triads of connected neurons are compared to the similar counts in rat V1, layer V. The authors show that the above-chance occurrence of these connectivity patters match the above-chance occurrences in rat V1.

For the indirect relation to biological systems, the authors also show how the temporal evolution in time can match the behavior of pre-frontal cortex in working memory tasks. Specifically, a spiking FEVER network is driven to a given stimulus, and then allowed to evolve naturally. The authors observe that the PSTH of subsets of neurons match the qualitative behaviors observed in biological studies: ramp up, ramp down, and time-invariant.

The authors use this accumulation of evidence to as a steps to the conclusion that time-varying activity and persistent stimulus encoding are not mutually exclusive. All in all I believe the following quote from the paper summarizes the paper quite nicely:

“Our study does not refute the idea of explicit coding of time in working memory but rather shows that time-varying activity does not necessarily imply that the underlying network stimulus representations explicitly encodes time-varying properties”





Binary Neurons can have Short Temporal Memory

This (belated) post is about the paper:

Randomly connected networks have short temporal memory,
Wallace, Hamid, & Latham, Neural Computation  (2013),

which I presented a few weeks ago at the Pillow lab group meeting. This paper analyzes the abilities of randomly connected networks  of binary neurons to store memories of network inputs. Network memory is  valuable quantity to bound; long memory indicates that a network is more likely to be able to perform complex operations on streaming inputs. Thus, the ability to recall past inputs provides a proxy for being able to operate on those inputs.  The overall result seems to stand in contrast to much of the literature because it predicts a very short memory (on the order of the logarithm of the number of nodes). The authors mention that this difference in the result is due to their use of more densely connected networks. There seem to be additional differences, though, between this papers’s network construction and those analyzed in related work.

Continue reading

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

This week in lab meeting we discussed:

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Andrew M. Saxe, James L. McClelland, Surya Ganguli. arxiv (2013).

This work aims to start analyzing a gnawing question in machine learning: How do deep neural networks actually work? Continue reading