# Binary Neurons can have Short Temporal Memory

This (belated) post is about the paper:

Randomly connected networks have short temporal memory,
Wallace, Hamid, & Latham, Neural Computation (2013),

which I presented a few weeks ago at the Pillow lab group meeting. This paper analyzes the ability of randomly connected networks of binary neurons to store memories of network inputs. Network memory is a valuable quantity to bound: long memory indicates that a network is more likely to be able to perform complex operations on streaming inputs. The ability to recall past inputs thus provides a proxy for being able to operate on those inputs. The overall result seems to stand in contrast to much of the literature because it predicts a very short memory (on the order of the logarithm of the number of nodes). The authors attribute this difference to their use of more densely connected networks, although there seem to be additional differences between this paper's network construction and those analyzed in related work.

The network described in this paper consists of a set of $N$ neurons with internal states $h_i$ and activation values $x_i$ that evolve in discrete time as

$x_i(t) = \mbox{sign}(h_i(t) + u_i(t))$

$h_i(t+1) = \sum_{j=1}^N w_{ij}x_j(t)$

In these equations, each weight $w_{ij}$ represents the connection from the $j^{th}$ neuron to the $i^{th}$ neuron, and $u_i(t)$ represents the input into each neuron. One distinguishing feature of this network is the sign non-linearity, which is much more restrictive in the information it can transmit than the sigmoidal non-linearity found in other networks; the authors say the sign function is chosen to preserve the spiking behavior of the network. Aside from the non-linearity, the other key design choice is the recurrent weight matrix. Here the authors choose a “spike-and-slab” prior: each weight is exactly zero with some non-zero probability, and the non-zero weights are i.i.d. Gaussian. This choice also deviates from the work describing longer network memory, since near-unitary (i.e. random orthogonal) matrices typically have better signal-preservation properties.
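These dynamics are straightforward to simulate. The sketch below draws a spike-and-slab weight matrix and runs the binary update for a few steps; the sparsity level, weight scale, and tie-breaking at zero are my own assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500          # number of neurons
p_zero = 0.5     # "spike-and-slab": probability a weight is exactly zero (assumed)
sigma = 1.0 / np.sqrt(N * (1 - p_zero))  # scale of non-zero weights (assumed)

# Spike-and-slab recurrent weights: zero with probability p_zero,
# otherwise i.i.d. Gaussian.
mask = rng.random((N, N)) < (1 - p_zero)
W = mask * rng.normal(0.0, sigma, size=(N, N))

def step(h, u, W):
    """One discrete-time update: x = sign(h + u), then h' = W @ x."""
    x = np.sign(h + u)
    x[x == 0] = 1  # break ties so activations stay binary (+1/-1)
    return W @ x, x

# Drive the network with random binary inputs for a few steps.
h = rng.normal(size=N)
for t in range(10):
    u = rng.choice([-1.0, 1.0], size=N)
    h, x = step(h, u, W)
```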

Using this matrix construction, the authors proceed to analyze the network's memory via the state-space trajectories induced by different input sequences. The idea here is that the network will be driven to significantly different states given significantly different input sequences. To quantify this, the authors analyze the Hamming distance between the activations $x_i(t)$ for two different input sequences:

$d(t) = \frac{1}{2N} \sum_i |x_i^{(1)}(t)-x_i^{(2)}(t)|$

By analyzing the behavior of $d(t)$, the authors determine the rate at which this value decays or grows over time. First the authors find a recursive expression for $d(t)$: $d(t+1) = f(d(t))$. Interestingly, they find that in the limit of small distances, $f(d(t)) \propto d^{1/2}(t)$. With this relationship, the authors note that the system has two fixed points, at $d = 0$ and at $d = d^{\ast}$. The remainder of the paper is thus concerned with finding how fast the system reverts to these fixed points.
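One can watch these distance dynamics directly: run two copies of the same network with a shared input sequence, perturb one copy's initial state, and track $d(t)$ over time. This is a rough sketch with assumed parameter values (sparsity, weight scale, perturbation size), not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 1000
p_zero = 0.5                              # assumed sparsity, not from the paper
sigma = 1.0 / np.sqrt(N * (1 - p_zero))   # assumed weight scale
W = (rng.random((N, N)) < 1 - p_zero) * rng.normal(0.0, sigma, (N, N))

def sgn(v):
    """Sign function with ties broken to +1 so activations stay binary."""
    s = np.sign(v)
    s[s == 0] = 1
    return s

# Two copies of the same network with identical inputs, but initial
# internal states that differ at a handful of neurons.
h1 = rng.normal(size=N)
h2 = h1.copy()
flip = rng.choice(N, size=10, replace=False)
h2[flip] = -h2[flip]

d = []
for t in range(30):
    u = rng.choice([-1.0, 1.0], size=N)        # shared input sequence
    x1, x2 = sgn(h1 + u), sgn(h2 + u)
    d.append(np.abs(x1 - x2).sum() / (2 * N))  # Hamming distance d(t)
    h1, h2 = W @ x1, W @ x2
```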

This brings the paper to the main result, which states that the decay to these fixed points is exponential: $d(t) - d^{\ast} \propto e^{-\lambda t}$. This indicates that, without consistently different inputs, two networks will converge to a baseline distance exponentially fast, with the rate $\lambda$ giving the speed of this decay. The authors use this relationship, together with the fact that states are practically indistinguishable once $d(t) - d^{\ast} < N^{-1/2}$, to show that the time to indistinguishability is

$t^{\ast} \sim \frac{1}{2\lambda}\log N.$
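Plugging in numbers makes the pessimism of this bound concrete. With an illustrative decay rate (the $\lambda = 0.5$ below is an assumption for the example, not a value from the paper), growing the network from $10^3$ to $10^9$ neurons only triples the memory time:

```python
import numpy as np

lam = 0.5  # assumed decay rate, for illustration only
Ns = np.array([1e3, 1e6, 1e9])
t_star = np.log(Ns) / (2 * lam)  # time to indistinguishability
# A million-fold increase in network size (1e3 -> 1e9) only
# triples t_star, since t_star grows like log N.
print(dict(zip(Ns, np.round(t_star, 1))))
```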

At the end of the paper, the authors compare their result to the rest of the literature, discussing how these seemingly pessimistic bounds relate to other bounds where $t^{\ast} \sim N$ rather than $\log N$. They frame the difference mostly in terms of the connectivity matrix, noting that long network memory is often discussed in the regime sometimes called “the edge of chaos”. In this regime, the network retains signal energy well, but the amplification is not so high that the non-linearity saturates completely. The authors show that their network can be brought closer to the edge of chaos by reducing the number of network connections (increasing the probability that each weight is exactly zero). They mention that the required sparsity level is unrealistic given current knowledge of neural anatomy; however, they do not seem to consider changing the distribution of the non-zero weights, which could also have a large effect on how close the network is to the edge of chaos. I also feel that the binary quantization makes the network less likely to approach the edge of chaos as connections are added. Given these points, I think the paper presents an interesting case of trying to analyze networks that are a little more anatomically inspired. It would also be interesting to see whether the discrete-time analysis here extends to continuous-time systems, where spiking is more informative given the increased time resolution.
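The effect of sparsity can be probed with the same kind of two-copy simulation as above: wrap the distance computation in a function of the zero-weight probability and compare a dense network against a sparse one. Again, all parameter values here are my own assumptions for illustration, not the paper's.

```python
import numpy as np

def simulate_distance(p_zero, N=1000, T=30, n_flip=10, seed=0):
    """Track the Hamming distance between two perturbed copies of the
    binary network for a given zero-weight probability p_zero."""
    rng = np.random.default_rng(seed)
    sigma = 1.0 / np.sqrt(N * (1 - p_zero))  # assumed weight scale
    W = (rng.random((N, N)) < 1 - p_zero) * rng.normal(0.0, sigma, (N, N))

    h1 = rng.normal(size=N)
    h2 = h1.copy()
    flip = rng.choice(N, size=n_flip, replace=False)
    h2[flip] = -h2[flip]  # perturb the second copy's initial state

    d = []
    for _ in range(T):
        u = rng.choice([-1.0, 1.0], size=N)  # shared input sequence
        x1, x2 = np.sign(h1 + u), np.sign(h2 + u)
        x1[x1 == 0] = 1
        x2[x2 == 0] = 1
        d.append(np.abs(x1 - x2).sum() / (2 * N))
        h1, h2 = W @ x1, W @ x2
    return d

dense = simulate_distance(p_zero=0.1)    # densely connected network
sparse = simulate_distance(p_zero=0.95)  # sparsely connected network
```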