A couple weeks ago during lab meeting we discussed parts of Monte Carlo Gradient Estimation in Machine Learning. This is a lovely survey of the topic and unfortunately we only covered a small part of it. The things we did cover consisted the introduction, score function gradient estimators and pathwise gradient estimators. Here we will for the most part only outline a little bit of the score function estimator. Before jumping right into it let’s take a step back and consider (for this specific paper):
• The types of probabilistic objects that gradients are taken with respect to.
• Why we might want / need an estimator for the above

So what exactly are we taking gradients of? In the paper the authors introduce this object as “a general probabilistic objective” (note, for the remainder of this blog post, that all lowercase non-bold symbols except for the index variable n should be considered vector-valued quantities):

$\mathcal{F(\mathbf{\theta})} := \int p(x; \theta)f(x; \phi)dx = \mathop{\mathbb{E}}_{p(x; \theta)}\Big[f(x; \phi)\Big] \qquad (1) \bigskip$

In (1) we notice that the objective involves integrating over all the possible values $x$ can become. Now if you’re at all familiar with topics in any theory/computational orientated subject you’ll recognize that integrating (1) might be rather difficult, if not impossible. With the usual culprits of having no closed form evaluation of the integral (as opposed to say $\int x^2 dx$ for which we do have a closed form evaluation), and or issues regarding the large space of values that variables of interest can take, thus making approximation by quadrature ineffective. These reasons tell us we will need some other type of approximation. That approximation will be a Monte Carlo estimator of the expectation. Generally we can define this estimator, given samples $\hat x^{(n)} \sim p(x ; \theta)$ for $n = 1,..., N$ as (note that we are using approximately below to guide intuition, this is not the usual use of it):

$\mathop{\mathbb{E}}_{p(x; \theta)}\Big[f(x; \phi)\Big] \approx \mathcal{\bar F}_N = \bigskip \frac{1}{N} \sum_{n=1}^N f\Big(\hat x^{(n)} \Big) \qquad (2)$

One last thing that we’ll come back to later on. We note that the subject of the article and this blog post is “Monte Carlo gradient estimator”, however above I’ve outline why you might want a Monte Carlo estimator of an expectation. These two things are not necessarily the same. The gradient of the expectation is not simply the expectation of the gradient (obtained by exchanging the order of integration and differentiation), at least when the gradient is taken with respect to the parameters of the distribution itself. To make this work, we will make use of ideas from probability and statistics to write down the gradient of the expectation again as a proper expectation — thus allowing us to approximate said expectation with a Monte Carlo estimator.

So we have the thing to approximate, how we’re going to approximate it, and know what are the computational difficulties that call for an approximation to be made. What’s left is to motivate where we’d use this approximation. In short, (please refer to the original papers!) policy gradient (think reinforcement learning .i.e. Sutton & Barto), and Black Box Variational Inference both make use of what’s been outlined above (though there are other use cases!).

Both methods need to learn some set of parameters $\theta$. One way to learn such parameters is through gradient descent. So we will need the gradient of the objective (which we have defined above as the expectation), which is (as usual):

$\nabla_{\theta}\mathbb{E}_{p(x; \theta)}[f(x; \phi)] \qquad (3)$

This now calls us to refine our words a bit. We’re not trying to arrive at a Monte Carlo estimator of the expectation per-se, rather we want a Monte Carlo estimator of the gradient of that expectation. This is where the score function estimator and the pathwise estimator come into play. The score function is defined as the gradient of the log of the function $p(x; \theta)$. Thus it is (by the chain rule of differentiation):

$\nabla_{\theta}\log p(x; \theta) = \frac{\nabla_{\theta}p(x; \theta)}{p(x; \theta)}$

This is useful because it is an expression relating the gradient of the function of interest $p(x; \theta)$ to the gradient of its log. That’s good to know but why is it useful for what we hope to accomplish? This becomes clearer if we write out the gradient of the expectation (under conditions that allow the interchange of integration and differentiation):

$\nabla_{\theta} \mathbb{E}_{p(x; \theta)}[f(x; \phi)] = \nabla_{\theta} \int p(x; \theta)f(x; \phi) dx = \int f(x; \phi) \nabla_{\theta}p(x; \theta) dx \qquad (4)$

note now that the goal is to arrive at a Monte Carlo estimator of the expectation, however in (4) the right most side of you’ll note that we no longer have an expectation. This is because $p(x; \theta)$ has become $\nabla_{\theta}p(x; \theta)$, and the gradient does not in general satisfy the conditions to be considered a proper probability (e.g. the gradient can be negative!). However, due to the expression given by the definition of the score function, we can replace $\nabla_{\theta}p(x; \theta)$ by $p(x; \theta)\nabla_{\theta} \log p(x; \theta)$. Plugging this back in the the right most side of (4) we have:

$\int f(x; \phi)p(x; \theta)\nabla_{\theta} \log p(x; \theta) = \int p(x; \theta) [f(x; \phi) \nabla_{\theta} \log p(x; \theta)] = \mathbb{E}_{p(x; \theta)} [f(x; \phi)\nabla_{\theta} \log p(x; \theta)]$

which brings us back into the realm of computing a proper expectation and we can now use (2) to write down the form of the estimator:

$\frac{1}{N} \sum_{n=1}^N f(\hat x^{(n)})\nabla_{\theta}\log p(\hat x^{(n)}; \theta) \qquad \hat x^{(n)} \sim p(x; \theta)$

Now for the pathwise estimator. In the score function estimator we took the gradient of $p(x; \theta)$, for the pathwise estimator we will take the gradient of $f(x; \phi)$. The general reasoning behind why we’d want to do it, why it helps us achieve our goal is exactly the same as stated above. That is, when we take the gradient of the regular definition of the expectation, we no longer have an expectation. However for the reason of why’d you want to use the pathwise estimator over the score function we refer you to the original article. The name “pathwise” is motivated by two equivalent ways to sample a random variant from $p(x; \theta)$. The first we are familiar with: $\hat x \sim p(x; \theta)$, and have used in the score function. The second and equivalent way to do this is by sampling a random variant from a simpler, continuous distribution: $\hat \epsilon \sim p(\epsilon)$, and then transforming that variant via some function $g()$, that “encodes” the distributional parameters $\theta$ of the “end” distribution of interest. For a concrete example of what this means (which the authors give in the paper), say $p(x; \theta) = Normal(x | \mu, \Sigma)$. The “path” version of sampling a random variant $\hat x \sim p(x; \theta)$, is by first sampling $\epsilon \sim Normal(0, \mathbf{I})$. Then we define the function : $g() = g(\epsilon, \theta) = \mu + \mathbf{L}\epsilon$ . So we have that $\{ \theta \} = (\mu, \mathbf{L}\mathbf{L}^T)$, where $\mathbf{L}\mathbf{L}^T = \Sigma$, and the random variant is now $\hat x = g(\epsilon ; \theta)$. The form of this estimator is then:

$\frac{1}{N} \sum_{n=1}^N \nabla_{\theta} f(g(\hat \epsilon^{(n)}; \theta)) \qquad \hat \epsilon^{(n)} \sim p(\epsilon)$

Unfortunately that’s all we’ll get into for now. Please do take a look at the original article. Thanks for checking out the lab blog and happy holidays!

# Tutorial on Normalizing Flows

Today in lab meeting, we continued our discussion of deep unsupervised learning with a tutorial on Normalizing Flows. Similar to VAEs, which we have discussed previously, flow-based models are used to learn a generative distribution, $p_{X}(x)$, when this is arbitrarily complex and may, for example, represent a distribution over all natural images. Alternatively, in the context of neuroscience, $p_{X}(x)$ may represent a distribution over all possible neural population activity vectors. Learning $p_{X}(x)$ can be useful for missing data imputation, dataset augmentation (deepfakes) or to characterize the data generating process (amongst many other applications).

Flow-based models allow us to efficiently and exactly sample from $p_{X}(x)$, as well as to efficiently and exactly evaluate $p_{X}(x)$. The workhorse for Normalizing Flows is the Change of Variables formula, which maps a probability distribution over $X$ to a simpler probability distribution, such as a multivariate Gaussian distribution, over latent variable space $Z$. Assuming a bijective mapping $f: X \rightarrow Z$, the Change of Variables formula is

$p_{X}(x) = p_{Z}(f(x)) \bigg|\text{det}\big(\dfrac{\partial f(x)}{\partial x^{T}}\big)\bigg|$

where $\bigg|\text{det}\big(\dfrac{\partial f(x)}{\partial x^{T}}\big)\bigg|$ is the absolute value of the determinant of the Jacobian. This term is necessary so as to ensure probability mass is preserved during the transformation. In order to sample from $p_{X}(x)$, a sample from $p_{Z}(z)$ can be converted into a sample from $p_{X}(x)$ using the inverse transformation:

$z \sim p_{Z}(z)$

$x = f^{-1}(z)$

Flow-models such as NICE, RealNVP, and Glow utilize specific choices for $f(x)$ so as to ensure that $f(x)$ is both invertible and differentiable (so that both sampling from and evaluating $p_{X}(x)$ are possible), and so that the calculation of $\bigg|\text{det}\big(\dfrac{\partial f(x)}{\partial x^{T}}\big)\bigg|$ is computationally tractable (and not an $O(D^{3})$ operation, where $D$ is the dimension of $x$). In lab meeting, we discussed the Coupling Layers transformation used in the RealNVP model of Dinh et al. (2017):

$x_{1:d} = z_{1:d}$

$x_{d+1:D} = z_{d+1:D} \circ \exp(s(z_{1:d})) + t(z_{1:d})$

This is an invertible, differentiable mapping from latent variables $z \in \mathbb{R}^{D}$, which are sampled from a multivariate normal distribution, to the target distribution. Here ‘$\circ$‘ denotes elementwise multiplication and $s$ and $t$ are functions implemented by neural networks. The RealNVP transformation results in a triangular Jacobian with a determinant that can be efficiently evaluated as the product of the terms on the diagonal. We examined the JAX implementation of the RealNVP model provided by Eric Jang in his ICML 2019 tutorial.

As a neural computation lab, we also discussed the potential usefulness of flow-based models in a neuroscience context. Some potential limitations to their usefulness may lie in the fact that they are typically used to model continuous probability distributions; yet in neuroscience, we are often interested in Poisson-like spike distributions. However, recent work on dequantization, which describes how to model discrete pixel intensities with flows, may provide inspiration for how to handle the discreteness of neural data. One other potential limitation to their usefulness related to the fact that the dimensionality of the latent variable in flow-models is equal to that of the observed data. In neuroscience, we are often interested in finding lower-dimensional structure within neural population data; so flow-based models may not be well-suited for this purpose. Regardless of these potential limitations; it is clear that normalizing flows models are powerful and we look forward to continuing to explore their applications in the future.

# Step-by-step procedure for choosing a learning rate (and other optimization hyperparameters)

I’ve been using deep neural networks (DNNs) in my research. DNNs, as is often preached, are powerful models, capable of mapping almost any function. However, after the sermon is over, someone starting to train a DNN in the wild can quickly be overwhelmed by the many subtle technical details involved. The architecture, regularization techniques, and optimization are all inter-related, and these relationships change for different datasets.

My first steps in training a DNN primarily focused on architecture—do I have enough layers and filters for my problem? The optimization—stochastic gradient descent (SGD) with all of its hyperparameters (learning rate, momentum, …) and variants—was an afterthought, and I chose hyperparameters that seemed reasonable. Still, a nagging question persisted in the back of my mind: What if different hyperparameters led to even better performance? It was an obvious case of fear-of-missing-out (FOMO).

## Grid search (aka brute force) and black-box optimization techniques should be last resorts.

Due to FOMO, I began to more properly choose my hyperparameters by using a grid search (e.g., random search). This takes a lot of time. For each tweak of the architecture, I would need to rerun the entire grid search. For each grid search, out popped a solution and nothing else—no intuition about my problem, no understanding about the tradeoffs of the hyperparameters, and no clear way to check my code for bugs (except evaluating performance). The same can be said for black-box optimization techniques—which are fancier versions of grid search. I felt ashamed each grid search I ran because it’s brainless. It encourages me not to think about my data, model, or the bridge between the two: optimization.

## Let’s use our intuition about optimization to help guide our choice of hyperparameters.

Optimization is not a black-box method. In fact, gradient descent is one of the most intuitive concepts in math: You are a hiker on top of a mountain, and you need to get down. What direction and how far do you move? Inspired by this, I started reading about other ways to choose optimization hyperparameters. I came up with a step-by-step procedure that I now follow for every problem. It’s largely inspired by Smith 2018 and Jordan 2018.

This procedure outputs:
1) reasonable values for the hyperparameters,
2) intuition about the problem, and
3) red flags if there are bugs in your code.
The procedure takes a couple of training runs (fast!).

[Note: This procedure is great for researchers who want to get a model up and running. Grid/black-box search is more appropriate for architecture searches and when you really care about gaining 1% in accuracy.]

## Step-by-step procedure:

Here’s the full procedure. I’ll discuss each step individually.

1. Compute initial line search.
2. Increase learning rate.
3. Increase momentum.
4. Check if learning rate decay helps.
5. Check if a cyclical learning rate helps.
6. (optional) Check your other favorite SGD variants.
7. Compute final line search, for closure.

The procedure is general for any architecture and problem. For an example problem, I run the procedure on a small (3-layer) convolutional neural network to predict the responses of visual cortical neurons from image features (see Cowley and Pillow, 2020). I report accuracy on heldout data, as we want to make sure our optimized solution generalizes. The description of each step is short for ease of mind; I assume you are familiar with optimization and SGD (if you aren’t, check out the Convex Optimization book).

For a quick review, here is the equation for gradient descent:

And here are the equations for gradient descent with momentum:

## Step 1: Compute initial line search.

• Choose a direction in weight space (found either by training on a small amount of data or choosing weights randomly).
• Perform a line search by taking small steps (linear increments) along that direction, plotting accuracy for each step.

• How nonconvex/complicated is your accuracy landscape? Here, I have a nonconvex problem with a meandering slope and a sharp ridge with a cliff.
→ Choosing a large learning rate may mean I fall off the cliff!

Step 2: Increase learning rate.

• Start with an epsilon-small learning rate. Then, increase the learning rate for each consecutive training epoch.

• Small learning rates lead to small increases in accuracy.
→ You’ve barely moved in weight space.
• Large learning rates lead to large decreases in accuracy.
→ You’re moving too far, and you’ve jumped off the cliff!
• Just right learning rates are where the accuracy increases steadily.
• Here, we choose a learning rate of 1.5.

## Step 3: Increase momentum.

• Train for N epochs with the chosen learning rate.
• Similar to Step 2, start with a small momentum and increase it each training epoch.

• A small momentum yields only a small increase in accuracy.
A large momentum sees a drop in accuracy.
• Choose a momentum in the sweet spot—where accuracy is steadily increasing.
• Here, we choose a momentum of 0.7.
• Using the chosen learning rate and momentum, train for N epochs. Confirm that momentum improves performance.

• Thoughts: Momentum does help in this case (red above black).

## Step 4: Check if learning rate decay helps.

• Train model until convergence (might be longer than N epochs).
• Use a schedule for learning rate decay. The idea is that as you get closer to an optimum, you should make smaller steps (i.e., decay) to prevent yourself from jumping passed the optimum. Lots of choices here.
• My schedule is a power decay:
Let M be the epoch number where performance plateaus/falls off.
Then alpha = (⅔)^(1/M).

• Train model with learning rate decay. Confirm learning rate decay is needed.

• Thoughts: Learning rate decay is not really helpful for my problem. So I won’t use it.

## Step 5: Check if a cyclical learning rate helps.

• Use a cyclical learning rate. The idea is that you may want to explore different optima, so we need a way to push ourselves far from the current optimum.

Here’s one option:
• Choose the upper learning rate bound as 2 * chosen learning rate.
• Choose the lower learning rate bound as ½ * chosen learning rate.
• Start the learning rate at the upper bound then shrink to the lower bound after M epochs (where M is chosen from Step 4). Then, switch back to the largest learning rate:

• Train model with a cyclical learning rate. Confirm that a cyclical learning rate is needed.

• Thoughts: A cyclical learning rate does train faster, but not substantially. So I won’t use it.

## Step 6: (optional) Check your other favorite SGD variants.

• Likely everyone has their favorite SGD variant. Try it here!
Note: Don’t use the hyperparameters chosen above for some SGD variants, like Adam (which usually assumes a very small initial learning rate). Instead, you may need to repeat some steps to find the variant’s hyperparameters.
• I ran Adam on the same dataset, choosing a learning rate of 1-e2 (much smaller than the learning rate of 1.5 chosen in Step 1!).

• Adam does well in the beginning, likely because it adapts its learning rate and can get away with a large learning rate for the earlier epochs.
• I’ve found that Adam doesn’t always work well. This is especially true when training a DNN to predict behavior of different subjects—each subject’s behavior (e.g., velocity) may have a different variance. This is why we check!

## Step 7: Compute final line search, for closure.

• This is my favorite step. It’s the exact same as Step 1, except we now compute a line search in weight space along a line between the initial, untrained weight vector and the final, trained weight vector.

• Thoughts: This looks very similar to Step 1. The optimization first finds the ridge and then moves along this ridge.
• In this example, there doesn’t seem to be multiple local minima. But this is not always the case!  Here’s a final line search from one of my other research projects:

## Final thoughts

You can see why I like this procedure so much. It takes about an afternoon to run, and I get some interesting plots to look at and think about. The procedure outputs a nice set of hyperparameters, some intuition about the optimization problem we face, and gives me peace of mind that I shouldn’t keep twiddling hyperparameters and preventing my FOMO.

It would be great to have a gallery of observed final line searches (Step 7), just to see the variety of loss landscapes out there. In the meantime, you can check out these artistic renderings of loss landscapes: losslandscape.com

# Attention is all you need. (aka the Transformer network)

No matter how we frame it, in the end, studying the brain is equivalent to trying to predict one sequence from another sequence. We want to predict complicated movements from neural activity. We want to predict neural activity from time-varying stimuli. We even want to predict neural activity in one brain area from neural activity in another area. Thus, as computational neuroscientists, we should be intimately familiar with new machine learning techniques that allow us to better relate one sequence to another.

One sequence-to-sequence problem receiving a lot of interest in machine learning is “translation”—converting a sentence in one language (e.g., English) to another (e.g., Polish). The main challenge is getting context correct. For example, consider “bark” for the following sentences: “The bark is loud.” and “The bark is brown.” In Polish, bark would be “szczekanie” and “kora,” respectively, based on context. Machine learning folks have devised some ingenious architectures to tackle this problem. In this post, I’ll talk about one architecture that is unexpected but seems to work quite well: Transformer networks. This architecture was proposed in “Attention is all you need.” by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NIPS 2017.

## High-level intuition

The Transformer network relies on an encoding-decoding approach. The idea is to “encode” the input sequence into a latent representation (i.e., an embedding) that is easier to work with. This embedding (typically a vector of abstract variables) is then “decoded” into an output sequence (e.g., another language). Our problem, then, is to design a “useful” embedding for translation, and build an architecture that can express it.

What properties do we want our embedding to have? One property is a measure of similarity— “tomato” is more similar to “blueberry” than to “chair.” An embedding with this property has been solved with word2vec. Another important property is context—is “tomato” the subject? direct object? is it modified by “tasty” or “rotten”? It turns out we can incorporate context by adding embeddings together. For example, let embeddings e1=”tomato” and e2=”tasty”. Then
e3 = e1 + e2 corresponds to an embedding of a “tasty tomato”. We can keep tacking on more context (e.g., e4 = “subject of sentence”, e5 = “the”, e6 = “belongs to Tom”, etc.). With this embedding property, if we come across an ambiguous word like “it”, we simply find the word that “it” refers to, and add that word to the embedding!

## Architecture: Stacked attention layers

The authors came up with a way to efficiently search for context and add that context to the embedding. They use “attention”—in the general sense of the word, not based on neuroscience—as a way to find distant relationships between words. Figure 1 is a stylized version of this computation. The idea is, given a word, compare this word with all other words in the sentence for a specific context. For example, if the word is “sees”, attention will try to find “who sees?” and “sees what?” If attention finds answers to its context questions, these answers will be incorporated into the embedding via a linear combination. If attention does not receive answers, no harm is done—attention will simply pass on the unmodified word embedding (i.e., weights for the other words will be zero).

One can imagine that having one level of attention may not be the best way to extract the structure of natural language. Just like we learned in sixth grade, sentences have hierarchical structure (did you also have to draw those syntax trees?). So, we should stack levels of attention. The lower levels correspond to easy questions (“Is there an adjective to this noun?”) that likely involve only two or three words of the input sequence. The deeper levels correspond to more nuanced questions (“What is the direct object of this sentence?”) that span most if not all of the input sequence. To achieve this, the authors stacked layers of attention on top of each other. To decode, they basically reverse the process. Figure 2 is an illustration of this process.

The authors claim this approach minimizes the distance of relating one position in a sequence to another. This is because attention looks at all pairwise interactions, while other architectures (such as recurrent neural networks) sequentially look at the input sequence (making it difficult to compare the first word to the last word). The primary point of confusion while reading the paper was how input sequences were fed into the network (hopefully Fig. 2 clarifies this—each word embedding is transformed in parallel to other word embeddings. If you input N words, each encoder will output N embeddings). This architecture produced state-of-the-art results, and has since been used in many different natural language processing tasks. It also uses many bells and whistles (residual blocks, layer normalization, …)—it will be interesting to see future work argue which components are the most important for this architecture.

## Getting back to the neuroscience

As computational neuroscientists, we should take advantage of these architectures (which have been designed and trained with the aid of GPU armies), and use them to help solve our own problems. One aspect of this work (and in natural language processing in general) is the idea of adding two embedding vectors to form a more “context-relevant” embedding. This is missing from our current latent variable models. More importantly, it may be a way in which the brain also encodes context. I also think this type of attention computation will be useful when we are trying to predict sequences of natural behavior, where we cannot make use of task structure (e.g., a delay period, stimulus onset, etc.). Experimentalists are now collecting massive datasets of natural behavior—a perfect opportunity to see how this attention computation holds up!

# Lapses in perceptual judgments reflect exploration

In lab meeting this week, we discussed Lapses in perceptual judgments reflect exploration by Pisupati*, Chartarifsky-Lynn*, Khanal and Churchland. This paper proposes that, rather than corresponding to inattention (Wichmann and Hill, 2001), motor error, or $\epsilon$-greedy exploration as has previously been suggested, lapses (errors on “easy” trials with strong sensory evidence) correspond to uncertainty-guided exploration. In particular, the authors compare empirically-obtained psychometric curves characterizing the performance of rats on a 2AFC task, with predicted psychometric curves from various normative models of lapses. They found that their softmax exploration model explains the empirical data best.

Empirical psychometric curve

Psychometric curves are used to characterize the behavior of animals as a function of the stimulus intensity when performing, for example, 2AFC tasks. They are defined by four parameters:

$p(\hat{a}= Right|s) = \gamma + (1-\gamma - \lambda)\phi(s; \mu, \sigma)$

where $\phi(s; \mu, \sigma)$ is a sigmoidal curve (we will assume the cumulative normal distribution in what follows); $\mu$ determines the decision boundary and $\sigma$ is the inverse slope parameter. $\gamma$ is the lower asymptote of the curve, while $1- \lambda$ is the upper asymptote. Together, $\{\gamma, \lambda\}$, comprise the asymmetric lapse rates for the “easiest” stimuli (highest intensity stimuli).

While Bayesian ideal observer models have been able to motivate the cumulative normal shape of the psychometric curve (as the authors show in their Methods section), this paper aims to compare different normative models to explain the $\gamma$ and $\lambda$ parameters of the psychometric curve.

Inattention model

For their inattention model, the authors assume that, with probability $p_{attend}$, the rat pays attention to the task-relevant stimulus, while, with probability $1- p_{attend}$, it ignores the stimulus and instead chooses according to its bias $p_{bias}$. That is:

$p(\hat{a}=Right|s) = \phi(s; \mu, \sigma) p_{attend} + (1-p_{attend})p_{bias}$

or, equivalently, $\gamma = p_{bias}(1-p_{attend})$ and $\lambda = (1-p_{bias})(1-p_{attend})$.

Softmax exploration model

In comparison to the inattention model, which assumes that, when the animal is paying attention to the task-relevant variable, it chooses the action corresponding to the maximum expected action-value, $\hat{a} =\max_{a}Q(a)$; the softmax-exploration model assumes that the animal chooses to go right in a 2AFC task according to

$p(\hat{a} = Right|Q(a)) = \dfrac{1}{1+exp(-\beta(Q(R)-Q(L))}$

where $\beta$ is the inverse temperature parameter and controls the balance between exploration and exploitation and $Q(R) - Q(L)$ is the difference in the expected value of choosing Right compared to choosing Left (see below). In the limit of $\beta \rightarrow \infty$, the animal will once again choose its action according to $\hat{a} =\max_{a}Q(a)$; while in the low $\beta$ regime, the animal is more exploratory and may choose actions despite them not having the largest expected action values.

In the task they consider, where an animal has to determine if the frequency of an auditory and/or visual cue exceeds a predetermined threshold, the expected action values are $Q(R) = p(High|x_{A}, x_{V})r_{R}$, where $p(High|x_{A}, x_{V})$ is the posterior distribution over the category of the stimulus (whether the frequency of the generated auditory and/or visual stimuli are above or below the predetermined threshold) given the animal’s noisy observations of the auditory and visual stimuli, $x_{A}$ and $x_{V}$. $r_{R}$ is the reward the animal will obtain if it chooses to go Right and is correct. Similarly, $Q(L) = p(Low|x_{A}, x_{V})r_{L}$.

The authors show that the softmax exploration model corresponds to setting the lapse parameters as

$\gamma = \dfrac{1}{1 + exp(\beta r_{L})}$ and $\lambda = \dfrac{1}{1+exp(\beta r_{R})}$

Model Comparison

The authors compare the empirically obtained psychometric curves with those predicted by the inattention and exploration models for two experimental paradigms:

1. Matched vs Neutral Multisensory stimuli: in this experiment, rats are either presented with auditory and visual stimuli of the same frequency (‘matched’), or with auditory and visual stimuli, where one of the channels has a frequency close to the category threshold, so is, in effect, informationless (‘neutral’). The idea here is that both matched and neutral stimuli have the same ‘bottom-up salience’, so that the $p_{attend}$ parameter is the same for both matched and neutral experiments in the inattention model. By contrast, there is no such restriction for the softmax exploration model, and there is a different $\beta$ exploration parameter for the matched and neutral experiments. The authors find that the psychometric curves corresponding to the softmax model resemble the shape of the empirically obtained curves more closely; and that BIC/AIC are lower for this model.
2. Reward manipulations: whereas the reward for choosing left or right was previously equal, now the reward for choosing right (when the frequency is above the predetermined threshold) is either increased or decreased. Again, the authors find that the psychometric curves corresponding to the softmax model resemble the shape of the empirically obtained curves more closely; and that BIC/AIC are lower for this model.

Reflections on this paper

By demonstrating that lapse rates can be manipulated with rewards or by using unisensory compared to multisensory stimuli, this paper highlights that traditional explanations for lapses as being due to a fixed probability of the animal neglecting the task-relevant stimulus; motor-error or $\epsilon-$greedy exploration are overly-simplistic. The asymmetric effect on left and right lapse rates of modified reward is particularly interesting, as many traditional models of lapses fail to capture this effect. Another contribution that this paper makes is in implicating the posterior striatum and secondary motor cortex as areas which may be involved in determining lapse rates; and better characterizing the role of these areas in lapse behavior than has been done in previous experiments.

This being said, some lab members raised some questions and/or points of concern as we discussed the paper. Some of these points include:

1. We would have liked to see further justification for keeping $p_{attend}$ the same across the matched and neutral experiments and we question if this is a fair test for the inattention model of lapses. Previous work such as Körding et al. (2007) makes us question whether the animal uses different strategies to solve the task for the matched and neutral experiments. In particular, in the matched experiment, the animal may infer that the auditory and visual stimuli are causally related; whereas in the neutral experiment, the animal may detect the two stimuli as being unrelated. If this is true, then it seems strange to assume that $p_{attend}$ and $p_{bias}$ for the inattention model should be the same for the matched and neutral experiments.
2. When there are equal rewards for left and right stimuli, is there only a single free parameter determining the lapse rates in the exploration model (namely $\beta$)? If so, how do the authors allow for asymmetric left and right lapse rates for the exploration model curves of Figure 3e (that is, the upper and lower asymptotes look different for both the matched and neutral curves despite equal left and right reward, yet the exploration model seems able to handle this – how does the model do this?).
3. How could uncertainty $\beta$ be calculated by the rat? Can the empirically determined values of $\beta$ be predicted from, for example, the number of times that the animal has seen the stimulus in the past? And what were some typical values for the parameter $\beta$ when the exploration model was fit with data? How exploratory were the rats for the different experimental paradigms considered in this paper?

# The Lottery Ticket Hypothesis

Deep learning is black magic. For some reason, a neural network with millions of parameters is not cursed to overfit. Somewhere, either in the architecture or the training or the weights themselves, exists a magic that allows deep neural networks to generalize.  We are now only beginning to understand why this is.  As in most cases in science, a good starting place is to observe a paradox about the system, suggest a hypothesis to explain the paradox, and then test, test, test.

The Paradox: This week we discussed “The lottery ticket hypothesis: Finding sparse, trainable neural networks” by Frankle and Carbin, ICML, 2019. Recent studies have shown that a deep neural network can be pruned down to as little as 10% of the size of the original network with no loss in test prediction. This leads to a paradox: Training the small, pruned network (randomly initialized) leads to worse test prediction than training a large network and then pruning it. This is heresy to the ML doctrine of “Thou shalt start simple, and then go complex.” What is going on?

The Hypothesis: This paradox suggests that the initialization of the weights of the pruned network matters.  This leads to the Lottery Ticket Hypothesis:

The Lottery Ticket Hypothesis: “dense, randomly-initialized, feed-forward networks contain subnetworks (i.e., winning tickets) that—when trained in isolation—reach test prediction comparable to the original network in a similar number of iterations.”

These “winning tickets” start out with an initialization that make training particularly effective.  If the winning tickets were re-initialized randomly, they would no longer be winning tickets—training would not reach the same level of test prediction. This may explain why deep and wide networks tend to perform better than shallow, narrow networks—> the deeper, wider networks have more chances of having the winning ticket. It also suggests that training a neural network is akin to stochastic gradient descent seeking out these winning tickets.

The Test, Test, Test: To find the winning ticket, the authors employ iterative pruning. They start with a large neural network, train it a bit, set the smallest-magnitude weights to zero (these weights are no longer trainable), rewind the trainable parameters to their original initialized values, and train again. This is repeated until pruning drastically hurts test prediction. They find that these winning tickets can be as sparse as 3.7% of the original network’s size while having the same if not higher test prediction (Figure 1, solid lines). Interestingly, if they randomly re-initialize the weights of these pruned networks, test prediction decreases considerably  (Figure 1, dashed lines, ‘reinit’). Thus, we have a way to identify winning tickets, these winning tickets are a measly fraction of the size of the original network, and their original initialization is crucial to train them.  The authors do a thorough job of confirming the robustness of this effect (45 figures in all). However, they did not investigate the properties of the subnetworks (e.g., did pruning happen most in the deeper layers?).

One question I had remaining is if these winning tickets were necessary or sufficient to train the large network.  One analysis that could get at this question is the following. First, identify a winning ticket, and then re-initialize its weights in the original, large network. This ensures that this subnetwork is no longer is a winning ticket. Keep repeating this process.  After removing winning tickets, does the large network fail to train? How many winning tickets does the large network have?  Will the large network always have a winning ticket?  We could also do the reverse of this: For a large network, keep initializing subnetworks that are winning tickets.  Is it the case that with more winning tickets, the network trains faster with higher test prediction?

Implications for deep learning: There are several implications for deep learning. 1. Finding out what is special about the initializations of these winning tickets may help us improve initialization strategies. 2. We may be able to improve optimization techniques by better guiding stochastic gradient descent to find and train the winning ticket as fast as possible. 3. New theory that focuses on subnetworks may lead to more insight into deep learning.  4. These winning tickets may be helpful for solving other tasks (i.e., transfer learning). There have already been some follow up studies about these issues and re-examining the lottery ticket hypothesis:  Liu et al., 2019, Crowley et al., 2019, Frankle et al., 2019, Zhou et al, 2019.

Does the brain have winning tickets? I had one recurring thought while reading this paper: I know brains prune synaptic connections substantially during development—could the brain also consist of whittled-down winning tickets? After some searching, I realized that it is largely unknown the amount of pruning that occurs or when certain pruning happens.  What would be some tell-tale signs of winning tickets?  First, substantial pruning should occur (and likely does…perhaps as much as 50%). Second, randomly initializing a developing circuit should lead to a drop in performance after the same amount of training as a control subject (not sure if we can randomly set synaptic weights yet). Third, what would happen if we could prune small-magnitude synaptic connections ourselves during development? Could the brain recover? These tests could first be carried out in insects, where we have gene lines, optogenetics, whole-brain recordings, and well-labeled cell types.

# Deep Neural Networks as Gaussian Processes

In lab meeting this week, we read Deep Neural Networks as Gaussian Processes by Lee, Bahri, Novak, Schoenholz, Pennington and Sohl-Dickstein, and which appeared at ICLR 2018. The paper extends a result derived by Neal (1994); and the authors show that there is a correspondence between deep neural networks and Gaussian processes. After coming up with an efficient method to evaluate the associated kernel, the authors compared the performance of their Gaussian process model with finite width neural networks (trained with SGD) on an image classification task (MNIST, CIFAR-10). They found that the performance of the finite width networks approached that of the Gaussian process performance as the width increased, and that the uncertainty captured by the Gaussian process correlated with mean squared prediction error. Overall, this paper hints at new connections between Gaussian processes and neural networks; and it remains to be seen whether future work can harness this connection in order to extend Gaussian process inference to larger datasets, or to endow neural networks with the ability to capture uncertainty. We look forward to following progress in this field.

Single Layer Neural Networks as Gaussian Processes – Neal 1994

Let us consider a neural network with a single hidden layer. We can write the ith output of the network, $z_{i}^{1}$, as

$z_{i}^{1}(x) = b_{i}^{1} + \sum_{j}^{N_{1}}W_{ij}^{1}x_{j}^{1}(x)$

where $x_{j}^{1}(x) = \phi(b_{j}^{0} + \sum_{k}^{d_{in}}W_{jk}^{0}x_{k})$ is the post-activation of the jth neuron in the hidden layer; $\phi(x)$ is some nonlinearity, and $x_{k}$ is the kth input to the network.

If we now assume that the weights for each layer in the network are sampled i.i.d. from a Gaussian distribution: $W_{ij}^{1} \sim \mathcal{N}(0, \dfrac{\sigma_{w}^{2}}{N_{1}})$, $W_{ij}^{0} \sim \mathcal{N}(0, \dfrac{\sigma_{w}^{2}}{d_{in}})$; and that the biases are similarly sampled: $b_{i}^{1} \sim \mathcal{N}(0, \sigma_{b}^{2})$ and $b_{i}^{0} \sim \mathcal{N}(0, \sigma_{b}^{2})$; then it is possible to show that, in the limit of $N_{1} \rightarrow \infty$, $z_{i} \sim \mathcal{GP} (0, K_{\phi})$, for a kernel $K_{\phi}$ which depends on the nonlinearity. In particular, this follows from application of the Central Limit Theorem: for a fixed input to the network $\vec{x}$, $z_{i}^{1} (\vec{x}) \rightarrow \mathcal{N}(0, \sigma_{b}^{2}+\sigma_{w}^{2}V_{\phi}(x^{1}(\vec{x})))$ as $N_{1} \rightarrow \infty$ where $V_{\phi}(x^{1}(\vec{x})) \equiv \mathbb{E}[(x^{1}_{i}(\vec{x}))^{2}]$ (which is the same for all $i$).

We can now apply a similar argument to the above in order to examine the distribution of ith output of the network for a collection of inputs: that is we can examine the joint distribution of $\{z_{i}^{1}(\vec{x}^{\alpha = 1}), z_{i}^{1}(\vec{x}^{\alpha = 2}), ..., z_{i}^{1}(\vec{x}^{\alpha = k})\}$. Application of the Multidimensional Central Limit Theorem tells us that, in the limit of $N_{1} \rightarrow \infty$,

$\{z_{i}^{1}(\vec{x}^{\alpha = 1}), z_{i}^{1}(\vec{x}^{\alpha = 2}), ...,z_{i}^{1}(\vec{x}^{\alpha = k})\} \sim \mathcal{N}(0, K_{\phi})$,

where $K_{\phi} \in \mathbb{R}^{k \times k}$ and $K_{\phi}(\vec{x}, \vec{x}') \equiv \sigma_{b}^{2} + \sigma_{w}^{2}C_{\phi}(\vec{x}, \vec{x'})$ and $C_{\phi}(\vec{x}, \vec{x'}) \equiv \mathbb{E}[x_{i}^{1}(\vec{x})x_{i}^{1}(\vec{x}')]$.

Since we get a joint distribution of this form for any finite collection of inputs to the network, we can write that $z_{i}^{1} \sim \mathcal{GP}(0, K_{\phi})$, as this is the very definition of a Gaussian process.

This result was shown in Neal (1994); and the precise form of the kernel $K_{\phi}$ was derived for the error function (a form of sigmoidal activation function) and Gaussian nonlinearities in Williams (1997).

Deep Neural Networks as Gaussian Processes

Lee et al. use similar arguments to those presented in Neal (1994) to show that the $i$th output of the $l$th layer of a network with a Gaussian prior over all of the weights and biases is a sample from a Gaussian process in the limit of $N_{l} \rightarrow \infty$. They use an inductive argument of the form: suppose that $z_{j}^{l-1} \sim \mathcal{GP}(0, K_{\phi}^{l-1})$ (the jth output of the $(l-1)$th layer of the network is sampled from a Gaussian process). Then:

$z_{i}^{l} \equiv b_{i}^{l} + \sum_{j=1}^{N_{l}}W_{ij}^{l}x_{j}^{l}(\vec{x})$

is Gaussian distributed as $N_{l} \rightarrow \infty$ and any finite collection of $\{z_{i}^{l}(\vec{x}^{\alpha=1}), ..., z_{i}^{l}(\vec{x}^{\alpha=k})\}$ will have a joint multivariate Gaussian distribution, i.e., $z_{i}^{l} \sim \mathcal{GP}(0, K_{\phi}^{l})$ where

$K_{\phi}^{l}(\vec{x}, \vec{x'}) \equiv \mathbb{E}[z_{i}^{l}(\vec{x})z_{i}^{l}(\vec{x'})] = \sigma_{b}^{2} + \sigma_{w}^{2} \mathbb{E}_{z_{i}^{l-1}\sim \mathcal{GP}(0, K_{\phi}^{l-1})}[\phi(z_{i}^{l-1}(\vec{x})) \phi(z_{i}^{l-1}(\vec{x'}))]$.

If we assume a base kernel of the form $K^{0}(\vec{x}, \vec{x'}) \equiv \sigma_{b}^{2} + \sigma_{w}^{2}(\dfrac{\vec{x}\cdot \vec{x'}}{d_{in}})$, these recurrence relations can be solved in analytic form for the ReLU nonlinearity (as was demonstrated in Cho and Saul (2009)), and they can be solved numerically for other nonlinearities (and Lee et al., give a method for finding the numerical solution efficiently).

Comparison: Gaussian Processes and Finite Width Neural Networks

Lee et al. went on to compare predictions made via Gaussian process regression with the kernels obtained by solving the above recurrence relations (for nonlinearities ReLU and tanh), with the predictions obtained from finite width neural networks trained with SGD. The task was classification (reformulated as a regression problem) of MNIST digits and CIFAR-10 images. Overall, they found that their “NNGP” often outperformed finite width neural networks with the same number of layers for this task; and they also found that the performance of the finite width networks often approached that of the NNGP as the width of these networks was increased:

# oi-VAE: output interpretable VAEs

We recently read oi-VAE: Output Interpretable VAEs for Nonlinear Group Factor Analysis, which was published at ICML in 2018 by Samuel Ainsworth, Nicholas Foti, Adrian Lee, and Emily Fox. This paper proposes a nonlinear group factor analysis model that is an adaptation of VAEs to data with groups of observations. The goal is to identify nonlinear relationships between groups and to learn latent dimensions that control nonlinear interactions between groups. This encourages disentangled latent representations among sets of functional groups. A prominent example in the paper is motion capture data, where we desire a generative model of human walking and train on groups of observed recorded joint angles.

Let us consider observations $\mathbf{x}$ that we group into $G$ groups $\mathbf{x} = [\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(G)}]$.  We note that the paper does not discuss how to choose the groups and assumes that a grouping has already been specified. The generative model maps a set of latent variables to group-specific generator networks via group-specific mapping matrices $\mathbf{W}^{(g)}$  such that

$\mathbf{z} \sim \mathcal{N}(0, \mathbf{I}) \\ \mathbf{x}^{(g)} \sim \mathcal{N}( f_{\theta_g}^{(g)} (\mathbf{W}^{(g)} \mathbf{z} ))$

for each $g$.

oi-VAE generative model (Ainsworth et al., 2018)

For learning interpretable sets of latents that control separate groups, the key feature of this approach is to place a sparsity-inducing prior on the columns of each $\mathbf{W}^{(g)}$. The authors use a hierarchical Bayesian sparse prior that when analytically marginalized corresponds to optimizing a group lasso penalty on the columns of $\mathbf{W}^{(g)}$.

The model is trained in the standard VAE approach by optimizing the ELBO with an amortized inference network $q_{\phi}(\mathbf{z} | \mathbf{x})$, with the addition of the group lasso penalty and a prior on the parameters of the generator networks

$\mathcal{L}(\phi, \theta, \mathcal{W}) = \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})} [ \log p(\mathbf{x} | \mathbf{z}, \mathcal{W}, \theta)] - D_{\mathrm{KL}} ( q_\phi(\mathbf{z} | \mathbf{x}) || p(\mathbf{z}) ) + \log p(\theta) - \lambda \sum_{g,j} \| w_j^{(g)} \|_2$

where $w_j^{(g)}$ is the $j$-th column of $\mathbf{W}^{(g)}$ and $\lambda$ is a parameter controlling the sparsity. The prior $\log p(\theta)$ fixes the scaling of the neural network parameters relative to the mapping matrices $\mathbf{W}^{(g)}$.

The above objective consists of a differentiable term (the ELBO plus log prior on $\theta$) plus a convex but non-differential term (group LASSO). Therefore the authors use proximal gradient methods to optimize it. First, they update all parameters using the gradients of the ELBO plus log-prior with respect to $\phi, \theta$, and $\mathcal{W}$. Then, they apply the proximal operator

$w_{j}^{(g)} \leftarrow \frac{ w_{j}^{(g)} } { \| w_{j}^{(g)} \|_2 } ( \| w_{j}^{(g)} \|_2 - \eta \lambda )_+$

to each $\mathbf{W}^{(g)}$ to respect the group lasso penalty, where $\eta$ is a step-size. The authors fixed $\lambda$ for all of their experiments fitting oi-VAE, so one question I had reading was how the authors determined $\lambda$ and how varying $\lambda$ affects the results.

The authors validate the approach on a toy example. They generated synthetic bars data, where one row of a square matrix was sampled from a Gaussian with non-zero mean while the rest of the matrix was sampled from zero-mean noise. The authors fit oi-VAE with each group set to a row of observations, and found that the model learned the appropriate sparse mapping where each latent component mapped to one of the rows. This latent space improved on the VAE, which did not have any discernible structure in the latent space. Importantly, oi-VAE still successfully identified the correct number of latent components (8) and sparse mapping even when the model was fit with double the amount of components.

(a) Synthetic bar data example and (b) reconstruction from oi-VAE. The learned oi-VAE latent-group mappings (c) match the true structure, while a VAE (d) does not learn a sparse structure.

After validating the approach, the authors applied it to motion capture data. Here, the output groups were different joint angles. They trained the oi-VAE model on walking motion capture data. The learned latent dimensions nicely corresponded to intuitive groups of joint angles, such as the left leg (left foot, left lower leg, left upper leg). Next, the imposed structure in the model helped it generate more realistic unconditional samples of walking than the VAE, presumably because the inductive bias allowed oi-VAE to better learn invalid joint angles.

Unconditional walking pose samples from the VAE and oi-VAE models.

These results suggest the oi-VAE is a useful model for discovering nonlinear interactions between groups. In particular, I liked the approach of adding structure in the generative model to gain interpretability, and hypothesize that adding other forms of structure to VAE generative models is a good way to encourage disentangled representations (see a recent example of this in Dieng et al., 2019).

Two questions when using the approach are how to choose $\lambda$ and how to choose the grouping of the data, as this work assumes a grouping has been chosen. In some data we may have prior knowledge about a natural grouping structure but that will not always be the case. However, even without multiple groups, the approach could be useful for learning the number of latent dimensions useful for explaining the amount of data. Finally, we point the reader to factVAE, where the authors further develop this idea to simultaneous learn complementary sparse structure in the inference network and generative model.

# Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization

This week we continued the deep generative model theme and discussed Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization. This work is in the context of statistical compressive sensing, which attempts to recover a high-dimensional signal $x \in \mathbb { R } ^ { n }$ from $m$ low dimensional measurements $y \in \mathbb { R } ^ { m }$ in a way that leverages a set of training data $X$. This is in contrast to traditional compressive sensing, which a priori asserts a sparsity based prior over $x$ and learns the recovery of a single $x$ from a single compressed measurement $y$.

The central compressive sensing problem can be written as

$y = W x + \epsilon$.

In traditional compressive sensing,  $W$ is a known random Gaussian matrix that takes linear combinations of the sparse signal $x$ to encode $y$. Here, $\epsilon$ describes additive Gaussian noise. The goal of compressive sensing is to decode $x$, given $m$ measurements $y$. This is traditionally done via LASSO, which minimizes the $\ell_1$ convex optimization problem $\hat { x } = \arg \min _ { x } \| x \| _ { 1 } + \lambda \| y - W x \| _ { 2 } ^ { 2 }$. In contrast, statistical compressive sensing is a set of methods which recovers $x$ from $y$ using a procedure that results from training on many examples of $x$. That is, the decoding of $x$ from $y$ is done in a way that utilizes statistical properties learned from training data $X$. The encoding, on the other hand, may either involve a known random Gaussian matrix $W$, as is the case above, or it may also be learned as a part of the training procedure.

There are two statistical compressive sensing methods discussed in this paper. One is prior work that leverages the generative stage of variational autoencoders to reconstruct $x$ from $y$ (Bora et al. 2017). In this paper, the decoding network of learned VAE creates an $x$ from a sample $z$. This is adapted to a compressive sensing framework of recreating the best $x$ from measurement $y$ by solving the following minimization problem:

$\widehat { x } = G \left( \arg \min _ { z } \| y - W G ( z ) \| _ { 2 } \right)$

Here, $G(z)$ is the VAE decoding mapping. The optimization is: given some measurements $y$, we seek to find the $z$ that generates an $x$ such that, when it is compressed, best matches the measurements $y$.

This VAE-based decompression method is used as a comparison to a new method presented in this paper, called the Uncertainty Autoencoder (UAE) which is able to optionally learn both an encoding process, the mapping of $x$ to $y$, and a decoding process, the recovery of a reconstructed $x$ from a $y$. It learns these encoding and decoding mappings by maximizing the mutual information between $X$ and $Y$, parameterized by encoding and decoding distributions. The objective, derived through information theoretic principles, can be written as:

$\max _ { \phi , \theta } \sum _ { x \in \mathcal { D } } \mathbb { E } _ { Q _ { \phi } ( Y | x ) } \left[ \log p _ { \theta } ( x | y ) \right] \stackrel { \mathrm { def } } { = } \mathcal { L } ( \phi , \theta ; \mathcal { D } )$

Here, $Q _ { \phi } ( Y | X)$ is an encoding distribution parameterized like the original sparse coding compression mapping $\mathcal { N } \left( W ( X ) , \sigma ^ { 2 } I _ { m } \right)$, and  $\log p _ { \theta } ( x | y )$ is a variational distribution that decodes $x$ from $y$ using a deep neural network parameterized by $\theta$. This objective is maximized by drawing training data samples from what is asserted as a uniform prior over $Q$, which is simply the data itself, $Q_{data}(X)$.

Using this objective, it is possible to derive an optimal linear encoder $W$ under a Gaussian noise assumption in the limit of infinite noise. This linear encoding, however, is not the same as PCA, which is derived under the assumption of linear decoding. UAE linear compression, instead, makes no assumptions about the nature of the decoding. The authors use this optimal linear compressor $W*$ on MNIST, and use a variety of classification algorithms (k-nearest neighbors, SVMs, etc) on this low-d representation to test classification performance. They find that the UAE linear compression better separates clusters of digits than PCA. It is unclear, however, how this UAE classification performance would compare to linear compression algorithms that are known to work better for classification, such as random projections and ICA. I suspect it will not do as well. Without these comparisons, unclear what use this particular linear mapping provides.

The authors demonstrate that UAE outperforms both LASSO and VAE decoding on reconstruction of MNIST and omniglot images from compressed measurements. They implement two versions of the UAE, one that includes a trained encoding $W$, and another where $W$ is a random Gaussian matrix, as it is for the other decoding conditions. This allows the reader to distinguish to what extent the UAE decoder $p(x|y)$ does a better job at reconstructing $x$ under the same encoding as the alternative algorithms.

Most interestingly, the authors test the UAE in a transfer compressive sensing condition, where an encoder and decoder is learned on some data, say omniglot, and the decoder is re-learned using compressed measurements from different data, MNIST. In this condition, the training algorithm has no access to the test MNIST signals, but still manages to accurately recover the test signals given their compressed representations. This suggests that reasonably differently structured data may have similar optimal information-preserving linear encodings, and an encoder learned from one type of data can be utilized to create effective reconstruction mappings across a variety of data domains. The applications here may be useful.

It is unclear, however, how well these comparisons hold up in a context where traditional compressive sensing has proven to be very effective, like MRI imaging.

There are interesting parallels to be drawn between the UAE and other recent work in generative modeling and their connections to information theory. As the author’s point out, their objective function corresponds to last week’s $\beta$VAE objective function with $\beta = 0$. That is, the UAE objective written above is the same as the VAE objective minus the KL term. Though a VAE with $\beta = 0$ does not actually satisfy the bound derived from the marginal distribution, and hence is not a valid ELBO, is does satisfy the bound of maximal mutual information. And as Mike derives in the post below, there is an additional connection between rate-distortion theory and the $\beta$VAE  objective. This suggests that powerful generative models can be derived from either principles of Bayesian modeling or information theory, and connections between the two are just now beginning to be explored.

# Reductions in representation learning with rate-distortion theory

In lab meeting this week, we discussed unsupervised learning in the context of deep generative models, namely $\beta$-variational auto-encoders ($\beta$-VAEs), drawing from the original, Higgins et al. 2017 (ICLR), and its follow-up, Burgess et al. 2018. The classic VAE represents a clever approach to learning highly expressive generative models, defined by a deep neural network that transforms samples from a standard normal distribution to some distribution of interest (e.g., natural images).  Technically, VAE training seeks to maximize a lower bound on the likelihood $p_\theta(x) = \int p_\theta(x\mid z) p(z) dz$, where $p_\theta(x|z)$ defines the generative mapping from latents $z$ to data $x$. This “evidence lower bound” (ELBO) depends on a variational approximation to the posterior, $q_\phi(z\mid x)$, which is also parametrized by a deep neural network (the so-called “encoder”).

A crucial drawback to the classic VAE, however, is that the learned latent representations tend to lack interpretability. The $\beta$-VAE seeks to overcome this limitation by learning “disentangled” representations, in which single latents are sensitive to single generative factors in the data and relatively invariant to others (Bengio et al. 2013). I would call these “intuitively robust” — rotating an apple (orientation) shouldn’t make its latent representation any less red (color) or any less fruity (type). To overcome this challenge, $\beta$-VAEs optimize a modified ELBO given by:

$\underset{\theta,\phi}{\text{maximize}}\:\:\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right]-\beta D_{KL}(q_\phi(z\mid x)\Vert\, p(z))$

with and standard VAEs corresponding to $\beta=1$. The new hyperparameter $\beta$ controls the optimization’s tension between maximizing the data likelihood and limiting the expressiveness of the variational posterior relative to a fixed latent prior $p(z)=\mathcal{N}(0,I)$.

Recent work has been interested in tuning the latent representations of deep generative models (Adversarial Autoencoders (Makhzani et al. 2016), InfoGANS (Chen et al. 2016), Total Correlation VAEs (Chen et al. 2019), among others), but the generalization used by $\beta$-VAEs in particular looked somehow familiar to me. This is because $\beta$-VAEs recapitulate the classical rate-distortion theory problem. This was observed briefly also in recent work by Alemi et al. 2018, but I would like to elaborate and show explicitly how $\beta$-VAEs are reducible to a distortion-rate minimization using deep generative models.

Rate-distortion theory is a theoretical framework for lossy data compression through a noisy channel. This fundamental problem in information theory balances the minimum permissible amount of information (in bits) transmitted across the channel, the “rate”, against the corruption of the original signal, a penalty measured by a “distortion” function $d(x,z)$. Our terminology changes, but the fundamental problem is the same; I made that comparison as obvious as possible in the figure below.

Derivation. Given a dataset $\mathcal{D}$ with a distribution $p^*(x)$, define any statistical mapping $q_\phi(z\mid x)$ that encodes $x$ into a code $z$. Note that $q_\phi$ is just an encoder, and together they induce a joint distribution $p(x,z)=q_\phi(z\mid x)p^*(x)$ with a marginal $p(z)=\int dx\, q_\phi(z\mid x)p^*(x)$. The distortion-rate optimization would minimize distortion $d(\cdot,\cdot)$ subject to a maximum rate $R$, i.e.

$\underset{q_\phi(z\mid x),p(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]\:\:\text{subject to}\:\: I(x,z)\le R$

$\Longrightarrow \underset{q_\phi(z\mid x),p(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]-\beta I(x,z)$

Consider first the mutual information. We leverage a more tractable upper bound with

$I(x,z)=\int dx\,p^*(x)\int dz\,q_\phi(z\mid x)\log\frac{q_\phi(z\mid x)}{p(z)}\le \int dx\,p^*(x)\int dz\,q_\phi(z\mid x)\log\frac{q_\phi(z\mid x)}{m(z)}$

$\text{since}\:\: D_{KL}\left(p(z)\Vert\, m(z)\right)\Longrightarrow -\int dz\,p(z)\log p(z) \le -\int dz\,p(z)\log m(z)$

We’ve replaced the marginal $p(z)$ induced by our choice of encoder $q_\phi$ with another distribution $m(z)$ that makes the optimization more tractable, e.g. $\mathcal{N}(0,I)$ in the VAE. Our objective can be rewritten as

$\underset{q_\phi(z\mid x),m(z)}{\text{minimize}}\:\:\mathbb{E}_{p(x,z)}[d(x,z)]-\beta\, \mathbb{E}_{x\sim\mathcal{D}}\left[D_{KL}(q_\phi(z\mid x)\Vert\, m(z))\right]$

Suppose the distortion of interest is posterior density (mis)estimation, $d(x,z)=-\log p_\theta(x\mid z)$. Such a function penalizes representations $z$ from which we cannot regenerate an observed data vector $x$ through the decoding network $p_\theta$ with high probability. A typical distortion-rate problem would fix the distortion function, but we choose to learn this decoder. We can optimize the objective for each $x$ to eliminate the outer expectation over the data $\mathcal{D}$, fix $m(z)=\mathcal{N}(0,I)$, and recover the $\beta$-VAE objective precisely:

$\underset{q_\phi(z\mid x),p_\theta(x\mid z)}{\text{maximize}}\:\:\mathbb{E}_{q_\phi(z\mid x)}\left[\log p_\theta(x\mid z)\right]-\beta\, D_{KL}(q_\phi(z\mid x)\Vert\, m(z))$

When $\beta> 1$, our optimization prioritizes minimizing the second term (rate) over maximizing the first one (distortion). In this sense, the authors’ argument for large $\beta$ can be reinterpreted as an argument for higher-distortion, lower-rate codes (read: latent representations) to encourage interpretability. I edited a figure below from Alemi et al. 2018 to clarify this.

Information-theoretic hypotheses abound. Perhaps enforcing optimization in this region could discourage solutions that depend on learning an ultra-powerful decoder (VAE: generator) $p_\theta(x\mid z)$, in other words solutions that depend on a good code, not necessarily a good decode. Does eliminating this possibility simply make room to fish out an ad-hoc interpretable representation, or is there a more sophisticated explanation waiting to be found? We’ll see.

# Machine Theory of Mind

In lab meeting last week, we read Machine Theory of Mind, a recent paper from Neil Rabinowitz and his collaborators at DeepMind & Google Brain (a trimmed version of the paper was presented at ICML 2018). Here, Theory of Mind (ToM) is broadly defined as the ability to represent the mental states of others. This paper aims to demonstrate ToM in an artificial agent. While designing & training such an agent constitutes one challenge, the authors must first devise a scenario in which ToM can be convincingly shown. Inspired by the Sally-Anne test — a classic test of ToM from developmental psychology that evaluates whether a child understands that others can hold false beliefs — the authors construct an analogous test, then train an agent to successfully pass it.

The paper is composed of a series of experiments that build in complexity to this final test. Within each experiment are three key parts. First, the environment: a simple 11×11 grid-world containing walls and 4 colored boxes that are all randomly located within each new world. Second, the agents: an individual agent belongs to a particular “species” according to its policy for acting within an environment. Agents can behave randomly, algorithmically, or with a learned policy (via deep RL). The trajectory of a particular agent within a particular environment constitutes an episode. Reward within an episode is generally maximized by navigating to a box of a particular color as fast as possible. However, limitations on the sightedness and statefulness of the agents, as well as the inclusion of more complex subgoals, are adjusted per the needs of each experiment. Finally, the observer: a meta-learning agent, called ToMnet, that parses the episodes of many agents in many environments so as to learn a prior over the behavior of an agent species. At test time, ToMnet uses a novel agent’s recent episodes & its trajectory on a current episode thus far to infer a posterior and make predictions regarding the agent’s future behavior.

To probe ToM in ToMnet, the authors introduce a species of agent with both a limited field of view and a subgoal. For example, an agent that can only see the squares adjacent to it must first navigate to a “star” in the grid-world before finally navigating to the blue box to achieve maximum reward. In certain environments, the agent passes the blue box early in its initial search and so knows directly where to go after finding the star, even if the blue box is not visible to the agent from the star’s location. The test comes when the experimenter now swaps the locations of the boxes while the agent is on the star and the boxes are out of view. While the agent is blind to the swap, ToMnet is not. And so, the analogous Sally-Anne test arises: Will ToMnet not recognize that the swap occurred outside of the agent’s field of view, and thus mistakenly predict that it will move toward the new location of the blue box? Or, will ToMnet recognize that the agent maintains the false belief that no swap has occurred, and thus correctly predict that it will move toward the old location of the blue box?

ToMnet predicts behavior reflecting the agent’s false belief, successfully passing the test. Importantly, this finding is supplemented with results that show that ToMnet is sensitive to how different fields of view make an agent “differentially vulnerable to acquire false beliefs,” and that ToMnet still passes the test even if it had never seen swap events during training. Thus, ToMnet “learns that agents can act based on false beliefs,” providing a compelling proof-of-concept for Machine Theory of Mind.

# Insights on representational similarity in neural networks with canonical correlation

For this week’s journal club, we covered “Insights on representational similarity in neural networks with canonical correlation” by Morcos, Raghu, and Bengio, NeurIPS, 2018.  To date, many different convolutional neural networks (CNNs) have been proposed to tackle the object recognition problem, including Inception (Szegedy et al., 2015), ResNet (He et al., 2016), and VGG (Simonyan and Zisserman, 2015). These networks have vastly different architectures but all achieve high accuracy. How can this be the case? One possibility is that although the architectures vary, the representations (i.e., the way these networks encode information about the objects of natural images) are very similar.

To test this, we first need a metric of similarity. One approach has been “representation similarity analysis” (RSA) (Kriegskorte et al., 2008) which relies on distance matrices to test if two representations are similar. One potential problem with RSA is that some dimensions of the representations may be “noisy” (i.e., dimensions that do not pertain to encoding the input information). For example, during training, some dimensions of the activity of CNN neurons may vary substantially across epochs but are not relevant to encoding object information. These dimensions could mask the signal of relevant dimensions when analyzing a distance matrix.

One way to avoid this is to try to directly identify the relevant dimensions, allowing us to ignore the noisy dimensions. The authors relied on an old but trusted method called canonical correlation analysis (CCA), which was developed way back in the 1930s (Hotelling, 1936)! CCA has been a handy tool in computational neuroscience, relating the activity of neurons across two populations (Semedo et al., 2014) as well as relating population activity to the output of model neurons (Susillo et al., 2015). Newer methods have been developed that are more appropriate for various problems. These include partial least squares (Höskuldsson, 1988), kernel CCA (Hardoon et al., 2004), as well as a method I developed for my own work called distance covariance analysis (DCA) (Cowley et al., 2017).  The common thread among all of these methods is that they identify dimensions that encode similar information among two or more datasets.

Overview of CCA. CCA is a close relative to linear regression, but whereas linear regression aims at prediction, CCA focuses on correlation—and thus is most suitable for cases in which the investigator seeks intuition of the data.  Given two datasets (e.g., $\mathbf{X} \in \mathcal{R}^{k \times N} \textrm{ and } \mathbf{Y} \in \mathcal{R}^{p \times N}$, both centered, where $N$ is the number of samples), CCA seeks to identify a pair of dimensions $\mathbf{u} \in \mathcal{R}^k \textrm{ and } \mathbf{v} \in \mathcal{R}^p$ such that the Pearson’s correlation between the projections $\mathbf{u}^T \mathbf{X} \textrm{ and } \mathbf{v}^T \mathbf{Y}$ is the largest. In other words, CCA identifies linear combinations of the variables in $\mathbf{X} \textrm{ and } \mathbf{Y}$ that are the most linearly-related. CCA need not stop there—it can identify pairs of dimensions that monotonically decrease in correlation. In this way, we can ignore the dimensions with the smallest correlations (which likely are spurious). One fun fact about CCA is that any two identified dimensions in $\mathbf{X}$ are uncorrelated: $\textrm{corr}(\mathbf{u}_i^T \mathbf{X}, \mathbf{u}_j^T \mathbf{X}) = 0 \textrm{ for } i \neq j$ (and the same for $\mathbf{v}_i, \mathbf{v}_j$). This is different from PCA, whose identified dimensions are both uncorrelated and orthogonal.  The uncorrelatedness of CCA dimensions ensures that we do not include dimensions that contain redundant information. (Implementation details: CCA is solved with singular-value decomposition, but be sure to use a regularized form akin to ridge regression—it was unclear if the authors used regularization).

Figure 1. Generalizing networks converge to more similar solutions than memorizing networks.

Onto the results. The authors proposed a distance metric of CCA to uncover some intuitive characteristics about deep neural networks. First, they found that different initializations of generalizing networks (i.e., networks trained on labeled natural images) were more similar than different initializations of memorizing networks (i.e., networks trained on the same dataset with randomly-shuffled labels). This is expected, as natural labels likely put a constraint on generalizing networks. Interestingly, when comparing generalizing and memorizing networks (Fig. 1, yellow line, ‘Inter’), they found that generalizing and memorizing networks were as similar as different memorizing networks trained on the same fixed dataset. This suggests that overfitted networks converge on very different solutions for the same problem. Also interesting was that earlier layers of both generalizing and memorizing networks seem to converge on similar solutions, while the later layers diverged. This suggests that earlier layers rely more on the structure of natural images while the later layers rely more on the structure of the labels. Second, they found that wider networks (i.e., networks with more filters per layer) converge to more similar solutions than those of narrower networks.  They argue that this supports the “lottery-ticket” hypothesis that wider networks are more likely to have a sub-network that fits the desired function.  Finally, they found that networks trained with different initializations and learning rates on the same problem converge to different groups of solutions. This highlights the need to try different initializations when training neural networks.

This paper left me thinking a lot about representation in the visual cortex of the brain. Does visual cortical population activity have stable and “noisy” dimensions?  If we reduced the number of visual cortical neurons per visual cortical area (either via lesion or pharmacological intervention) in a developing animal, would these animals have severe perceptual deficits (i.e., their visual system did not have the right lottery ticket when developing)?  Lastly, it seems plausible that humans start out with different initializations of their visual cortices—does that suggest different humans have converged on different solutions to solving visual perception?  If so, it suggests that inter-subject variability may be larger than previously thought.

# The Loss-Calibrated Bayesian

By Farhan Damani

In lab meeting this week, we discussed loss-calibrated approximate inference in the context of Bayesian decision theory (Lacoste-Julien et. al. 2011, Cobb et. al. 2018). For many applications, the cost of an incorrect prediction can vary depending on the nature of the mistake. Suppose you are in charge of controlling a nuclear power plant with an unknown temperature $\theta$. We observe indirect measurements of the temperature $D$, and we use Bayesian inference to infer a posterior distribution over the temperature given the observations $p(\theta|D)$. The plant is in danger of over-heating and as the operator, you can either keep the plant running or shut it down. Keeping the plant running while the plant’s temperature exceeds a critical threshold $T_{\text{critic}}$ will cause a nuclear meltdown, incurring a huge loss $L(\theta > T_{\text{critic}}, \text{'on'})$ while shutting off the plant for benign temperatures incurs a minor loss $L(\theta < T_{\text{critic}}, \text{'off'})$

In figure 1 we observe the true posterior $p(\theta|D)$ is multi-modal. Our suite of approximate inference techniques characterize general properties of the posterior, attempting to match either the first or second moment of $p$. Both strategies underestimate the posterior mass for the safety-critical region. Instead, the dash-dotted line, while failing to characterize typical properties of the posterior, results in the same decision as the true posterior by optimizing for task-specific utility. The point is the “best” approximate posterior is subjective, and therefore, we should tailor our inferential resources to find an approximation that is well suited for the decision task at hand.

Bayesian decision theory extends the Bayesian paradigm by including a task-specific utility function $U(\theta, a)$, which tells us the utility of taking action $a \in \mathcal{A}$ when the world is in state $\theta$. According to this view, the optimal action minimizes the posterior risk: $\underset{a}{\arg \min} \text{ } \mathcal{R}(a) = \mathbb{E}_{p(\theta|D)}[U(\theta, a)]$. Typically, this is computed using a 2-step procedure. First approximate the posterior $p(\theta|D)$ with a $q(\theta|D)$ and then minimize the risk under $q$. This approach, however, assumes our approximate $q$ measures properties of the posterior that we care about. This by definition requires our utility function, so therefore, we should jointly optimize the approximate posterior with the action that minimizes the posterior risk. Cobb et. al. 2018 show how to derive a variational lower bound that depends on a task-specific utility function. In their setup, they show that minimizing the KL divergence between an approximate posterior q and a calibrated posterior scaled by the utility function results in the standard ELBO loss plus an additional utility-dependent regularization term. This formulation is amenable to stochastic optimization, allowing for the practical deployment of this framework to supervised learning.

# An orderly single-trial organization of population dynamics in premotor cortex predicts behavioral variability

This week we read some new work from Shaul Druckmann and Karel Svoboda’s groups (https://www.biorxiv.org/content/early/2018/07/25/376830). They analyzed simultaneously recorded activity from the anterior lateral motor cortex (ALM; perhaps homologous to a premotor area in primates) from mice performing a delayed discrimination task (either a somatosensory pole detection task or an auditory tone discrimination task). They analyzed 55 sessions with 6-31 units on each session. Given the strong task epoch dependent responsiveness of most cells in the population, they fit a switching linear dynamical system (sLDS) to the data using expectation-maximazation. However, in their model, the switch times were dictated by the task structure, making the model significantly easier to fit than a sLDS with unconstrained switch times. They called their model a epoch-dependent linear dynamical system (EDLDS).

They used leave-one-neuron-out cross validation (i.e. compute the posterior on test trials without including one neuron’s activity, and then predict that neuron’s activity from the posterior) to test the model fit and found that it often fit the data about as well a sLDS that could flexibly assign the timing of the same number of switch events and significantly outperformed Gaussian process factor analysis.

The model defines a low-dimensional latent space to which they apply several analyses. First, they applied linear discriminate analysis (LDA) to decode the animal’s choice on each trial and show that it outperforms LDA applied to the activity of the full neural population, even when regularization is included.

Next, they applied principal components analysis (PCA) to the latent activity and the full neural population activity to visualize the dominant temporal trajectories within each space. PC projections of the full population activity showed sharp temporal transitions between task epochs and a random spatial ordering across trials, while PC projections of the latent activity showed smooth temporal transitions and strongly ordered dynamics.

They quantified the “orderliness” of each representation by computing the consistency of the trial-ranked value of the LDA projection across time to confirm greater orderliness within the latent space than the full neural activity. They also found that decode analyses to previous trial outcome or choice on error trials using the latent activity outperformed the same analyses applied to the full neural activity.

In summary, the dynamical nature of the EDLDS provides a smooth, de-noised portrait of the temporal dynamics present in the data but that might not easily reveal itself with standard analyses.

# Motor Cortex Embeds Muscle-like Commands in an Untangled Population Response

This week we discussed a recent publication by Abigail Russo from Mark Churchland’s lab. The authors examined primary motor cortex (M1) population responses and EMG activity from primates performing a novel cycling task. The task required the animal to rotate a pedal (like riding a bicycle, except with one’s arm) a fixed number of rotations. There were several task conditions, including pedaling forward and backward.

The authors found that while M1 neural activity contained components of muscle activity (e.g. trial-averaged EMG activity could be accurately predicted by linear combinations of the trial-averaged neural activity) the dominant structure present in the neural population response was not muscle-like. They came to this conclusion by examining the top principal component projections of the neural activity and the EMG activity, where they found that the former co-rotated for forward and backward pedaling while the later counter-rotated. This discrepancy in rotation direction is inconsistent with the notation that neural activity encodes force or kinematic commands.

Based on this observation, the authors proposed a novel hypothesis: the dominant, non-muscle-like activity patterns in M1 exist so as to “detangle” the representation of the muscle-like activity patterns. A rough analogy would be something like a phonograph, whose dominant dynamics are rotating, but which only serve to lay out a coding direction (normal to the rotation) which can be “read-out” in a simple way by the phonograph needle. The authors show with network models that a “detangled” response has the desirable property of noise-robustness.

The following toy-model from the paper illustrates the idea of “tangled-ness.” Imagine that a population of neurons must generate output 1 and output 2 depicted below. If the population represented those signals directly (depicted in the leftmost phase portrait) in a 2-dimensional space, the trajectory would trace out a “figure-8,” a highly tangled trajectory and one that cannot be generated by an autonomous dynamical system (which the authors assume more-or-less accurately caricatures the dynamical properties of M1). In order to untangle the neural representation (depicted in the rightmost phase portrait), the neural activity needs to add an extra, third dimension which resides in the null space of the output. Now, these dynamics can be generated autonomously and a linear projection of them can generate the output.

The authors directly compute a measure of tangling within the neural data and the EMG data. The metric is the following:

$Q(t) = \text{max}_{t^\prime} \frac{|| \dot x_t - \dot x_{t^\prime}||}{|| x_t - x_{t^\prime} || + \epsilon}$.

It can be summarized in the following way: identify two moments in time where the state is very similar, but where the derivative of the state is very different. Such points are exemplified by the intersection of the “figure-8” trajectory above, since the intersection is two identical states with very different derivatives. Across multiple animals, species and motor tasks the authors found a consistent relationship: neural activity is less tangled than EMG activity (as shown below). The authors note that a tangled EMG response is acceptable or perhaps even desirable, since EMG reflects incoming commands and therefore does not need to abide by the requirements that an autonomous dynamical system (like M1) does.

Based on these analyses, the authors conclude that the dominant signals present in M1 population activity principally perform a computational role, by untangling the representation of muscle-like signals that can be read-out approximately linearly by the muscles.

# Automatic Differentiation in Machine Learning: a Survey

 In lab meeting this week we discussed Automatic Differentiation in Machine Learning: a Survey, which addresses the general technique of autodifferentiation for functions expressed in computer programs. The paper initially outlines three main approaches to the computation of derivatives in computer programs: 1) manually working out derivates by hand and coding the result. 2) using numerical approximations to derivates in the form of assessing function values at small discrete steps, and 3) computer manipulation of symbolic mathematic expressions that automatically generates differential expressions via standard calculus rules. The limitations of approach 1 are that derivatives can involve the calculation of a large number of terms (expression swell) that are both manually cumbersome to deal with and lend themselves to easy algebraic mistakes. The disadvantage of approach 2 is a finite difference approach to numerical differentiation inevitably involves round-off errors and is sensitive to step-size. The disadvantage to 3 is that this approach is computationally cumbersome and often standard calculus rules involve forms of derivates that still need to be manually reduced. In contrast to these methods, the paper introduces autodifferentiation as a set of techniques that operates at the elementary computation level via step-wise implementation of the chain rule. The assessment of a function in a computer program involves an execution of procedures that can be represented via a computational graph where each node in the graph is an intermediary temporary variable. Computers execute this trace of elementary operations via intermediary variables and this procedure can be utilized efficiently to additionally concurrently calculate a derivative. The paper outlines two general approaches to autodifferentiation: forward mode and backward mode. In forward mode, a derivative trace is evaluated alongside the computers execution of the functional trace, which intermediary values from the functional evaluation trace are used in concert with the derivative trace calculation. This execution can be conveniently realized by extending the representation of the real numbers to include a dual component. Analogous to imaginary numbers, this dual component is denoted with an $\epsilon$, where $\epsilon^2 = 0$. That is, each number in the execution of the function on a computer is now extended to include tuple which includes this additional dual component whose expressions evaluate using intermediate variables alongside the evaluation of the function.  Using this simple rule of the dual number ($\epsilon^2 = 0$), evaluations implement the notion of the chain rule in differentiation to automatically calculate derivatives at every computational step along the evaluation trace. In executing this forward mode autodifferentiation, a derivative trace is initially ‘seeded’ with an input derivative vector as a function is evaluated with an initial input. Whereas symbolic differentiation can be thought of as the mapping of $f(x)$ to $J_{f}(x)$, forward autodifferentiation is the mapping of $f(c)$ to $J_{f}(c) \cdot \vec{x}$. As such, considering a functional mapping from dimension $n$ to $m$, forward mode requires n trace evaluations to calculate the entire Jacobian, each time the seeded derivative vector is an index for a particular dimension in n. Thus, the calculation of the Jacobian can be completed in approximately n times the total number of operations in an evaluation of $f(c)$. Backward mode, by contrast, computes $f(c)$ to $\vec{y} \cdot J_{f}(c)$. That is, one evaluation of the derivative trace in backward mode can similarly compute a Jacobian vector dot product, in this case extracting a row of the Jacobian (as opposed to a column as in forward mode). In this case, the calculation of the full Jacobian can be completed in approximately m times the total number of operations in an evaluation of $f(c)$.  A disadvantage to backward mode autodifferentiation is that each intermediate variable along a function evaluation trace must be stored in memory for use during the backward execution of the derivative trace.

# Stochastic variational learning in recurrent spiking networks

This week we discussed Stochastic variational learning in recurrent spiking networks by Danilo Rezende and Wolfram Gerstner.

## Introduction

This paper brings together variational inference (VI) and biophysical networks of spiking neurons. The authors show:

1. variational learning can be implemented by networks of spiking neurons to learn generative models of data,
2. learning takes the form of a biologically plausible learning rule, where local synaptic learning signals are augmented with a global “novelty” signal.

One potential application the authors mention is to use this method to identify functional networks from experimental data. Through the course of the paper, some bedrock calculations relevant to computational neuroscience and variational inference are performed. These include computing the log likelihood of a population of spiking neurons with Poisson noise (including deriving the continuum limit from discrete time) and derivation of the score function estimator. I’ve filled in some of the gaps in these derivations in this blog post (plus some helpful references I consulted) for anyone seeing this stuff for the first time.

## Neuron model and data log likelihood

The neuron model used in this paper is the spike response model which the authors note (and we discussed at length) is basically a GLM. The membrane potential of each unit in the network is described by the following equation:

$\mathbf{u} = \mathbf{w \phi(t)} + \mathbf{\eta(t)}$

where $\mathbf{u}$ is a $N$-dimensional vector, $\mathbf{\phi}(t)$ are exponentially filtered synaptic currents from the other neurons, $w$ is a $N \times N$ matrix of connections and $\mathbf{\eta}(t)$ is an adaptation potential that mediates the voltage reset when a neuron spikes (this can be thought of as an autapse).

Spikes are generated by defining an instantaneous firing rate $\rho(t) = \rho_0 \text{exp}[\frac{\mathbf{u} - \theta}{\Delta u}]$ where $\theta$, $\Delta u$ and $\rho_0$ are physical constants. The history of all spikes from all neurons ​is denoted by $\mathbf{X}$. We can define the probability the $i^{th}$ neuron producing a spike in the interval $[t,t+\Delta t]$, conditioned on the past activity of the entire network $\mathbf{X}(0...t)$ as $P_i(t_i^f \in [t,t+\Delta t] | \mathbf{X}(0...t)) \approx \rho_i(t) \Delta t$and the probability of not producing a spike as $P_i(t_i^f \notin [t,t+\Delta t] | \mathbf{X}(0...t)) \approx 1 - \rho_i(t) \Delta t$.

Aside: in future sections, the activity of some neurons will be observed (visible) and denoted by a super- or subscript $\mathcal{V}$ and the activity of other neurons will be hidden and similarly denoted by $\mathcal{H}$.

We can define the joint probability of the entire set of spikes as:

$P(X(0...T)) \approx \Pi_{i \in \mathcal{V} \cup \mathcal{H}} \Pi_{k_i^s} [\rho_i(t^f_{k_i^s})\Delta t] \Pi_{k_i^{ns}} [1 - \rho_i(t^f_{k_i^{ns}})\Delta t]$

The authors re-express this in the continuum limit. A detailed explanation of how to do this can be found in Abbott and Dayan, Chapter 1, Appendix C. The key is to expand the log of the “no spike” term into a Taylor series, truncate at the first term and then exponentiate it:

$\text{exp} \hspace{1mm} \text{log} (\Pi_{k_i^{ns}} [1 - \rho_i(t^f_{k_i^{ns}})\Delta t]) = \text{exp} \sum_{k_i^{ns}} \text{log} [1 - \Delta t \rho_i(t_{k_i^{ns}}^f)] \approx \text{exp} \sum_{k_i^{ns}} - \Delta t \rho_i(t_{k_i^{ns}}^f)$.

As $\Delta t \rightarrow 0$, this approximation becomes exact and the sum across bins with no spikes becomes an integral across time,

$P(\mathbf{X}(0...T)) = \Pi_{i \in \mathcal{V} \cup \mathcal{H}} [\Pi_{t_i^f}\rho_i(t_i^f) \Delta t] \text{exp}(- \int_0^T dt \rho_i(t))$

from which we can compute the log likelihood,

$\text{log} P(\mathbf{X}(0...T) = \sum_{i \in \mathcal{V} \cup \mathcal{H}} \int_0^T d\tau [\text{log} \rho_i(\tau) \mathbf{X}_i(\tau) - \rho_i(\tau)]$

where we use $\mathbf{X}_i(\tau)$ to identify the spike times.

They note that this equation is not a sum of independent terms, since $\rho$ depends on the entire past activity of all the other neurons.

Figure 2 from the paper show the relevant network structures we will focus on. Panel C shows the intra- and inter- network connectivity between and among the hidden and visible units. Panel D illustrates the connectivity for the “inference” $\mathcal{Q}$ network and the “generative” $\mathcal{M}$ network. This structure is similar to the Helmholtz machine of Dayan (2000) and the learning algorithm will be very close to the wake-sleep algorithm used there.

## Variational Inference with stochastic gradients

From here, they follow a pretty straightforward application of VI. I will pepper my post with terms we’ve used/seen in the past to make these connections as clear as possible. They construct a recurrent network of spiking neurons where the spiking data of a subset of the neurons (the visible neurons or “the observed data”) can be explained by the activity of a disjoint subset of unobserved neurons (or a “latent variable” ala the VAE). Like standard VI, they want to approximate the posterior distribution of the spiking patterns of the hiding variables (like one would approximate the posterior of a latent variable in a VAE) by minimizing the KL-divergence between the true posterior and an approximate posterior $q$:

$KL(q;p) = \int \mathcal{D} \mathcal{X_H} q(\mathcal{X_H} | \mathcal{X_V}) \text{log} \frac{q(\mathcal{X_H} | \mathcal{X_V})}{p(\mathcal{X_H} | \mathcal{X_V})}$

$= \langle \text{log} q(\mathcal{X_H} | \mathcal{X_V}) - \text{log} p(\mathcal{X_H,X_V}) \rangle_{q(\mathcal{X_H} | \mathcal{X_V})} + \text{log} p(\mathcal{X_V})$

$= \langle \mathcal{L^Q} - \mathcal{L^M} \rangle_{q(\mathcal{X_H} | \mathcal{X_V})} + \text{log} p(\mathcal{X_V})$.

The second term is the data log likelihood. The first term, $\mathcal{F}$, is the Helmholtz free energy and like always in VI it represents an upper bound on the negative log likelihood. We can therefore change our optimization problem to minimize this function with respect to the parameters of $q$ (the approximate posterior) and $p$ (the true posterior). We do this by computing the gradients of $\mathcal{F}$ with respect to the $\mathcal{Q}$ (inference) network and the $\mathcal{M}$ (generative) network​. First, the $\mathcal{M}$ network, since it’s easier:

$\dot{w_{ij}^\mathcal{M}} = -\mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \mathcal{F} = \mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \langle \mathcal{L^Q} - \mathcal{L^M} \rangle_q = \mu^\mathcal{M} \langle \nabla_{ij}^\mathcal{M} \mathcal{L^M} \rangle_q \approx \mu^\mathcal{M} \nabla_{w_{ij}^\mathcal{M}} \hat{\mathcal{L}}^\mathcal{M}$

where $\hat{\mathcal{L}}^\mathcal{M}$ is a point estimate of the complete data log likelihood of the generative model. They will compute this with a Monte Carlo estimate. The gradient of the complete data log likelihood with respect to the connections is:

$\nabla_{w_{ij}^\mathcal{M}} \hat{\mathcal{L}}^\mathcal{M} = \nabla_{w_{ij}^\mathcal{M}} \text{log} p(\mathcal{X_H, X_V}) = \sum_{k \in \mathcal{V} \cup \mathcal{H}} \int_0^T d\tau \frac{\partial \text{log} \rho_k(\tau)}{\partial w_{ij}^\mathcal{M}} [\mathbf{X}_k(\tau) - \rho_k(\tau)]$.

Here they used the handy identity: $\frac{\partial f(x)}{\partial x} = \frac{\partial[\text{log} f(x)]}{\partial x} f(x)$

The derivative of the firing rate function can be computed with the chain rule, $\frac{\partial [\text{log} \rho_k(\tau)]}{\partial w_{ij}^\mathcal{M}} = \delta_{ki} \frac{g^\prime(u_k(\tau))}{g(u_k(\tau))} \phi_j(\tau)$.

This equation for updating the weights using gradient ascent is purely local, taking the form of a product between a presynaptic component, $\phi_j(\tau)$, and a postsynaptic term $\frac{g^\prime(u_k(\tau))}{g(u_k(\tau))} [\mathbf{X}_k(\tau) - \rho_k(\tau)]$.

They also compute the gradient of $\mathcal{F}$ with respect to the $\mathcal{Q}$ network, $-\mu^\mathcal{Q} \nabla_{w_{ij}^\mathcal{Q}} \mathcal{F}$. To do this, I revisited the 2014 Kingma and Welling paper, where I think they were particularly clear about how to compute gradients of expectations (i.e. the score function estimator). In section 2.2 they note that:

$\nabla_\phi \langle f(z) \rangle_{q_\phi (z)}= \langle f(z) \nabla_\phi \text{log} q_\phi (z) \rangle$.

A cute proof of this can be found hereThis comes in handy when computing the gradient of $\mathcal{F}$ with respect to the connections of the $\mathcal{Q}$ network:

$\nabla_{w_{ij}^\mathcal{Q}} \mathcal{F} = \langle \mathcal{F} \nabla_{w_{ij}^\mathcal{Q}} \text{log} q(\mathcal{X_H} | \mathcal{X_V}) \rangle = \langle \mathcal{F} \nabla_{w_{ij}^\mathcal{Q}} \mathcal{L^Q} \rangle \approx \hat{\mathcal{F}} \nabla_{w_{ij}^\mathcal{Q}} \hat{\mathcal{L}}^\mathcal{Q}$.

Here again we compute Monte Carlo estimators of $\mathcal{F}$ and $\mathcal{L^Q}$. $\nabla_{w_{ij}^\mathcal{Q}} \hat{\mathcal{L}}^\mathcal{Q}$ takes the exact same form as for the $\mathcal{M}$ network, but the neat thing is that $\nabla_{w_{ij}^\mathcal{Q}} \mathcal{F}$ contains a term in front of the gradient of the estimate of the log likelihood, $\hat{\mathcal{F}}$. This is a global signal (opposed to the local signals that are present in $\hat{\mathcal{L}}^\mathcal{Q}$) that they interpret as a novelty or surprise signal.

The authors note that the stochastic gradient they introduced has been used extensively in reinforcement learning and that its variance is prohibitively high. To deal with this (presumably following the approach others have developed in RL, vice versa) they adopt a simple, baseline removal approach. They subtract the mean $\bar{\mathcal{F}}$ of the free energy estimate $\hat{\mathcal{F}}$ calculated as a moving average across several previous batches of length $T$ from the current value $\hat{\mathcal{F}}(T)$. They replace the free energy in the gradient for the $\mathcal{Q}$ network with a free energy error signal, $\hat{\mathcal{F}}(T) - \bar{\mathcal{F}}$. Below, the log likelihood of the generated data when this procedure is used is plotted against a naively trained network, showing that this procedure works better than the naive rule.

## Numerical results

Details of their numerical simulations:

• Training data is binary arrays of spike data.
• Training data comes in batches of 200 ms with 500 batches sequentially shown to the network (100 s of data).
• During learning, visible neurons are forced to spike like the training data.
• Log likelihood of test data was estimated with importance sampling. Given a generative model with density $p(x_v,x_h)$, importance sampling allows us to estimate the density of $p(x_v)$:

$p(x_v) = \langle p(x_v | x_h) \rangle_{p(x_h)} = \langle p(x_v | x_h) \frac{p(x_h)}{q(x_h|x_v)} \rangle_{q(x_h|x_v)}$

$= \langle \text{exp}[\text{log}p(x_v,x_h) - \text{log} q(x_h|x_v)] \rangle_q = \langle \text{exp}[-\hat{\mathcal{F}}(x_v,x_h)]\rangle_{q(x_v|x_h)}$

Using this equation, they estimate the log likelihood of the observed spike trains by sampling several times from the $\mathcal{Q}$ network and computing the average of the free energy. They use 500 samples of duration 100 from the $\mathcal{Q}$ network to compute this estimate.

Here is an example of training with this method with 50 hidden units using the “stairs” dataset. C shows that the network during the “sleep phase” (running in generative mode) forms a latent representation of the stairs in the hidden layers. Running the network in “inference mode” (wake, in the wake-sleep parlance), when the $\mathcal{Q}$ network synapses are being used, the model is capable of performing inference on the causes of the incoming data (the visible neurons are being driven with the data).

## Role of the novelty signal

To examine the role of the novelty signal, they train a network to perform a maze task. Each maze contains 16 rooms where each room is a 28×28 pixel greyscale image of a MNIST digit. Each room is only accessible from a neighboring room. Pixel values were converted into firings rates from 0.01 to 9 Hz. In the test maze (or control maze), some of the rooms of the training maze were changed. The network had 28×28 visible units and 30 hidden units. These were recurrent binary units. Data were generated from random trajectories of 100 time steps in the target maze. Each learning epoch was 500 presentations of the data batches.

Below, (bottom left) they plotted the slow moving average of the free energy $\bar{\mathcal{F}}$ as a function of the amount of observed data for the target maze (blue) and the same model when it was “teleported” to the control maze every 500 s. In the beginning of learning, the free energy is the same so the model cannot distinguish between them. As learning proceeds, the model identifies the test as unfamiliar (higher free energy).

Bottom right shows the free energy error signal for the sample trajectory in A. It fluctuates near zero for the learned maze but deviates largely for the test maze. We can see at (3,3) the free energy signal really jump up, meaning that the model identifies this as different from the target.

To conclude, the authors speculate that a neural correlate of this free energy error signal should look like an activity burst when an animal traverses unexpected situations. Also, they expect to see a substantial increase in the variance of the changes in synaptic weights when moving from a learned to a unfamiliar maze due to the change in the baseline of surprise levels.

# Deep kernel with recurrent structure

This week in lab meeting, we discussed the following work:

Learning Scalable Deep Kernels with Recurrent Structure
Al-Shedivat, Wilson, Saatchi, Hu, & Xing, JMLR 2017

This paper addressed the problem of learning a regression function that maps sequences to real-valued target vectors. Formally, the sequences of inputs are vectors of measurements $\overline{\mathbf{x}_1}=[\mathbf{x}^1]$$\overline{\mathbf{x}_2}=[\mathbf{x}^1,\mathbf{x}^2]$, …, $\overline{\mathbf{x}_n}=[\mathbf{x}^1,\mathbf{x}^2,\dots,\mathbf{x}^n]$, where $\mathbf{x}^i\in\mathcal{X}$, are indexed by time and would be of growing lengths. Let $\mathbf{y}=\{\mathbf{y}_i\}_{i=1}^n$, $\mathbf{y}_i\in\mathbb{R}^d$, be a collection of the corresponding real-valued target vectors. Assuming that only the most recent $L$ steps of a sequence are predictive of the targets, the sequences can be written as $\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]$, $i=1,...,n$. The goal is to learn a function, $f:\mathcal{X}^L\rightarrow\mathbb{R}^d$, based on the available data. The approach to learning the mapping function is to use Gaussian process.

The Gaussian process (GP) is a Bayesian nonparametric model that generalizes the Gaussian distributions to functions. They assumed the mapping function $f$ is sampled from a GP. Then a standard GP regression would be performed to learn $f$ as well as the posterior predictive distribution over $\mathbf{y}$. The inputs to the GP are $\{\overline{\mathbf{x}_i}\}_{i=1}^n$ and the output function values are  $\{{\mathbf{y}_i}\}_{i=1}^n$. However, standard kernels, e.g. RBF, Matern, or periodic kernel, are not able to take input variables with varying length. Moreover, standard kernels won’t fully utilize the structure in an input sequence.  Therefore, this paper proposed to use recurrent models to convert an input sequence $\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1}, \mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]$ to a latent representation $\mathbf{h}^L_i\in\mathcal{H}$, then map $\mathbf{h}_i^L$ to $\mathbf{y}_i$. Formally, a recurrent model expresses the mapping $f:\mathcal{X}^L\rightarrow \mathbb{R}^d$ as:

$\mathbf{y}_i=\psi(\mathbf{h}_i^L)+\epsilon^t, \mathbf{h}_i^t=\phi(\mathbf{h}^{t-1}_i+\mathbf{x}^{i-L+t})+\delta^t,t=1,...,L$

Combining the recurrent model with GP, the idea of deep kernel with recurrent model is that let $\phi:\mathcal{X}^L\rightarrow\mathcal{H}$ be a deterministic transformation, which is a recurrent model, mapping $\{\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]\}_{i=1}^n$ to $\{\mathbf{h}^L_i\}_{i=1}^n$. Sequential structure in $\{\overline{\mathbf{x}_i}\}_{i=1}^n$ is integrated into the corresponding latent representations. Then a standard GP regression can be performed with input $\{\mathbf{h}^L_i\}_{i=1}^n$ and output $\{\mathbf{y}_i\}_{i=1}^n$.

If we denote $\tilde{k}:(\mathcal{X}^L)^2\rightarrow \mathbb{R}$ to be the kernel over the sequence space, and $k:\mathcal{H}^2\rightarrow \mathbb{R}$ to be the kernel defined on the latent space $\mathcal{H}$, then

$\tilde{k}(\overline{\mathbf{x}},\overline{\mathbf{x}}')=k(\phi(\overline{\mathbf{x}}),\phi(\overline{\mathbf{x}}'))=k(\mathbf{h}^L,{\mathbf{h}^L}')$

where $k(\mathbf{h}^L,{\mathbf{h}^L}')$ denotes the covariance between $\mathbf{y}$ and $\mathbf{y}'$ in GP.

By now deep kernel with general recurrent structure has been constructed. With respect to recurrent model, there are multiple choices. Recurrent neural networks (RNNs) model recurrent processes by using linear parametric maps followed by nonlinear activations. A major disadvantage of the vanilla RNNs is that their training is nontrivial due to the so-called vanishing gradient problem: the error back-propagated through t time steps diminishes exponentially which makes learning long-term relationships nearly impossible. Thus in their paper, they chose the long short-term memory (LSTM) mechanism as the recurrent model. LSTM places a memory cell into each hidden unit and uses differentiable gating variables. The gating mechanism not only improves the flow of errors through time, but also, allows the the network to decide whether to keep, erase, or overwrite certain memorized information based on the forward flow of inputs and the backward flow of errors. This mechanism adds stability to the network’s memory. Therefore, their final model is a combination of GP and LSTM, named as GP-LSTM.

To solve the model, they needed to infer two sets of parameters: the base kernel hyperparameters, $\theta$, and the parameters of the recurrent neural transformation, denoted $W$. In order to update $\theta$, the algorithm needs to invert the entire kernel matrix $K$ which requires the full dataset. However, once the kernel matrix $K$ is fixed, update of $W$ turns into a weighted sum of independent functions of each data point. One could compute a stochastic update for $W$ on a mini-batch of training points by only using the corresponding sub-matrix of $K$. Hence, they proposed to optimize GPs with recurrent kernels in a semi-stochastic fashion, alternating between updating the kernel hyperparameters, $\theta$, on the full data first, and then updating the weights of the recurrent network, $W$, using stochastic steps. Moreover, they observed that when the stochastic updates of $W$ are small enough, $K$ does not change much between mini-batches, and hence they can perform multiple stochastic steps for $W$ before re-computing the kernel matrix, $K$, and still converge. The final version of the optimization algorithm is given as follows:

To evaluate their model, they applied it to a number of tasks, including system identification, energy forecasting, and self-driving car applications. Quantitatively, the model was assessed on the data ranging in size from hundreds of points to almost a million with various signal-to-noise ratios demonstrating state-of-the-art performance and linear scaling of their approach. Qualitatively, the model was tested on consequential self-driving applications: lane estimation and lead vehicle position prediction. Overall, I think this paper achieved state-of-the-art performance on consequential applications involving sequential data, following straightforward and scalable approaches to building highly flexible Gaussian process.

# Completing the Covariance: Deterministic matrix completion for symmetric positive semi-definite matrices

This week in J. Pillz Happy Hour we discussed the following work:

Deterministic Symmetric Positive Semi-definite Matrix Completion
William E. Bishop and Byron M. Yu, NIPS, 2014

This paper took on the interesting problem of low-rank matrix completion with the twists that (1) the observed entries of the matrix $A$ are both chosen in a deterministic manner and that (2) those entries consist of the union of principal sub-matrices $A_l$ for $= 1,2,... K$. The large body of prior work in this field mostly relied on random sampling, i.e. we get to observe some set of $M$ entries $Y_{i,j} = A_{i,j} + \epsilon$ of $A$ chosen uniformly at random from the set of all entries (i.e. see Candes & Recht, Candes & Plan, etc.). These initial guarantees had the nice property that the recovery guarantees occurred with high probability, meaning that despite the probabilistic nature of the observed entry locations, solving a convex optimization program was virtually guaranteed to recover the original matrix up to a bounded recovery error. Additionally, this recovery error bound correctly deduced the linear dependence on the error in the observations $\epsilon$.  The framework used was also expandable to cases where $M$ linear combinations of  the entries of $A$ were observed, significantly expanding the set of problems the theory was applicable to.

For the case considered by Bishop & Yu, the random sampling framework considered previously was not useful, as there was much more structure in the problem being considered. Taking the motivating example of recovering a covariance matrix, the authors assume that instead of covariances between pairs of variables being observed individually, they instead assume that the covariance matrix between subsets of variables are, in turn, fully observed. They assume that there is no randomness in which variables are observed and instead seek to devise (1) conditions on when the full covariance is recoverable from the observed subset covariances, and (2) an algorithm that take the subset covariances and recover the full covariance. More generally, the authors phrase this as the recovery of symmetric positive semi-definite matrices from principal sub-blocks.

While the conditions are presented first in the paper, the algorithm actually motivated the need for the specific conditions needed. The algorithm devised is simple and elegant, and relies on the fact that a SPSD matrix can be decomposed as $A = CC^T$. Similarly, any principal sub-matrix of the rank-$lattex r$ matrix $A$$A_l$, can be similarly decomposed as $A_l = C_lC_l^T$. Additionally, the $C_l$ matrices can, under a simple rotation, match the matrix $C$ over the indices corresponding to the rows spanned by the sub-matrix $A_l$. These observations lead to the following algorithm; For each block $A_l$, decompose $A_l$ into its eigenvalue decomposition $A_l = \Sigma_l\Lambda_l\Sigma^T_l$ and set $\widehat{C}_l = \Sigma_l\Lambda_l^{1/2}$. These estimates of the $C_l$ sub-matrices will not match on their overlaps, so the must be rotated in order to match. One sub-matrix is chosen as the seed, and all other matrices are, in turn, rotated to match the current set values of $\widehat{C}$ on the overlap. The ordering of which matrices to rotate when is chosen to maximize, at each step, the overlap with all sub-matrices that came before it. The final estimate of $A$ is then just the outer product of the estimate of $C$: $\widehat{A} = \widehat{C}\widehat{C}^T$.

The conditions for this algorithm to work boil down to requiring that (1) The observed sub-matrices are, in fact, principal sub-matrices and that the set of observed entries span all the rows of the matrix and (2) There is “enough” overlap between the principal sub-matrices for the alignment step in the algorithm to work. While (1) is fairly obvious, (2) can be boiled down to needing the index set spanned by any matrix and those that came before it (in the ordering discussed above) to have rank $r$: the same rank as $A$. When these conditions are satisfied, the authors can prove that with no noise their algorithm recovers the true matrix $A$ exactly, and that with noise the algorithm recovers the matrix $A$ up to an error bounded by a function $||A -\widehat{A}||_F^2 \leq K_1\sqrt{r\epsilon} + K_2\epsilon$ where $K_1, K_2$ are fairy involved functions of the eigenvalues of $A$, and $A_l$. Interestingly, they show that without their assumptions, no algorithm can ever recover the matrix $A$ from its principal sub-matrices.

While these results mimic the style of results in the prior literature, the assumptions in this work can be easily checked given the set of observed indices. The authors consider this a benefit when compared to the assumptions in previous matrix completion theory, which cannot be checked when given a set of measurements. Additionally, the authors point out that their assumptions guarantee recovery for all rank-$r$ matrices, indicating that there cannot be an adversarial matrix where the method fails. What the authors sacrifice for these benefits are two-fold. For one, the rank restriction on the overlap  is quite prohibitive, and basically requires that the full complexity of the entire matrix is represented in each overlap. Thus, while there do exist sets of principal sub-matrices that span very few of the entries of $A$ and satisfy the assumptions, these are probably very unlikely sets. Unfortunately, the author’s Theorem 1 specifies that this is an unavoidable aspect of sampling principal sub-matrices. The second detriment is that the proof techniques used for the noise recovery guarantee resulted in an error bond that is sensitive to the specific eigen-spectrum of $A$, opaque in its behavior, and does not capture the true dependence of the perturbation $\epsilon$. The eigenvalue dependence is reminiscent of non-uniform recovery guarantees in the compressive sensing literature (e.g., block-diagonal sensing matrices; Eftekhari, et al.). It would have been nice to have in the paper some discussion of this dependency and if it was a fundamental aspect of the problem or a proof artifact. Overall, however, the simplicity of the algorithm combined with the guarantees make this a very nice paper.