Attention is all you need. (aka the Transformer network)

No matter how we frame it, in the end, studying the brain is equivalent to trying to predict one sequence from another sequence. We want to predict complicated movements from neural activity. We want to predict neural activity from time-varying stimuli. We even want to predict neural activity in one brain area from neural activity in another area. Thus, as computational neuroscientists, we should be intimately familiar with new machine learning techniques that allow us to better relate one sequence to another.

One sequence-to-sequence problem receiving a lot of interest in machine learning is “translation”—converting a sentence in one language (e.g., English) to another (e.g., Polish). The main challenge is getting context correct. For example, consider “bark” for the following sentences: “The bark is loud.” and “The bark is brown.” In Polish, bark would be “szczekanie” and “kora,” respectively, based on context. Machine learning folks have devised some ingenious architectures to tackle this problem. In this post, I’ll talk about one architecture that is unexpected but seems to work quite well: Transformer networks. This architecture was proposed in “Attention is all you need.” by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, NIPS 2017.

High-level intuition

The Transformer network relies on an encoding-decoding approach. The idea is to “encode” the input sequence into a latent representation (i.e., an embedding) that is easier to work with. This embedding (typically a vector of abstract variables) is then “decoded” into an output sequence (e.g., another language). Our problem, then, is to design a “useful” embedding for translation, and build an architecture that can express it.

What properties do we want our embedding to have? One property is a measure of similarity— “tomato” is more similar to “blueberry” than to “chair.” An embedding with this property has been solved with word2vec. Another important property is context—is “tomato” the subject? direct object? is it modified by “tasty” or “rotten”? It turns out we can incorporate context by adding embeddings together. For example, let embeddings e1=”tomato” and e2=”tasty”. Then
e3 = e1 + e2 corresponds to an embedding of a “tasty tomato”. We can keep tacking on more context (e.g., e4 = “subject of sentence”, e5 = “the”, e6 = “belongs to Tom”, etc.). With this embedding property, if we come across an ambiguous word like “it”, we simply find the word that “it” refers to, and add that word to the embedding!

Architecture: Stacked attention layers

The authors came up with a way to efficiently search for context and add that context to the embedding. They use “attention”—in the general sense of the word, not based on neuroscience—as a way to find distant relationships between words. Figure 1 is a stylized version of this computation. The idea is, given a word, compare this word with all other words in the sentence for a specific context. For example, if the word is “sees”, attention will try to find “who sees?” and “sees what?” If attention finds answers to its context questions, these answers will be incorporated into the embedding via a linear combination. If attention does not receive answers, no harm is done—attention will simply pass on the unmodified word embedding (i.e., weights for the other words will be zero).

Figure 1: The attention mechanism. (this is my first time making gifs—sorry for the blurriness!)

One can imagine that having one level of attention may not be the best way to extract the structure of natural language. Just like we learned in sixth grade, sentences have hierarchical structure (did you also have to draw those syntax trees?). So, we should stack levels of attention. The lower levels correspond to easy questions (“Is there an adjective to this noun?”) that likely involve only two or three words of the input sequence. The deeper levels correspond to more nuanced questions (“What is the direct object of this sentence?”) that span most if not all of the input sequence. To achieve this, the authors stacked layers of attention on top of each other. To decode, they basically reverse the process. Figure 2 is an illustration of this process. 

Figure 2: Attention blocks form an encoder stack, which feeds into a decoder stack.

The authors claim this approach minimizes the distance of relating one position in a sequence to another. This is because attention looks at all pairwise interactions, while other architectures (such as recurrent neural networks) sequentially look at the input sequence (making it difficult to compare the first word to the last word). The primary point of confusion while reading the paper was how input sequences were fed into the network (hopefully Fig. 2 clarifies this—each word embedding is transformed in parallel to other word embeddings. If you input N words, each encoder will output N embeddings). This architecture produced state-of-the-art results, and has since been used in many different natural language processing tasks. It also uses many bells and whistles (residual blocks, layer normalization, …)—it will be interesting to see future work argue which components are the most important for this architecture.

Getting back to the neuroscience

As computational neuroscientists, we should take advantage of these architectures (which have been designed and trained with the aid of GPU armies), and use them to help solve our own problems. One aspect of this work (and in natural language processing in general) is the idea of adding two embedding vectors to form a more “context-relevant” embedding. This is missing from our current latent variable models. More importantly, it may be a way in which the brain also encodes context. I also think this type of attention computation will be useful when we are trying to predict sequences of natural behavior, where we cannot make use of task structure (e.g., a delay period, stimulus onset, etc.). Experimentalists are now collecting massive datasets of natural behavior—a perfect opportunity to see how this attention computation holds up!

Lapses in perceptual judgments reflect exploration

In lab meeting this week, we discussed Lapses in perceptual judgments reflect exploration by Pisupati*, Chartarifsky-Lynn*, Khanal and Churchland. This paper proposes that, rather than corresponding to inattention (Wichmann and Hill, 2001), motor error, or \epsilon-greedy exploration as has previously been suggested, lapses (errors on “easy” trials with strong sensory evidence) correspond to uncertainty-guided exploration. In particular, the authors compare empirically-obtained psychometric curves characterizing the performance of rats on a 2AFC task, with predicted psychometric curves from various normative models of lapses. They found that their softmax exploration model explains the empirical data best.

Empirical psychometric curve

Psychometric curves are used to characterize the behavior of animals as a function of the stimulus intensity when performing, for example, 2AFC tasks. They are defined by four parameters:

p(\hat{a}= Right|s) = \gamma + (1-\gamma - \lambda)\phi(s; \mu, \sigma)

where \phi(s; \mu, \sigma) is a sigmoidal curve (we will assume the cumulative normal distribution in what follows); \mu determines the decision boundary and \sigma is the inverse slope parameter. \gamma is the lower asymptote of the curve, while 1- \lambda is the upper asymptote. Together, \{\gamma, \lambda\}, comprise the asymmetric lapse rates for the “easiest” stimuli (highest intensity stimuli).

While Bayesian ideal observer models have been able to motivate the cumulative normal shape of the psychometric curve (as the authors show in their Methods section), this paper aims to compare different normative models to explain the \gamma and \lambda parameters of the psychometric curve.

Inattention model

For their inattention model, the authors assume that, with probability p_{attend}, the rat pays attention to the task-relevant stimulus, while, with probability 1- p_{attend}, it ignores the stimulus and instead chooses according to its bias p_{bias}. That is:

p(\hat{a}=Right|s) = \phi(s; \mu, \sigma) p_{attend} + (1-p_{attend})p_{bias}

or, equivalently, \gamma = p_{bias}(1-p_{attend}) and \lambda = (1-p_{bias})(1-p_{attend}).

Softmax exploration model

In comparison to the inattention model, which assumes that, when the animal is paying attention to the task-relevant variable, it chooses the action corresponding to the maximum expected action-value, \hat{a} =\max_{a}Q(a); the softmax-exploration model assumes that the animal chooses to go right in a 2AFC task according to

p(\hat{a} = Right|Q(a)) = \dfrac{1}{1+exp(-\beta(Q(R)-Q(L))}

where \beta is the inverse temperature parameter and controls the balance between exploration and exploitation and Q(R) - Q(L) is the difference in the expected value of choosing Right compared to choosing Left (see below). In the limit of \beta \rightarrow \infty, the animal will once again choose its action according to \hat{a} =\max_{a}Q(a); while in the low \beta regime, the animal is more exploratory and may choose actions despite them not having the largest expected action values.

In the task they consider, where an animal has to determine if the frequency of an auditory and/or visual cue exceeds a predetermined threshold, the expected action values are Q(R) = p(High|x_{A}, x_{V})r_{R}, where p(High|x_{A}, x_{V}) is the posterior distribution over the category of the stimulus (whether the frequency of the generated auditory and/or visual stimuli are above or below the predetermined threshold) given the animal’s noisy observations of the auditory and visual stimuli, x_{A} and x_{V}. r_{R} is the reward the animal will obtain if it chooses to go Right and is correct. Similarly, Q(L) = p(Low|x_{A}, x_{V})r_{L}.

The authors show that the softmax exploration model corresponds to setting the lapse parameters as

\gamma = \dfrac{1}{1 + exp(\beta r_{L})} and \lambda = \dfrac{1}{1+exp(\beta r_{R})}

Model Comparison

The authors compare the empirically obtained psychometric curves with those predicted by the inattention and exploration models for two experimental paradigms:

  1. Matched vs Neutral Multisensory stimuli: in this experiment, rats are either presented with auditory and visual stimuli of the same frequency (‘matched’), or with auditory and visual stimuli, where one of the channels has a frequency close to the category threshold, so is, in effect, informationless (‘neutral’). The idea here is that both matched and neutral stimuli have the same ‘bottom-up salience’, so that the p_{attend} parameter is the same for both matched and neutral experiments in the inattention model. By contrast, there is no such restriction for the softmax exploration model, and there is a different \beta exploration parameter for the matched and neutral experiments. The authors find that the psychometric curves corresponding to the softmax model resemble the shape of the empirically obtained curves more closely; and that BIC/AIC are lower for this model.
  2. Reward manipulations: whereas the reward for choosing left or right was previously equal, now the reward for choosing right (when the frequency is above the predetermined threshold) is either increased or decreased. Again, the authors find that the psychometric curves corresponding to the softmax model resemble the shape of the empirically obtained curves more closely; and that BIC/AIC are lower for this model.
(Part of) Figure 3 of Pisupati*, Chartarifsky-Lynn*, Khanal and Churchland. The authors compare the shape of the empirically obtained psychometric curves for the matched and neutral experiments with those obtained by constraining the p_{attend} parameter to be the same for both matched and neutral conditions for the inattention model. In contrast, the \beta parameter is allowed to differ for matched and neutral conditions for the exploration model. The resulting psychometric curves for the exploration model more closely resemble those in (d) and also fit the data better according to BIC/AIC.
(Part of) Figure 4 of Pisupati*, Chartarifsky-Lynn*, Khanal and Churchland. Top: The inattention, exploration and fixed error models (the latter of which we did not discuss), make different predictions for the shape of the psychometric curve when the size of the reward on the left and right is modified from equality. Bottom: the empirically determined psychometric curves. Clearly, the empirically determined psychometric curves resemble the curves from the exploration model more closely.

Reflections on this paper

By demonstrating that lapse rates can be manipulated with rewards or by using unisensory compared to multisensory stimuli, this paper highlights that traditional explanations for lapses as being due to a fixed probability of the animal neglecting the task-relevant stimulus; motor-error or \epsilon-greedy exploration are overly-simplistic. The asymmetric effect on left and right lapse rates of modified reward is particularly interesting, as many traditional models of lapses fail to capture this effect. Another contribution that this paper makes is in implicating the posterior striatum and secondary motor cortex as areas which may be involved in determining lapse rates; and better characterizing the role of these areas in lapse behavior than has been done in previous experiments.

This being said, some lab members raised some questions and/or points of concern as we discussed the paper. Some of these points include:

  1. We would have liked to see further justification for keeping p_{attend} the same across the matched and neutral experiments and we question if this is a fair test for the inattention model of lapses. Previous work such as Körding et al. (2007) makes us question whether the animal uses different strategies to solve the task for the matched and neutral experiments. In particular, in the matched experiment, the animal may infer that the auditory and visual stimuli are causally related; whereas in the neutral experiment, the animal may detect the two stimuli as being unrelated. If this is true, then it seems strange to assume that p_{attend} and p_{bias} for the inattention model should be the same for the matched and neutral experiments.
  2. When there are equal rewards for left and right stimuli, is there only a single free parameter determining the lapse rates in the exploration model (namely \beta)? If so, how do the authors allow for asymmetric left and right lapse rates for the exploration model curves of Figure 3e (that is, the upper and lower asymptotes look different for both the matched and neutral curves despite equal left and right reward, yet the exploration model seems able to handle this – how does the model do this?).
  3. How could uncertainty \beta be calculated by the rat? Can the empirically determined values of \beta be predicted from, for example, the number of times that the animal has seen the stimulus in the past? And what were some typical values for the parameter \beta when the exploration model was fit with data? How exploratory were the rats for the different experimental paradigms considered in this paper?

The Lottery Ticket Hypothesis

Deep learning is black magic. For some reason, a neural network with millions of parameters is not cursed to overfit. Somewhere, either in the architecture or the training or the weights themselves, exists a magic that allows deep neural networks to generalize.  We are now only beginning to understand why this is.  As in most cases in science, a good starting place is to observe a paradox about the system, suggest a hypothesis to explain the paradox, and then test, test, test.

The Paradox: This week we discussed “The lottery ticket hypothesis: Finding sparse, trainable neural networks” by Frankle and Carbin, ICML, 2019. Recent studies have shown that a deep neural network can be pruned down to as little as 10% of the size of the original network with no loss in test prediction. This leads to a paradox: Training the small, pruned network (randomly initialized) leads to worse test prediction than training a large network and then pruning it. This is heresy to the ML doctrine of “Thou shalt start simple, and then go complex.” What is going on?

The Hypothesis: This paradox suggests that the initialization of the weights of the pruned network matters.  This leads to the Lottery Ticket Hypothesis:

The Lottery Ticket Hypothesis: “dense, randomly-initialized, feed-forward networks contain subnetworks (i.e., winning tickets) that—when trained in isolation—reach test prediction comparable to the original network in a similar number of iterations.” 

These “winning tickets” start out with an initialization that make training particularly effective.  If the winning tickets were re-initialized randomly, they would no longer be winning tickets—training would not reach the same level of test prediction. This may explain why deep and wide networks tend to perform better than shallow, narrow networks—> the deeper, wider networks have more chances of having the winning ticket. It also suggests that training a neural network is akin to stochastic gradient descent seeking out these winning tickets.

The Test, Test, Test: To find the winning ticket, the authors employ iterative pruning. They start with a large neural network, train it a bit, set the smallest-magnitude weights to zero (these weights are no longer trainable), rewind the trainable parameters to their original initialized values, and train again. This is repeated until pruning drastically hurts test prediction. They find that these winning tickets can be as sparse as 3.7% of the original network’s size while having the same if not higher test prediction (Figure 1, solid lines). Interestingly, if they randomly re-initialize the weights of these pruned networks, test prediction decreases considerably  (Figure 1, dashed lines, ‘reinit’). Thus, we have a way to identify winning tickets, these winning tickets are a measly fraction of the size of the original network, and their original initialization is crucial to train them.  The authors do a thorough job of confirming the robustness of this effect (45 figures in all). However, they did not investigate the properties of the subnetworks (e.g., did pruning happen most in the deeper layers?).

Figure 1. Winning tickets (subnetworks of large, original network) still maintain high test prediction (solid lines; 7.1% has higher accuracy than 100%). If these winning tickets were randomly initialized then trained (reinit), test prediction suffers (dashed lines below solid lines at 7.1%). Results include three convolutional neural networks at different depths trained and tested on the CIFAR10 image dataset. Figure reproduced from Frankle and Carbin, 2019.

One question I had remaining is if these winning tickets were necessary or sufficient to train the large network.  One analysis that could get at this question is the following. First, identify a winning ticket, and then re-initialize its weights in the original, large network. This ensures that this subnetwork is no longer is a winning ticket. Keep repeating this process.  After removing winning tickets, does the large network fail to train? How many winning tickets does the large network have?  Will the large network always have a winning ticket?  We could also do the reverse of this: For a large network, keep initializing subnetworks that are winning tickets.  Is it the case that with more winning tickets, the network trains faster with higher test prediction?

Implications for deep learning: There are several implications for deep learning. 1. Finding out what is special about the initializations of these winning tickets may help us improve initialization strategies. 2. We may be able to improve optimization techniques by better guiding stochastic gradient descent to find and train the winning ticket as fast as possible. 3. New theory that focuses on subnetworks may lead to more insight into deep learning.  4. These winning tickets may be helpful for solving other tasks (i.e., transfer learning). There have already been some follow up studies about these issues and re-examining the lottery ticket hypothesis:  Liu et al., 2019, Crowley et al., 2019, Frankle et al., 2019, Zhou et al, 2019.

Does the brain have winning tickets? I had one recurring thought while reading this paper: I know brains prune synaptic connections substantially during development—could the brain also consist of whittled-down winning tickets? After some searching, I realized that it is largely unknown the amount of pruning that occurs or when certain pruning happens.  What would be some tell-tale signs of winning tickets?  First, substantial pruning should occur (and likely does…perhaps as much as 50%). Second, randomly initializing a developing circuit should lead to a drop in performance after the same amount of training as a control subject (not sure if we can randomly set synaptic weights yet). Third, what would happen if we could prune small-magnitude synaptic connections ourselves during development? Could the brain recover? These tests could first be carried out in insects, where we have gene lines, optogenetics, whole-brain recordings, and well-labeled cell types. 

Deep Neural Networks as Gaussian Processes

In lab meeting this week, we read Deep Neural Networks as Gaussian Processes by Lee, Bahri, Novak, Schoenholz, Pennington and Sohl-Dickstein, and which appeared at ICLR 2018. The paper extends a result derived by Neal (1994); and the authors show that there is a correspondence between deep neural networks and Gaussian processes. After coming up with an efficient method to evaluate the associated kernel, the authors compared the performance of their Gaussian process model with finite width neural networks (trained with SGD) on an image classification task (MNIST, CIFAR-10). They found that the performance of the finite width networks approached that of the Gaussian process performance as the width increased, and that the uncertainty captured by the Gaussian process correlated with mean squared prediction error. Overall, this paper hints at new connections between Gaussian processes and neural networks; and it remains to be seen whether future work can harness this connection in order to extend Gaussian process inference to larger datasets, or to endow neural networks with the ability to capture uncertainty. We look forward to following progress in this field.

Single Layer Neural Networks as Gaussian Processes – Neal 1994

Let us consider a neural network with a single hidden layer. We can write the ith output of the network, z_{i}^{1}, as

z_{i}^{1}(x) = b_{i}^{1} + \sum_{j}^{N_{1}}W_{ij}^{1}x_{j}^{1}(x)

where x_{j}^{1}(x) = \phi(b_{j}^{0} + \sum_{k}^{d_{in}}W_{jk}^{0}x_{k}) is the post-activation of the jth neuron in the hidden layer; \phi(x) is some nonlinearity, and x_{k} is the kth input to the network.

If we now assume that the weights for each layer in the network are sampled i.i.d. from a Gaussian distribution: W_{ij}^{1} \sim \mathcal{N}(0, \dfrac{\sigma_{w}^{2}}{N_{1}}), W_{ij}^{0} \sim \mathcal{N}(0, \dfrac{\sigma_{w}^{2}}{d_{in}}); and that the biases are similarly sampled: b_{i}^{1} \sim \mathcal{N}(0, \sigma_{b}^{2}) and b_{i}^{0} \sim \mathcal{N}(0, \sigma_{b}^{2}); then it is possible to show that, in the limit of N_{1} \rightarrow \infty, z_{i} \sim \mathcal{GP} (0, K_{\phi}), for a kernel K_{\phi} which depends on the nonlinearity. In particular, this follows from application of the Central Limit Theorem: for a fixed input to the network \vec{x}, z_{i}^{1} (\vec{x}) \rightarrow \mathcal{N}(0, \sigma_{b}^{2}+\sigma_{w}^{2}V_{\phi}(x^{1}(\vec{x}))) as N_{1} \rightarrow \infty where V_{\phi}(x^{1}(\vec{x})) \equiv \mathbb{E}[(x^{1}_{i}(\vec{x}))^{2}] (which is the same for all i).

We can now apply a similar argument to the above in order to examine the distribution of ith output of the network for a collection of inputs: that is we can examine the joint distribution of \{z_{i}^{1}(\vec{x}^{\alpha = 1}), z_{i}^{1}(\vec{x}^{\alpha = 2}), ..., z_{i}^{1}(\vec{x}^{\alpha = k})\}. Application of the Multidimensional Central Limit Theorem tells us that, in the limit of N_{1} \rightarrow \infty,

\{z_{i}^{1}(\vec{x}^{\alpha = 1}), z_{i}^{1}(\vec{x}^{\alpha = 2}), ...,z_{i}^{1}(\vec{x}^{\alpha = k})\} \sim \mathcal{N}(0, K_{\phi}),

where K_{\phi} \in \mathbb{R}^{k \times k} and K_{\phi}(\vec{x}, \vec{x}') \equiv \sigma_{b}^{2} + \sigma_{w}^{2}C_{\phi}(\vec{x}, \vec{x'}) and C_{\phi}(\vec{x}, \vec{x'}) \equiv \mathbb{E}[x_{i}^{1}(\vec{x})x_{i}^{1}(\vec{x}')].

Since we get a joint distribution of this form for any finite collection of inputs to the network, we can write that z_{i}^{1} \sim \mathcal{GP}(0, K_{\phi}), as this is the very definition of a Gaussian process.

This result was shown in Neal (1994); and the precise form of the kernel K_{\phi} was derived for the error function (a form of sigmoidal activation function) and Gaussian nonlinearities in Williams (1997).

Deep Neural Networks as Gaussian Processes

Lee et al. use similar arguments to those presented in Neal (1994) to show that the ith output of the lth layer of a network with a Gaussian prior over all of the weights and biases is a sample from a Gaussian process in the limit of N_{l} \rightarrow \infty. They use an inductive argument of the form: suppose that z_{j}^{l-1} \sim \mathcal{GP}(0, K_{\phi}^{l-1}) (the jth output of the (l-1)th layer of the network is sampled from a Gaussian process). Then:

z_{i}^{l} \equiv b_{i}^{l} + \sum_{j=1}^{N_{l}}W_{ij}^{l}x_{j}^{l}(\vec{x})

is Gaussian distributed as N_{l} \rightarrow \infty and any finite collection of \{z_{i}^{l}(\vec{x}^{\alpha=1}), ..., z_{i}^{l}(\vec{x}^{\alpha=k})\} will have a joint multivariate Gaussian distribution, i.e., z_{i}^{l} \sim \mathcal{GP}(0, K_{\phi}^{l}) where

K_{\phi}^{l}(\vec{x}, \vec{x'}) \equiv \mathbb{E}[z_{i}^{l}(\vec{x})z_{i}^{l}(\vec{x'})] = \sigma_{b}^{2} + \sigma_{w}^{2} \mathbb{E}_{z_{i}^{l-1}\sim \mathcal{GP}(0, K_{\phi}^{l-1})}[\phi(z_{i}^{l-1}(\vec{x})) \phi(z_{i}^{l-1}(\vec{x'}))].

If we assume a base kernel of the form K^{0}(\vec{x}, \vec{x'}) \equiv \sigma_{b}^{2} + \sigma_{w}^{2}(\dfrac{\vec{x}\cdot \vec{x'}}{d_{in}}), these recurrence relations can be solved in analytic form for the ReLU nonlinearity (as was demonstrated in Cho and Saul (2009)), and they can be solved numerically for other nonlinearities (and Lee et al., give a method for finding the numerical solution efficiently).

Comparison: Gaussian Processes and Finite Width Neural Networks

Lee et al. went on to compare predictions made via Gaussian process regression with the kernels obtained by solving the above recurrence relations (for nonlinearities ReLU and tanh), with the predictions obtained from finite width neural networks trained with SGD. The task was classification (reformulated as a regression problem) of MNIST digits and CIFAR-10 images. Overall, they found that their “NNGP” often outperformed finite width neural networks with the same number of layers for this task; and they also found that the performance of the finite width networks often approached that of the NNGP as the width of these networks was increased:

Figure 1 of Lee et al. The authors compare the performance of their NNGP to finite width neural networks of the same depth and find that, for many tasks, the NNGP outperforms the finite width networks and that the performance of the finite width networks approaches that of the NNGP as the width is increased.

oi-VAE: output interpretable VAEs

We recently read oi-VAE: Output Interpretable VAEs for Nonlinear Group Factor Analysis, which was published at ICML in 2018 by Samuel Ainsworth, Nicholas Foti, Adrian Lee, and Emily Fox. This paper proposes a nonlinear group factor analysis model that is an adaptation of VAEs to data with groups of observations. The goal is to identify nonlinear relationships between groups and to learn latent dimensions that control nonlinear interactions between groups. This encourages disentangled latent representations among sets of functional groups. A prominent example in the paper is motion capture data, where we desire a generative model of human walking and train on groups of observed recorded joint angles. 

Let us consider observations \mathbf{x} that we group into G groups \mathbf{x} = [\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(G)}].  We note that the paper does not discuss how to choose the groups and assumes that a grouping has already been specified. The generative model maps a set of latent variables to group-specific generator networks via group-specific mapping matrices \mathbf{W}^{(g)}  such that

\mathbf{z} \sim \mathcal{N}(0, \mathbf{I}) \\ \mathbf{x}^{(g)} \sim \mathcal{N}( f_{\theta_g}^{(g)} (\mathbf{W}^{(g)} \mathbf{z} ))

for each g.


oi-VAE generative model (Ainsworth et al., 2018)

For learning interpretable sets of latents that control separate groups, the key feature of this approach is to place a sparsity-inducing prior on the columns of each \mathbf{W}^{(g)}. The authors use a hierarchical Bayesian sparse prior that when analytically marginalized corresponds to optimizing a group lasso penalty on the columns of \mathbf{W}^{(g)}.

The model is trained in the standard VAE approach by optimizing the ELBO with an amortized inference network q_{\phi}(\mathbf{z} | \mathbf{x}), with the addition of the group lasso penalty and a prior on the parameters of the generator networks

\mathcal{L}(\phi, \theta, \mathcal{W}) = \mathbb{E}_{q_\phi(\mathbf{z} | \mathbf{x})} [ \log p(\mathbf{x} | \mathbf{z}, \mathcal{W}, \theta)] - D_{\mathrm{KL}} ( q_\phi(\mathbf{z} | \mathbf{x})  || p(\mathbf{z}) ) + \log p(\theta) - \lambda \sum_{g,j} \| w_j^{(g)} \|_2

where w_j^{(g)} is the j-th column of \mathbf{W}^{(g)} and \lambda is a parameter controlling the sparsity. The prior \log p(\theta) fixes the scaling of the neural network parameters relative to the mapping matrices \mathbf{W}^{(g)}.

The above objective consists of a differentiable term (the ELBO plus log prior on \theta) plus a convex but non-differential term (group LASSO). Therefore the authors use proximal gradient methods to optimize it. First, they update all parameters using the gradients of the ELBO plus log-prior with respect to \phi, \theta, and \mathcal{W}. Then, they apply the proximal operator

w_{j}^{(g)} \leftarrow \frac{ w_{j}^{(g)} } { \| w_{j}^{(g)} \|_2 } ( \| w_{j}^{(g)} \|_2 - \eta \lambda )_+

to each \mathbf{W}^{(g)} to respect the group lasso penalty, where \eta is a step-size. The authors fixed \lambda for all of their experiments fitting oi-VAE, so one question I had reading was how the authors determined \lambda and how varying \lambda affects the results.

The authors validate the approach on a toy example. They generated synthetic bars data, where one row of a square matrix was sampled from a Gaussian with non-zero mean while the rest of the matrix was sampled from zero-mean noise. The authors fit oi-VAE with each group set to a row of observations, and found that the model learned the appropriate sparse mapping where each latent component mapped to one of the rows. This latent space improved on the VAE, which did not have any discernible structure in the latent space. Importantly, oi-VAE still successfully identified the correct number of latent components (8) and sparse mapping even when the model was fit with double the amount of components.



(a) Synthetic bar data example and (b) reconstruction from oi-VAE. The learned oi-VAE latent-group mappings (c) match the true structure, while a VAE (d) does not learn a sparse structure. 


After validating the approach, the authors applied it to motion capture data. Here, the output groups were different joint angles. They trained the oi-VAE model on walking motion capture data. The learned latent dimensions nicely corresponded to intuitive groups of joint angles, such as the left leg (left foot, left lower leg, left upper leg). Next, the imposed structure in the model helped it generate more realistic unconditional samples of walking than the VAE, presumably because the inductive bias allowed oi-VAE to better learn invalid joint angles.


Unconditional walking pose samples from the VAE and oi-VAE models.

These results suggest the oi-VAE is a useful model for discovering nonlinear interactions between groups. In particular, I liked the approach of adding structure in the generative model to gain interpretability, and hypothesize that adding other forms of structure to VAE generative models is a good way to encourage disentangled representations (see a recent example of this in Dieng et al., 2019).

Two questions when using the approach are how to choose \lambda and how to choose the grouping of the data, as this work assumes a grouping has been chosen. In some data we may have prior knowledge about a natural grouping structure but that will not always be the case. However, even without multiple groups, the approach could be useful for learning the number of latent dimensions useful for explaining the amount of data. Finally, we point the reader to factVAE, where the authors further develop this idea to simultaneous learn complementary sparse structure in the inference network and generative model.

Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization

This week we continued the deep generative model theme and discussed Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization. This work is in the context of statistical compressive sensing, which attempts to recover a high-dimensional signal x \in \mathbb { R } ^ { n } from m low dimensional measurements y \in \mathbb { R } ^ { m } in a way that leverages a set of training data X. This is in contrast to traditional compressive sensing, which a priori asserts a sparsity based prior over x and learns the recovery of a single x from a single compressed measurement y.

The central compressive sensing problem can be written as

y = W x + \epsilon.

In traditional compressive sensing,  W is a known random Gaussian matrix that takes linear combinations of the sparse signal x to encode y. Here, \epsilon describes additive Gaussian noise. The goal of compressive sensing is to decode x, given m measurements y. This is traditionally done via LASSO, which minimizes the \ell_1 convex optimization problem \hat { x } = \arg \min _ { x } \| x \| _ { 1 } + \lambda \| y - W x \| _ { 2 } ^ { 2 }. In contrast, statistical compressive sensing is a set of methods which recovers x from y using a procedure that results from training on many examples of x. That is, the decoding of x from y is done in a way that utilizes statistical properties learned from training data X. The encoding, on the other hand, may either involve a known random Gaussian matrix W, as is the case above, or it may also be learned as a part of the training procedure.

There are two statistical compressive sensing methods discussed in this paper. One is prior work that leverages the generative stage of variational autoencoders to reconstruct x from y (Bora et al. 2017). In this paper, the decoding network of learned VAE creates an x from a sample z. This is adapted to a compressive sensing framework of recreating the best x from measurement y by solving the following minimization problem:

\widehat { x } = G \left( \arg \min _ { z } \| y - W G ( z ) \| _ { 2 } \right)

Here, G(z) is the VAE decoding mapping. The optimization is: given some measurements y, we seek to find the z that generates an x such that, when it is compressed, best matches the measurements y.

This VAE-based decompression method is used as a comparison to a new method presented in this paper, called the Uncertainty Autoencoder (UAE) which is able to optionally learn both an encoding process, the mapping of x to y, and a decoding process, the recovery of a reconstructed x from a y. It learns these encoding and decoding mappings by maximizing the mutual information between X and Y, parameterized by encoding and decoding distributions. The objective, derived through information theoretic principles, can be written as:

\max _ { \phi , \theta } \sum _ { x \in \mathcal { D } } \mathbb { E } _ { Q _ { \phi } ( Y | x ) } \left[ \log p _ { \theta } ( x | y ) \right] \stackrel { \mathrm { def } } { = } \mathcal { L } ( \phi , \theta ; \mathcal { D } )

Here, Q _ { \phi } ( Y | X) is an encoding distribution parameterized like the original sparse coding compression mapping \mathcal { N } \left( W  ( X ) , \sigma ^ { 2 } I _ { m } \right), and  \log p _ { \theta } ( x | y ) is a variational distribution that decodes x from y using a deep neural network parameterized by \theta. This objective is maximized by drawing training data samples from what is asserted as a uniform prior over Q, which is simply the data itself, Q_{data}(X).

Using this objective, it is possible to derive an optimal linear encoder W under a Gaussian noise assumption in the limit of infinite noise. This linear encoding, however, is not the same as PCA, which is derived under the assumption of linear decoding. UAE linear compression, instead, makes no assumptions about the nature of the decoding. The authors use this optimal linear compressor W* on MNIST, and use a variety of classification algorithms (k-nearest neighbors, SVMs, etc) on this low-d representation to test classification performance. They find that the UAE linear compression better separates clusters of digits than PCA. It is unclear, however, how this UAE classification performance would compare to linear compression algorithms that are known to work better for classification, such as random projections and ICA. I suspect it will not do as well. Without these comparisons, unclear what use this particular linear mapping provides.

The authors demonstrate that UAE outperforms both LASSO and VAE decoding on reconstruction of MNIST and omniglot images from compressed measurements. They implement two versions of the UAE, one that includes a trained encoding W, and another where W is a random Gaussian matrix, as it is for the other decoding conditions. This allows the reader to distinguish to what extent the UAE decoder p(x|y) does a better job at reconstructing x under the same encoding as the alternative algorithms.

Most interestingly, the authors test the UAE in a transfer compressive sensing condition, where an encoder and decoder is learned on some data, say omniglot, and the decoder is re-learned using compressed measurements from different data, MNIST. In this condition, the training algorithm has no access to the test MNIST signals, but still manages to accurately recover the test signals given their compressed representations. This suggests that reasonably differently structured data may have similar optimal information-preserving linear encodings, and an encoder learned from one type of data can be utilized to create effective reconstruction mappings across a variety of data domains. The applications here may be useful.

It is unclear, however, how well these comparisons hold up in a context where traditional compressive sensing has proven to be very effective, like MRI imaging.

There are interesting parallels to be drawn between the UAE and other recent work in generative modeling and their connections to information theory. As the author’s point out, their objective function corresponds to last week’s \betaVAE objective function with \beta = 0. That is, the UAE objective written above is the same as the VAE objective minus the KL term. Though a VAE with \beta = 0 does not actually satisfy the bound derived from the marginal distribution, and hence is not a valid ELBO, is does satisfy the bound of maximal mutual information. And as Mike derives in the post below, there is an additional connection between rate-distortion theory and the \betaVAE  objective. This suggests that powerful generative models can be derived from either principles of Bayesian modeling or information theory, and connections between the two are just now beginning to be explored.