# Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization

This week we continued the deep generative model theme and discussed Uncertainty Autoencoders: Learning Compressed Representations via Variational Information Maximization. This work is in the context of statistical compressive sensing, which attempts to recover a high-dimensional signal $x \in \mathbb{R}^n$ from a low-dimensional vector of $m$ measurements $y \in \mathbb{R}^m$ (with $m < n$) in a way that leverages a set of training data $X$. This is in contrast to traditional compressive sensing, which a priori asserts a sparsity-based prior over $x$ and recovers a single $x$ from a single compressed measurement $y$.

The central compressive sensing problem can be written as $y = W x + \epsilon$.

In traditional compressive sensing, $W$ is a known random Gaussian matrix that takes linear combinations of the sparse signal $x$ to produce the measurements $y$, and $\epsilon$ is additive Gaussian noise. The goal of compressive sensing is to decode $x$ given the $m$ measurements $y$. This is traditionally done via LASSO, which solves the $\ell_1$-regularized convex optimization problem $\hat{x} = \arg\min_x \|x\|_1 + \lambda \|y - Wx\|_2^2$. In contrast, statistical compressive sensing refers to a set of methods that recover $x$ from $y$ using a procedure learned from many training examples of $x$. That is, the decoding of $x$ from $y$ exploits statistical structure learned from the training data $X$. The encoding, on the other hand, may either involve a known random Gaussian matrix $W$, as above, or may itself be learned as part of the training procedure.
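As a concrete illustration of the traditional recovery problem, here is a minimal NumPy sketch that solves the LASSO objective above with ISTA (iterative soft-thresholding, a standard proximal gradient method); the dimensions, sparsity level, and $\lambda$ are arbitrary choices for the demo, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 50, 25, 3          # signal dim, number of measurements, sparsity

# Sparse ground-truth signal and random Gaussian sensing matrix W
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(0, 1, k)
W = rng.normal(0, 1 / np.sqrt(m), (m, n))
y = W @ x_true               # noiseless measurements, for clarity

# ISTA: proximal gradient descent on ||y - Wx||_2^2 + lam * ||x||_1
lam = 0.01
t = 1.0 / np.linalg.norm(W, 2) ** 2      # step size 1/L with L = ||W||_2^2
x = np.zeros(n)
for _ in range(2000):
    g = x - t * W.T @ (W @ x - y)        # gradient step on the quadratic term
    x = np.sign(g) * np.maximum(np.abs(g) - t * lam, 0.0)  # soft-threshold

rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

With $m < n$ the linear system is underdetermined, so the $\ell_1$ penalty is what makes the sparse solution recoverable.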

There are two statistical compressive sensing methods discussed in this paper. One is prior work that leverages the generative stage of variational autoencoders to reconstruct $x$ from $y$ (Bora et al. 2017). In that work, the decoding network of a learned VAE generates an $x$ from a latent sample $z$. This is adapted to the compressive sensing task of recovering the best $x$ from a measurement $y$ by solving the following minimization problem: $\widehat{x} = G\left(\arg\min_z \|y - WG(z)\|_2\right)$

Here, $G(z)$ is the VAE decoder mapping. The optimization says: given some measurements $y$, find the $z$ that generates an $x$ which, when compressed, best matches the measurements $y$.
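A toy sketch of this search over $z$, with a small `tanh` map standing in for the trained VAE decoder $G$ (the actual method uses the learned deep network; all dimensions and the learning rate here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 20, 8, 4                    # signal, measurement, latent dims

A = rng.normal(0, 1 / np.sqrt(d), (n, d))   # toy decoder weights
W = rng.normal(0, 1 / np.sqrt(m), (m, n))   # random Gaussian encoder

G = lambda z: np.tanh(A @ z)          # stand-in for the VAE decoder G(z)

z_true = rng.normal(0, 1, d)
y = W @ G(z_true)                     # compressed measurements of a generated x

# Gradient descent on z to minimize ||y - W G(z)||_2^2
z = np.zeros(d)
lr = 0.01
for _ in range(2000):
    g = G(z)
    r = y - W @ g                             # residual in measurement space
    grad_z = -2 * A.T @ ((1 - g**2) * (W.T @ r))  # chain rule through tanh
    z -= lr * grad_z

x_hat = G(z)
loss = np.sum((y - W @ x_hat) ** 2)
```

Note the optimization is non-convex in $z$, so in practice Bora et al. rerun it from several random initializations.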

This VAE-based decompression method serves as a comparison for the new method presented in this paper, the Uncertainty Autoencoder (UAE), which can optionally learn both an encoding process (the mapping of $x$ to $y$) and a decoding process (the recovery of a reconstructed $x$ from $y$). It learns these mappings by maximizing the mutual information between $X$ and $Y$, parameterized by encoding and decoding distributions. The objective, derived through information-theoretic principles, can be written as: $\max_{\phi, \theta} \sum_{x \in \mathcal{D}} \mathbb{E}_{Q_\phi(Y|x)} \left[ \log p_\theta(x|y) \right] \stackrel{\mathrm{def}}{=} \mathcal{L}(\phi, \theta; \mathcal{D})$

Here, $Q_\phi(Y|X)$ is an encoding distribution parameterized like the original compressive sensing measurement model, $\mathcal{N}\left(W(X), \sigma^2 I_m\right)$, and $p_\theta(x|y)$ is a variational decoding distribution parameterized by a deep neural network with weights $\theta$. The objective is maximized by drawing training samples from the empirical data distribution $Q_{data}(X)$, which stands in for a prior over $X$.
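A minimal sketch of training this objective, assuming a linear encoder and a linear Gaussian decoder so that maximizing $\log p_\theta(x|y)$ reduces to minimizing squared reconstruction error (the paper uses a deep decoder network; dimensions, initialization, and learning rate below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 16, 4, 512               # signal dim, measurement dim, training points
sigma = 0.1                        # measurement noise std

# Toy training data lying near a low-dimensional subspace
basis = rng.normal(0, 1, (n, 3))
X = rng.normal(0, 1, (N, 3)) @ basis.T

W = rng.normal(0, 0.3, (m, n))     # encoder: y = Wx + sigma * eps  (learned)
V = rng.normal(0, 0.3, (n, m))     # linear Gaussian decoder mean: x_hat = Vy  (learned)

lr = 0.005
for step in range(2000):
    eps = rng.normal(0, 1, (N, m))
    Y = X @ W.T + sigma * eps                 # stochastic encoding Q_phi(y|x)
    R = Y @ V.T - X                           # reconstruction residual x_hat - x
    loss = np.mean(np.sum(R**2, axis=1))      # -log p_theta(x|y) up to constants
    if step == 0:
        init_loss = loss
    # Analytic gradients of the Monte Carlo objective w.r.t. decoder and encoder
    grad_V = (2.0 / N) * R.T @ Y
    grad_W = (2.0 / N) * (R @ V).T @ X
    V -= lr * grad_V
    W -= lr * grad_W
```

The key structural point survives the simplification: the noise $\epsilon$ injected inside the expectation is what makes this an *uncertainty* autoencoder rather than an ordinary autoencoder, since $W$ must choose measurements whose information content survives the noise.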

Using this objective, it is possible to derive an optimal linear encoder $W$ under a Gaussian noise assumption in the limit of infinite noise. This linear encoding, however, is not the same as PCA, which is derived under the assumption of linear decoding; UAE linear compression makes no assumptions about the nature of the decoding. The authors apply this optimal linear compressor $W^*$ to MNIST and run a variety of classification algorithms (k-nearest neighbors, SVMs, etc.) on the low-dimensional representation to test classification performance. They find that the UAE linear compression separates clusters of digits better than PCA. It is unclear, however, how this classification performance would compare to linear compression algorithms that are known to work better for classification, such as random projections and ICA; I suspect it would not do as well. Without these comparisons, it is unclear what advantage this particular linear mapping provides.

The authors demonstrate that the UAE outperforms both LASSO and VAE decoding on reconstruction of MNIST and Omniglot images from compressed measurements. They implement two versions of the UAE: one with a trained encoding $W$, and another where $W$ is a random Gaussian matrix, as it is for the other decoding conditions. This allows the reader to distinguish to what extent the UAE decoder $p_\theta(x|y)$ does a better job of reconstructing $x$ under the same encoding as the alternative algorithms.

Most interestingly, the authors test the UAE in a transfer compressive sensing condition, where an encoder and decoder are learned on one dataset, say Omniglot, and the decoder is then re-learned from compressed measurements of a different dataset, MNIST. In this condition, the training algorithm has no access to the test MNIST signals, yet it still manages to accurately recover the test signals from their compressed representations. This suggests that quite differently structured datasets may share similar information-preserving linear encodings, so an encoder learned on one type of data can support effective reconstruction mappings across a variety of data domains. This transfer setting may have useful applications.

It is unclear, however, how well these comparisons hold up in a context where traditional compressive sensing has proven very effective, such as MRI.

There are interesting parallels to be drawn between the UAE and other recent work connecting generative modeling to information theory. As the authors point out, their objective function corresponds to last week's $\beta$-VAE objective function with $\beta = 0$. That is, the UAE objective written above is the same as the VAE objective minus the KL term. Though a VAE with $\beta = 0$ does not satisfy the bound derived from the marginal distribution, and hence is not a valid ELBO, it does satisfy the bound on maximal mutual information. And as Mike derives in the post below, there is an additional connection between rate-distortion theory and the $\beta$-VAE objective. This suggests that powerful generative models can be derived from either principles of Bayesian modeling or of information theory, and connections between the two are only now beginning to be explored.
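Concretely, writing the per-datapoint $\beta$-VAE objective (in the notation of last week's post, with the UAE's $y$ playing the role of the latent code $z$):

$$\mathcal{L}_\beta(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta \, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$

Setting $\beta = 0$ removes the KL term and leaves only the expected reconstruction term, which is exactly the per-datapoint UAE objective written above.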

# Automatic Differentiation in Machine Learning: a Survey

In lab meeting this week we discussed Automatic Differentiation in Machine Learning: a Survey, which addresses the general technique of automatic differentiation (autodiff) for functions expressed in computer programs. The paper initially outlines three main approaches to the computation of derivatives in computer programs:

1. Manually working out derivatives by hand and coding the result. The limitation is that derivatives can involve a large number of terms (expression swell) that are cumbersome to handle manually and invite algebraic mistakes.
2. Numerically approximating derivatives by evaluating the function at small discrete steps. The disadvantage is that finite differences inevitably incur round-off errors and are sensitive to step size.
3. Computer manipulation of symbolic mathematical expressions, automatically generating derivative expressions via standard calculus rules. The disadvantage is that this is computationally cumbersome, and the standard rules often yield derivative expressions that still need to be manually simplified.

In contrast to these methods, the paper introduces autodiff as a set of techniques that operates at the level of elementary computations via step-wise application of the chain rule. Evaluating a function in a computer program executes a sequence of operations that can be represented as a computational graph, where each node is an intermediary temporary variable. The computer executes this trace of elementary operations through intermediary variables, and the same procedure can be exploited to concurrently and efficiently calculate a derivative. The paper outlines two general approaches to autodiff: forward mode and backward (reverse) mode.
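A quick sketch makes the step-size sensitivity of numerical differentiation (approach 2) concrete: a forward difference of $\sin$ at $x_0 = 1$ suffers truncation error when $h$ is too large and floating-point round-off when $h$ is too small (the particular step sizes are arbitrary):

```python
import math

f = math.sin
x0 = 1.0
exact = math.cos(x0)            # true derivative of sin at x0

def forward_diff(f, x, h):
    """One-sided finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x)) / h

errors = {h: abs(forward_diff(f, x0, h) - exact)
          for h in (1e-1, 1e-8, 1e-13)}
# h = 1e-1:  truncation error dominates
# h = 1e-8:  near the sweet spot for double precision
# h = 1e-13: round-off error dominates (subtracting nearly equal floats)
```

Autodiff sidesteps this trade-off entirely: it computes derivatives exactly (to machine precision), with no step size to tune.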
In forward mode, a derivative trace is evaluated alongside the computer's execution of the function trace, with intermediary values from the function evaluation used in concert with the derivative trace calculation. This execution can be conveniently realized by extending the representation of the real numbers to include a dual component. Analogous to imaginary numbers, this dual component is denoted with an $\epsilon$, where $\epsilon^2 = 0$. That is, each number in the program's execution is extended to a tuple carrying this additional dual component, whose value is computed from the intermediate variables alongside the evaluation of the function itself. Using this simple rule of dual numbers ($\epsilon^2 = 0$), the evaluations implement the chain rule to automatically calculate derivatives at every computational step along the evaluation trace. To execute forward-mode autodiff, the derivative trace is initially 'seeded' with an input derivative vector as the function is evaluated at an input. Whereas symbolic differentiation can be thought of as mapping $f(x)$ to $J_f(x)$, forward-mode autodiff maps an evaluation point $c$ and a seed vector $\vec{x}$ to the Jacobian-vector product $J_f(c) \cdot \vec{x}$. As such, for a function mapping dimension $n$ to dimension $m$, forward mode requires $n$ trace evaluations to calculate the entire Jacobian, with the seeded derivative vector set to each standard basis vector of the input space in turn. Thus, the full Jacobian can be computed in approximately $n$ times the number of operations in one evaluation of $f(c)$. Backward mode, by contrast, computes the vector-Jacobian product $\vec{y}^{\top} J_f(c)$. That is, one evaluation of the derivative trace in backward mode computes a row of the Jacobian (as opposed to a column, as in forward mode).
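The dual-number mechanics of forward mode described above can be sketched in a few lines (only `+` and `*` are implemented here; a real implementation would overload every primitive operation):

```python
class Dual:
    """Dual number a + b*eps with eps**2 = 0: carries a value and its derivative."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + a'eps)(b + b'eps) = ab + (ab' + a'b)eps, since eps**2 = 0
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def f(x):
    return x * x + 2 * x       # f'(x) = 2x + 2

out = f(Dual(3.0, 1.0))        # seed the derivative with 1 to differentiate w.r.t. x
# out.value == 15.0, out.deriv == 8.0
```

Note that the derivative falls out of ordinary evaluation of `f`: the product rule is applied mechanically at each elementary operation, with no symbolic expression for $f'$ ever constructed.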
In this case, the calculation of the full Jacobian can be completed in approximately $m$ times the total number of operations in an evaluation of $f(c)$. A disadvantage of backward-mode autodiff is that each intermediate variable along the function evaluation trace must be stored in memory for use during the backward execution of the derivative trace.
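A minimal reverse-mode sketch (a toy tape, not any particular library) makes that memory cost concrete: every intermediate `Var` created during the forward evaluation stays reachable from the output so the backward pass can revisit it.

```python
class Var:
    """Node on the evaluation trace: stores a value, a gradient, and a backward rule."""
    def __init__(self, value, parents=()):
        self.value, self.grad = value, 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Var(self.value + other.value, (self, other))
        def _backward():                     # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, (self, other))
        def _backward():                     # product rule
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Reverse pass: visit every stored intermediate in reverse topological order
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Var(2.0), Var(3.0)
c = a * b + a                # c = ab + a, so dc/da = b + 1, dc/db = a
c.backward()
# c.value == 8.0, a.grad == 4.0, b.grad == 2.0
```

One backward sweep fills in the gradients of *all* inputs at once, which is why reverse mode is the natural choice for the many-inputs, scalar-output functions (losses) that dominate machine learning.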