Fast approximate inference for directed graphical models: a Bayesian auto-encoder

In this week’s lab meeting, I presented the following paper from Max Welling’s group:

Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
arXiv, 2013.

The paper proposed an efficient inference and learning method for directed probabilistic models with continuous latent variables (with intractable posterior distributions), for use with large datasets. The directed graphical model under consideration is as follows:

[Figure: the directed graphical model, with solid lines for the generative model and dashed lines for the recognition model]

The dataset is \mathbf{X}=\{\mathbf{x}^{(i)}\}_{i=1}^N, consisting of N i.i.d. samples of some continuous or discrete variable \mathbf{x}. \mathbf{z} is an unobserved continuous random variable generating the data (solid lines: p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})), where \mathbf{\theta} is the parameter set of the generative model. The ultimate task is to learn both \mathbf{\theta} and \mathbf{z}. A general approach to such a problem is to marginalize out \mathbf{z} to get the marginal likelihood p_\mathbf{\theta}(\mathbf{x})=\int p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x}|\mathbf{z})d\mathbf{z}, and to maximize this likelihood to learn \mathbf{\theta}. However, in many applications, e.g. when the likelihood is a neural network with a nonlinear hidden layer, the integral is intractable. To overcome this intractability, sampling-based methods, e.g. Monte Carlo EM, are typically used. But when the dataset is large, batch optimization is too costly and a sampling loop per datapoint is very expensive. Therefore, the paper introduced a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case.
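To make the marginalization concrete, here is a minimal sketch of a naive Monte Carlo estimate of p_\mathbf{\theta}(\mathbf{x}). It uses a toy model chosen purely for illustration (not from the paper): z \sim \mathcal{N}(0,1) and x|z \sim \mathcal{N}(z,1), for which the integral happens to be tractable and equals \mathcal{N}(x;0,2). The function names are hypothetical.

```python
import numpy as np

def naive_mc_marginal(x, num_samples=200_000, seed=0):
    """Estimate p(x) = integral of p(z) p(x|z) dz by averaging p(x|z) over prior samples."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)                     # z ~ p(z) = N(0, 1)
    lik = np.exp(-0.5 * (x - z) ** 2) / np.sqrt(2 * np.pi)   # p(x|z) = N(x; z, 1)
    return lik.mean()

def exact_marginal(x):
    """Closed form for this toy model only: p(x) = N(x; 0, 2)."""
    return np.exp(-0.25 * x ** 2) / np.sqrt(4 * np.pi)

if __name__ == "__main__":
    print(naive_mc_marginal(1.0), exact_marginal(1.0))
```

In this one-dimensional toy case the estimate is cheap and accurate. With a high-dimensional \mathbf{z} and a neural-network likelihood, prior samples rarely land where p_\mathbf{\theta}(\mathbf{x}|\mathbf{z}) is large, which is one way to see why such sampling becomes expensive per datapoint.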

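First, they defined a recognition model q_\Phi(\mathbf{z}|\mathbf{x}): an approximation to the intractable true posterior p_\mathbf{\theta}(\mathbf{z}|\mathbf{x}), which is interpreted as a probabilistic encoder (dashed lines in the directed graph); correspondingly, p_\mathbf{\theta}(\mathbf{x}|\mathbf{z}) is the probabilistic decoder. Given the recognition model, the variational lower bound \mathcal{L}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)}) is defined as

\mathcal{L}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)}) = -D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})||p_\mathbf{\theta}(\mathbf{z})) + \mathbb{E}_{q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_\mathbf{\theta}(\mathbf{x}^{(i)}|\mathbf{z})\right]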


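In the paper’s setting, the prior is a standard normal and the approximate posterior is a Gaussian with diagonal covariance:

p_\mathbf{\theta}(\mathbf{z}) = \mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I}), \quad q_\Phi(\mathbf{z}|\mathbf{x}^{(i)}) = \mathcal{N}(\mathbf{z};\mathbf{\mu}^{(i)},\mathbf{\sigma}^{2(i)}\mathbf{I})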


Therefore, D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})||p_\mathbf{\theta}(\mathbf{z})) has an analytical form. The tricky term is the expectation, which usually has no closed-form solution. The usual Monte Carlo gradient estimator for this type of problem exhibits very high variance, and sampling \mathbf{z} directly from q_\Phi does not let us differentiate through the samples w.r.t. \Phi. To address this, the paper proposed a reparameterization trick for the expectation term that yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.
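As a quick check of the analytical KL term, the following sketch assumes a standard normal prior and a diagonal Gaussian approximate posterior, as in the paper, and compares the closed form with a plain Monte Carlo estimate. The function names and the parameterization by \log\sigma^2 are illustrative choices, not taken from the post.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), with log_var = log sigma^2."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def kl_monte_carlo(mu, log_var, num_samples=200_000, seed=0):
    """Monte Carlo estimate of the same KL: the average of log q(z) - log p(z) for z ~ q."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    rng = np.random.default_rng(seed)
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal((num_samples, mu.size))
    log_2pi = np.log(2.0 * np.pi)
    log_q = -0.5 * np.sum(log_var + ((z - mu) / sigma) ** 2 + log_2pi, axis=1)
    log_p = -0.5 * np.sum(z ** 2 + log_2pi, axis=1)
    return np.mean(log_q - log_p)

if __name__ == "__main__":
    mu, log_var = [1.0, -0.5], [0.0, np.log(0.25)]
    print(kl_diag_gaussian_to_std_normal(mu, log_var), kl_monte_carlo(mu, log_var))
```

The two numbers should agree up to sampling noise; the closed form is exact and costs one pass over the latent dimensions, which is why only the reconstruction term needs sampling.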

The key reparameterization trick constructs samples \mathbf{z}\sim q_\Phi(\mathbf{z}|\mathbf{x}) in two steps:

  1. \mathbf{\epsilon} \sim p(\mathbf{\epsilon}) (random seed independent of \Phi)
  2.  \mathbf{z}=g(\Phi,\mathbf{\epsilon},\mathbf{x}) (differentiable perturbation)

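such that \mathbf{z}\sim q_\Phi(\mathbf{z}|\mathbf{x}) (the correct distribution). This yields an estimator which typically has less variance than the generic estimator:

\tilde{\mathcal{L}}(\mathbf{\theta},\Phi;\mathbf{x}^{(i)}) = -D_{KL}(q_\Phi(\mathbf{z}|\mathbf{x}^{(i)})||p_\mathbf{\theta}(\mathbf{z})) + \frac{1}{L}\sum_{l=1}^{L}\log p_\mathbf{\theta}(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)})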


where \mathbf{z}^{(i,l)}=g(\Phi,\mathbf{\epsilon}^{(i,l)},\mathbf{x}^{(i)}) and \mathbf{\epsilon}^{(i,l)}\sim p(\mathbf{\epsilon}).
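The two-step sampling and the resulting estimator can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper’s implementation: the approximate posterior is a diagonal Gaussian, so g(\Phi,\mathbf{\epsilon},\mathbf{x}) = \mathbf{\mu} + \mathbf{\sigma}\odot\mathbf{\epsilon} with \mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}); the decoder is a hypothetical single linear layer with Bernoulli outputs (weights W, bias b) rather than an MLP; and \mathbf{\mu} and \log\mathbf{\sigma}^2 are passed in directly instead of being produced by an encoder network.

```python
import numpy as np

def reparameterize(mu, log_var, eps):
    """z = g(phi, eps, x): deterministic, differentiable in (mu, log_var) for fixed eps."""
    return mu + np.exp(0.5 * log_var) * eps

def bernoulli_log_lik(x, logits):
    """log p(x|z) for binary x under independent Bernoulli outputs, computed stably."""
    return np.sum(x * logits - np.logaddexp(0.0, logits), axis=-1)

def sgvb_estimate(x, mu, log_var, W, b, num_samples=1, rng=None):
    """Lower bound estimate for one datapoint: -KL + (1/L) * sum_l log p(x | z^(l))."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((num_samples, mu.size))   # step 1: noise independent of phi
    z = reparameterize(mu, log_var, eps)                # step 2: transform noise into z
    logits = z @ W + b                                  # toy linear decoder (assumption)
    recon = bernoulli_log_lik(x, logits).mean()         # Monte Carlo reconstruction term
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))  # analytic KL term
    return -kl + recon

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([1.0, 0.0, 1.0, 1.0])
    mu, log_var = np.array([0.3, -0.2]), np.array([-1.0, -0.5])
    W, b = 0.1 * rng.standard_normal((2, 4)), np.zeros(4)
    print(sgvb_estimate(x, mu, log_var, W, b, num_samples=10, rng=rng))
```

Because the randomness enters only through eps, the estimate is an ordinary differentiable function of mu, log_var, W and b, so an automatic differentiation library could take its gradient directly; plain NumPy is used here only to keep the sketch dependency-free.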

A connection with auto-encoders becomes clear when looking at the objective function. The first term, the KL divergence of the approximate posterior from the prior, acts as a regularizer, while the second term is an expected negative reconstruction error.

In the experiments, they set p_\mathbf{\theta}(\mathbf{x}|\mathbf{z}) to be a Bernoulli or Gaussian MLP, depending on the type of data being modeled. They compared their method to the wake-sleep algorithm and Monte Carlo EM on the MNIST and Frey Face datasets.

Overall, I think their contributions are two-fold. First, the reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, they showed that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. The stochastic gradient method also lends itself to parallelization, which improves efficiency on large-scale datasets.



Binary Neurons can have Short Temporal Memory

This (belated) post is about the paper:

Randomly connected networks have short temporal memory,
Wallace, Hamid, & Latham, Neural Computation  (2013),

which I presented a few weeks ago at the Pillow lab group meeting. This paper analyzes the ability of randomly connected networks of binary neurons to store memories of network inputs. Network memory is a valuable quantity to bound; long memory indicates that a network is more likely to be able to perform complex operations on streaming inputs. Thus, the ability to recall past inputs provides a proxy for being able to operate on those inputs. The overall result seems to stand in contrast to much of the literature because it predicts a very short memory (on the order of the logarithm of the number of nodes). The authors mention that this difference in the result is due to their use of more densely connected networks. There seem to be additional differences, though, between this paper’s network construction and those analyzed in related work.


Inferring synaptic plasticity rules from spike counts

In last week’s computational & theoretical neuroscience journal club I presented the following paper from Nicolas Brunel’s group:

Inferring learning rules from distributions of firing rates in cortical neurons.
Lim, McKee, Woloszyn, Amit, Freedman, Sheinberg, & Brunel.
Nature Neuroscience (2015).

The paper seeks to explain experience-dependent changes in IT cortical responses in terms of an underlying synaptic plasticity rule.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

This week in lab meeting we discussed:

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Andrew M. Saxe, James L. McClelland, Surya Ganguli.
arXiv (2013).

This work aims to start analyzing a gnawing question in machine learning: How do deep neural networks actually work?

Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.

This paper describes a class of latent variable models inspired by deep neural networks and hierarchical generative models, called Deep Exponential Families (DEFs). DEFs stack multiple layers of exponential families and connect them with certain link functions to capture the hierarchy of dependencies.


Every Neuron is Special

A couple of weeks ago I presented

A category-free neural population supports evolving demands during decision-making

by David Raposo, Matthew Kaufman and Anne Churchland.  By “categories” they are referring to some population of cells whose responses during an experiment seem to be dominated by one or two of the experimental variables. The authors refer to these types of categories as functional categories.
