Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.
http://arxiv.org/abs/1411.2581

This paper describes a class of latent variable models, called Deep Exponential Families (DEFs), inspired by deep neural networks and hierarchical generative models. DEFs stack multiple layers of exponential-family latent variables and connect them with link functions to capture a hierarchy of dependencies.

Exponential families have the general form p(x|\eta)=h(x)\exp(\eta^T T(x)-a(\eta)), where h(x) is the base measure, T(x) is a vector of sufficient statistics, \eta is the natural parameter, and a(\eta) is the log-normalizer. An attractive property of exponential family distributions is that \mathbb{E}[T(x)]=\nabla_{\eta}a(\eta), which DEFs exploit later.
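To see this property in a concrete case, here is a small numerical check (my own illustration, not from the paper): writing the Poisson as p(x|\eta)=\frac{1}{x!}\exp(\eta x - e^{\eta}) gives T(x)=x and a(\eta)=e^{\eta}, so \mathbb{E}[x] should equal \nabla_{\eta}a(\eta)=e^{\eta}.

    # Numerical check of E[T(x)] = grad a(eta) for the Poisson, written as
    # p(x|eta) = (1/x!) exp(eta*x - e^eta), so T(x) = x and a(eta) = exp(eta).
    # (Illustrative sketch; eta and the sample size are arbitrary choices.)
    import numpy as np

    rng = np.random.default_rng(0)

    eta = 0.7                      # natural parameter, eta = log(rate)
    rate = np.exp(eta)             # Poisson mean implied by eta
    samples = rng.poisson(rate, size=1_000_000)

    grad_a = np.exp(eta)           # d/d eta of exp(eta)
    print(samples.mean(), grad_a)  # both close to exp(0.7) ~ 2.01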

Imitating deep neural networks, these models stack multiple exponential family distributions in a hierarchical structure:

[Figure: layered graphical structure of a deep exponential family]

Here, z_l is the vector of latent variables at layer l, K_l is the number of variables in layer l, and W_l is the weight matrix connecting layers l+1 and l. The generative direction runs from top to bottom. The top layer's latent variables are drawn from a prior p(z_L)=\mbox{EXPFAM}_L(z_L,\eta). The conditional distribution between layers l+1 and l is p(z_l|z_{l+1},W_l)=\mbox{EXPFAM}_l(z_l,g_l(z_{l+1}^T W_l)). The bottom layer is the observation likelihood, which in this paper is Poisson: p(x|z_1,W_0)=\mbox{Poisson}(z_1^T W_0). The link function g_l maps the inner product z_{l+1}^T W_l to the natural parameter used to generate z_l. Because the expected sufficient statistics equal the gradient of the log-normalizer, \mathbb{E}[T(z_l)]=\nabla_{\eta_l}a(\eta_l) with \eta_l=g_l(z_{l+1}^T W_l), taking the identity link g_l(x)=x and T(z_l)=z_l makes \mathbb{E}[z_l] a linear function of W_l passed through \nabla_{\eta_l}a(\cdot); this is one source of non-linearity, analogous to the activation function in a neural network. Inference in DEFs uses black box variational inference (Ranganath et al., 2014), which we did not have time to discuss.
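To make the generative direction concrete, here is a minimal sketch of ancestral sampling in a two-layer DEF in the spirit of the sparse gamma construction discussed below; the layer sizes, hyperparameters, and weight priors are illustrative assumptions rather than the paper's exact specification.

    # Minimal sketch of ancestral sampling in a 2-layer DEF with gamma latent
    # layers and a Poisson observation layer (sizes and hyperparameters are
    # illustrative, not the paper's settings).
    import numpy as np

    rng = np.random.default_rng(0)

    K2, K1, V = 5, 10, 50          # top layer, bottom latent layer, vocabulary size
    alpha = 0.3                    # gamma shape used at every latent layer

    # Weights connecting the layers; drawn from a gamma prior so they stay positive.
    W1 = rng.gamma(shape=0.1, scale=1.0, size=(K2, K1))   # layer 2 -> layer 1
    W0 = rng.gamma(shape=0.1, scale=1.0, size=(K1, V))    # layer 1 -> observations

    # Top layer from its prior p(z_2).
    z2 = rng.gamma(shape=alpha, scale=1.0, size=K2)

    # Next layer: gamma whose mean is the inner product z_2^T W_1
    # (numpy's gamma takes shape and scale, so scale = mean / shape).
    mean1 = z2 @ W1
    z1 = rng.gamma(shape=alpha, scale=mean1 / alpha)

    # Observations: Poisson with rate z_1^T W_0.
    x = rng.poisson(z1 @ W0)
    print(x.shape, x.sum())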

The paper also provides three concrete instances of DEFs: sparse gamma DEFs, sigmoid belief networks, and Poisson DEFs. Different exponential families come with different natural link functions g_l(\cdot) and prior distributions p(W), summarized as follows:

[Table: link functions g_l(\cdot) and weight priors p(W) for sparse gamma DEFs, sigmoid belief networks, and Poisson DEFs]
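As one instance, the sigmoid belief network layer uses Bernoulli latent variables whose success probabilities come from a sigmoid link applied to z_{l+1}^T W_l; below is a minimal sketch of a single conditional layer, with the layer sizes and the scale of the Gaussian weight prior chosen only for illustration.

    # One conditional layer of a sigmoid belief network: Bernoulli latents whose
    # success probability is sigmoid(z_{l+1}^T W_l).  Sizes and the weight prior
    # scale are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(1)

    K_upper, K_lower = 8, 20
    W = rng.normal(loc=0.0, scale=1.0, size=(K_upper, K_lower))  # Gaussian weight prior

    z_upper = rng.binomial(1, 0.5, size=K_upper)     # a binary setting of the layer above
    probs = 1.0 / (1.0 + np.exp(-(z_upper @ W)))     # sigmoid link to Bernoulli means
    z_lower = rng.binomial(1, probs)                 # sample the layer below
    print(probs.round(2), z_lower)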

Finally, they evaluated various DEFs on text data and combined multiple DEFs into a model for pairwise recommendation data. In an extensive study, they showed that going beyond one layer improves DEF predictions, that DEFs uncover interesting exploratory structure in large data sets, and that they give better predictive performance than state-of-the-art models.

This paper is interesting because it proposes a general deep hierarchical structure for all kinds of exponential families, avoiding complicated inference implementations for each individual hierarchical generative model, and because it uncovers a "dual" relationship between the nonlinearity functions of neural networks and the link functions in the Bayesian framework.


One thought on “Deep Exponential Families”

  1. One point that I think is worth emphasizing is that these are generative models, designed for unsupervised learning. So closer to deep LDA (as in ‘latent dirichlet allocation’) than today’s discriminatively trained CNNs. But still a very cool idea, exploiting useful properties of exponential families to build deep (where, ok, “deep”=3 layer) hierarchical models.
