# Deep Exponential Families

In this week’s lab meeting, I presented:

Deep Exponential Families
Rajesh Ranganath, Linpeng Tang, Laurent Charlin and David Blei.
http://arxiv.org/abs/1411.2581

This paper describes a class of latent variable models, called Deep Exponential Families (DEFs), inspired by deep neural networks and hierarchical generative models. DEFs stack multiple layers of exponential family distributions and connect them through link functions to capture a hierarchy of dependencies.

Exponential families have the general form $p(x|\eta)=h(x)\exp(\eta^T T(x)-a(\eta))$, where $h(x)$ is the base measure, $T(x)$ is the vector of sufficient statistics, $\eta$ is the natural parameter and $a(\eta)$ is the log-normalizer. An attractive property of exponential family distributions is that $\mathbb{E}[T(x)]=\nabla_{\eta}a(\eta)$, which DEFs exploit later.
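As a quick worked example (an illustration that is not taken from the paper), the Poisson distribution with rate $\lambda$ can be written in this form, and the identity can be checked directly:

$$p(x|\lambda)=\frac{\lambda^{x}e^{-\lambda}}{x!}=\frac{1}{x!}\exp(x\log\lambda-\lambda),$$

so $h(x)=1/x!$, $T(x)=x$, $\eta=\log\lambda$ and $a(\eta)=e^{\eta}$. Then $\nabla_{\eta}a(\eta)=e^{\eta}=\lambda=\mathbb{E}[x]$, as the identity requires.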

Imitating deep neural networks, these models stack multiple exponential family distributions in a hierarchical structure:

Here, $z_l$ is the vector of latent variables at layer $l$, $K_l$ is the number of variables in layer $l$, and $W_l$ is the weight matrix connecting layer $l+1$ to layer $l$. The generative direction runs from top to bottom. The top layer’s latent variables are drawn from a prior distribution $p(z_L)=\mbox{EXPFAM}_L(z_L,\eta)$. The conditional distribution of layer $l$ given layer $l+1$ is $p(z_l|z_{l+1},W_l)=\mbox{EXPFAM}_l(z_l,g_l(z_{l+1}^T W_l))$, where the link function $g_l$ maps the inner product $z_{l+1}^T W_l$ to the natural parameter of the distribution that generates $z_l$. The bottom layer is the observation likelihood, which is Poisson in this paper: $p(x|z_1,W_0)=\mbox{Poisson}(z_1^T W_0)$.

The expected sufficient statistics equal the gradient of the log-normalizer, $\mathbb{E}[T(z_l)]=\nabla_{\eta_l}a(\eta_l)$, where $\eta_l=g_l(z_{l+1}^T W_l)$. So if we take $g_l(x)=x$ and $T(z_l)=z_l$, then $\mathbb{E}[z_l]$ is just the linear function $z_{l+1}^T W_l$ of the layer above, transformed by $\nabla_{\eta_l}a(\cdot)$. This transformation is one source of non-linearity.

Inference for DEFs uses black box variational inference (Ranganath et al., 2014), which we didn’t have time to discuss.
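To make the top-to-bottom generative process concrete, here is a minimal NumPy sketch of ancestral sampling from a small two-layer DEF with Bernoulli latent layers, an identity link, and Poisson observations. It is an illustration under stated assumptions, not the paper's implementation: the function name `sample_def`, the `prior_logit` parameter, the layer sizes, and the normal and gamma draws used for the weights are all hypothetical choices. For Bernoulli layers the log-normalizer is $a(\eta)=\log(1+e^{\eta})$, so $\nabla_{\eta}a$ is the sigmoid, which is exactly the non-linearity described above.

```python
import numpy as np

def sigmoid(eta):
    # Gradient of the Bernoulli log-normalizer a(eta) = log(1 + exp(eta)).
    return 1.0 / (1.0 + np.exp(-eta))

def sample_def(rng, W1, W0, prior_logit=0.0, n=1):
    """Ancestral sampling, top to bottom, for a hypothetical two-layer DEF."""
    K2 = W1.shape[0]
    # Top layer: Bernoulli prior whose natural parameter is prior_logit.
    z2 = rng.binomial(1, sigmoid(prior_logit), size=(n, K2))
    # Layer 1: identity link, so the natural parameter is z2 @ W1 and
    # E[z1 | z2] = sigmoid(z2 @ W1).
    z1 = rng.binomial(1, sigmoid(z2 @ W1))
    # Observations: Poisson with rate z1 @ W0 (assumes W0 is non-negative).
    x = rng.poisson(z1 @ W0)
    return z2, z1, x

rng = np.random.default_rng(0)
K2, K1, V = 3, 5, 8                                  # assumed layer sizes
W1 = rng.normal(0.0, 1.0, size=(K2, K1))             # assumed normal weights
W0 = rng.gamma(shape=0.5, scale=1.0, size=(K1, V))   # assumed gamma weights
z2, z1, x = sample_def(rng, W1, W0, n=4)
```

Each row of `x` is one simulated count vector. Swapping the Bernoulli draws for gamma or Poisson draws, with a suitable link function, would give the other DEF variants in the same skeleton.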

The paper also provides three concrete instances of DEFs: sparse gamma DEFs, sigmoid belief networks, and Poisson DEFs. Different exponential families call for different natural link functions $g_l(\cdot)$ and different prior distributions $p(W)$ on the weights, summarized as follows:

Finally, they evaluated various DEFs on text and combined multiple DEFs into a model for pairwise recommendation data. In an extensive study, they showed that going beyond one layer improves predictions for DEFs. They also demonstrated that DEFs find interesting exploratory structure in large data sets and give better predictive performance than state-of-the-art models.

This paper is interesting for two reasons. It proposes a general deep hierarchical structure for all kinds of exponential families, which avoids complicated inference implementations for various hierarchical generative models. It also uncovers a “dual” relationship between the nonlinearities of neural networks and the link functions of the Bayesian framework.