# A Hierarchical Pitman-Yor Model of Natural Language

In the lab meeting on 9/17, we discussed the hierarchical, non-parametric Bayesian model for discrete sequence data presented in:

Wood, F., Archambeau, C., Gasthaus, J., James, L., & Teh, Y. W. (2009). A Stochastic Memoizer for Sequence Data. ICML 2009.

The authors extend previous work that used hierarchically linked Pitman-Yor processes to model the predictive distribution of a word given a context of finite length (an n-gram model), and here consider the distribution of words conditioned on a context of unbounded length (an $\infty$-gram model). The hierarchical structure combines information from contexts of different lengths, and the Pitman-Yor process produces power-law distributions over words similar to those observed in natural language. The authors develop the sequence memoizer, using coagulation and fragmentation operators to marginalize out intermediate contexts, which reduces computational complexity and yields a collapsed graphical model on which inference is more efficient. The model is shown to perform well (i.e., achieve low perplexity) relative to existing models when applied to New York Times and Associated Press data.

# Revivifying the NP Bayes Reading Group

After a nearly 1-year hiatus, we’ve restarted our reading group on non-parametric (NP) Bayesian methods, focused on models for discrete data based on generalizations of the Dirichlet and other stick-breaking processes.

Thursday (9/20) was our first meeting, and Karin led a discussion of:

Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 985-992.

In the first meeting, we made it only as far as describing the Pitman-Yor (PY) process, a stochastic process whose samples are random probability distributions, and two methods for sampling from it:

1. Chinese restaurant sampling (a.k.a. the “Blackwell-MacQueen urn scheme”), which directly provides samples $\{X_i\}$ from a draw $G \sim PY$ with $G$ marginalized out.
2. Stick-breaking, which samples the distribution $G = \sum_i \pi_i \delta_{\phi_i}$ explicitly, using independent draws of Beta random variables to obtain the stick weights $\pi_i$ (both schemes are sketched in code below).
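To make the two schemes concrete, here is a minimal Python sketch (our own illustration, not code from any of the papers): `py_stick_breaking` draws a truncated approximation to $G$ itself, while `py_crp` draws $\{X_i\}$ directly with $G$ marginalized out. The base measure $H$ (a standard normal here) and the function names are our own arbitrary choices.

```python
import numpy as np

def py_stick_breaking(d, alpha, n_atoms, rng):
    """Truncated stick-breaking draw of G ~ PY(d, alpha, H), with H = N(0,1).

    V_k ~ Beta(1 - d, alpha + k*d);  pi_k = V_k * prod_{j<k} (1 - V_j).
    Truncating at n_atoms discards the (small) leftover stick mass.
    """
    k = np.arange(1, n_atoms + 1)
    V = rng.beta(1.0 - d, alpha + k * d)
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = rng.standard_normal(n_atoms)   # iid draws phi_k ~ H
    return pi, atoms

def py_crp(d, alpha, n_samples, rng):
    """Draw X_1, ..., X_n from PY(d, alpha, H) with G marginalized out
    (the generalized Blackwell-MacQueen / Chinese restaurant scheme).

    Customer n+1 joins existing table t with prob. (c_t - d) / (alpha + n)
    and opens a new table with prob. (alpha + d*K) / (alpha + n), where
    c_t is the number of customers at table t and K the number of tables.
    """
    counts, dishes = [], []                # table sizes and their dishes
    samples = np.empty(n_samples)
    for n in range(n_samples):
        K = len(counts)
        w = np.array([c - d for c in counts] + [alpha + d * K])
        t = rng.choice(K + 1, p=w / w.sum())
        if t == K:                         # new table: fresh dish from H
            counts.append(0)
            dishes.append(rng.standard_normal())
        counts[t] += 1
        samples[n] = dishes[t]
    return samples

rng = np.random.default_rng(0)
pi, atoms = py_stick_breaking(d=0.5, alpha=1.0, n_atoms=1000, rng=rng)
X = py_crp(d=0.5, alpha=1.0, n_samples=1000, rng=rng)
```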

We briefly discussed the intuition for the hierarchical PY process, which uses a draw from a PY process as the base measure for PY processes at deeper levels of the hierarchy (applied here to build an n-gram model for natural language); see the sketch below.
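As a cartoon of that intuition, here is a small Python sketch (ours, not Teh's implementation) of the hierarchical PY predictive probability, under the crude simplification of at most one table per word type in each restaurant and a single discount $d$ and concentration $\theta$ shared across levels (the paper uses depth-specific parameters). With $\theta = 0$ this reduces to interpolated Kneser-Ney smoothing, a connection made in the paper.

```python
from collections import Counter, defaultdict

def hpy_prob(w, ctx, counts, d=0.75, theta=1.0, vocab_size=10_000):
    """P(w | ctx) under a hierarchical PY prior, with the crude
    'one table per word type per restaurant' approximation.

    counts maps a context tuple to a Counter of next-word counts.
    Each context backs off recursively to the context with its oldest
    word dropped; the empty context backs off to uniform over the vocab.
    """
    parent = (hpy_prob(w, ctx[1:], counts, d, theta, vocab_size)
              if ctx else 1.0 / vocab_size)   # shorter-context base measure
    c = counts.get(ctx, Counter())
    n = sum(c.values())                       # customers in this restaurant
    if n == 0:
        return parent                         # no data here: pure back-off
    t = len(c)                                # tables = distinct word types
    return (max(c[w] - d, 0.0) + (theta + d * t) * parent) / (theta + n)

# Toy usage (hypothetical data): counts for contexts of length 0, 1, 2.
text = "the cat sat on the mat and the cat sat on the hat".split()
counts = defaultdict(Counter)
for i in range(len(text)):
    for k in range(3):                        # context lengths 0, 1, 2
        if i >= k:
            counts[tuple(text[i - k:i])][text[i]] += 1
print(hpy_prob("sat", ("the", "cat"), counts, vocab_size=len(set(text))))
```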

Next week: We’ve decided to go a bit further back in time to read:

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566-1581.

Time: Thursday (9/27), 4:00pm.
Location: Pillow lab
Presenter: Karin

Note: if you’d like to be added to the email announcement list for this group, please send email to pillow AT mail.utexas.edu.

# Using size-biased sampling for certain expectations

Let $\{\pi_i\}_i$ be a well-defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form $E\left[ \sum_i f(\pi_i) \right]$ for some function $f$; we are especially interested in $f(p) = -p \log p$, which gives Shannon's entropy. Following [1], we can rewrite this as

$E\left[ \sum_i \frac{f(\pi_i)}{\pi_i} \pi_i \right] = E\left[ E[ \frac{f(X)}{X} | \{\pi_i\}]\right]$

where $X$ is a random variable that takes the value $\pi_i$ with probability $\pi_i$. This random variable is better known as the first size-biased sample $\tilde{\pi}_1$, defined by $\Pr[ \tilde{\pi}_1 = \pi_i \mid \{\pi_i\}_i ] = \pi_i$. In other words, it picks one of the probabilities in $\{\pi_i\}_i$, choosing $\pi_i$ with probability $\pi_i$.
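Expanding the inner expectation makes the identity a one-line check:

$E\left[ \frac{f(X)}{X} \,\middle|\, \{\pi_i\} \right] = \sum_i \Pr[ X = \pi_i \mid \{\pi_i\} ] \, \frac{f(\pi_i)}{\pi_i} = \sum_i \pi_i \, \frac{f(\pi_i)}{\pi_i} = \sum_i f(\pi_i),$

and taking the outer expectation recovers $E\left[ \sum_i f(\pi_i) \right]$ by the tower property.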

For a Pitman-Yor process (PY) with discount parameter $d$ and concentration parameter $\alpha$ (the Dirichlet process is the special case $d = 0$), size-biased samples arise naturally from the stick-breaking construction. Given a sequence of independent random variables $V_n \sim Beta(1-d, \alpha+nd)$, define $\pi_i = V_i \prod_{k=1}^{i-1} (1 - V_k)$; then the sequence $\{\pi_i\}_i$ is invariant under size-biased permutation [2], i.e., it already forms a sequence of size-biased samples. In our case we only need the first size-biased sample, which is simply distributed as $V_1 \sim Beta(1-d, \alpha+d)$.

Using this trick, we can compute the expected entropy under a PY prior without complicated integrals over the infinite simplex. We used this trick and its extensions in computing our PY-based entropy estimator.
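As a quick numerical sanity check (a sketch we wrote for illustration, not the estimator itself): with $f(p) = -p \log p$, the identity above gives $E\left[ \sum_i -\pi_i \log \pi_i \right] = E[-\log V_1]$, and since $E[\log X] = \psi(a) - \psi(a+b)$ for $X \sim Beta(a, b)$, the expected PY entropy has the closed form $\psi(1+\alpha) - \psi(1-d)$. A Monte Carlo average of $-\log V_1$ should match it.

```python
import numpy as np
from scipy.special import digamma

def py_expected_entropy_mc(d, alpha, n_mc, rng):
    """Monte Carlo estimate of E[-sum_i pi_i log pi_i] under PY(d, alpha).

    By the size-biased trick with f(p) = -p log p, this equals E[-log V_1],
    where V_1 ~ Beta(1 - d, alpha + d) is the first stick-breaking weight.
    """
    V1 = rng.beta(1.0 - d, alpha + d, size=n_mc)
    return -np.log(V1).mean()

def py_expected_entropy_exact(d, alpha):
    """Closed form E[-log V_1] = psi(1 + alpha) - psi(1 - d), using
    E[log X] = psi(a) - psi(a + b) for X ~ Beta(a, b) with a + b = 1 + alpha."""
    return digamma(1.0 + alpha) - digamma(1.0 - d)

rng = np.random.default_rng(0)
print(py_expected_entropy_mc(d=0.5, alpha=2.0, n_mc=200_000, rng=rng))
print(py_expected_entropy_exact(d=0.5, alpha=2.0))  # should nearly agree
```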

1. Pitman, J., & Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855-900. doi:10.1214/aop/1024404422
2. Perman, M., Pitman, J., & Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21-39. doi:10.1007/BF01205234