In today’s NP Bayes discussion group we returned to the 2006 Hierarchical Dirichlet Process (HDP) paper by Teh et al. to discuss sampling-based inference. We spent most of our time sorting through the notational soup needed to specify the HDP variables and their relationships to one another. This led to a brief discussion of implementation issues, and finally to a description of the three Gibbs sampling techniques presented in the paper.
Our second NPB reading group meeting took aim at the seminal 2006 paper (with >1000 citations!) by Teh, Jordan, Beal & Blei on Hierarchical Dirichlet Processes. We were joined by newcomers Piyush Rai (newly arrived SSC postdoc), and Ph.D. students Dan Garrette (CS) and Liang Sun (mathematics), both of whom have experience with natural language models.
We established a few basic properties of the hierarchical DP, such as the fact that it creates dependencies between DPs by endowing them with a common base measure, which is itself sampled from a DP. That is:
- $G_0 \sim \mathrm{DP}(\gamma, H)$ (“global measure” sampled from a DP with base measure $H$ and concentration $\gamma$).
- $G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0)$ (sequence of conditionally independent random measures with common base measure $G_0$; e.g., the $G_j$ are distributions over clusters from data collected on different days)
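To make the two-level construction concrete, here is a rough sketch of a truncated draw from an HDP prior. It is only an illustration, not the sampler from the paper: the truncation level, the choice of a standard normal base measure $H$, and all function names are assumptions made for this example. It relies on the fact that when the global measure has finitely many atoms with weights $\beta$, a DP draw with that base measure and concentration $\alpha_0$ puts Dirichlet($\alpha_0 \beta$) weights on the same atoms.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Truncated stick-breaking weights for a DP with the given concentration."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, concentration)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # leftover mass goes to the last atom so weights sum to 1
    return weights


def dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    # The floor keeps the shape parameter strictly positive, as gammavariate requires.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]


def sample_hdp(gamma, alpha0, num_groups, truncation, rng):
    """Truncated draw of the global measure and the per-group measures."""
    beta = stick_breaking_weights(gamma, truncation, rng)      # weights of the global measure
    atoms = [rng.gauss(0.0, 1.0) for _ in range(truncation)]   # atoms from H (assumed N(0, 1))
    # Each group reweights the SAME atoms: weights ~ Dirichlet(alpha0 * beta).
    group_weights = [dirichlet([alpha0 * b for b in beta], rng)
                     for _ in range(num_groups)]
    return atoms, beta, group_weights
```

The point to notice is that every group shares the same list of atoms and differs only in its weights, which is what lets clusters be shared across groups.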
Beyond this, we got bogged down in confusion over metaphors and interpretations, unclear whether ‘s were topics or documents or tables or restaurants or ethnicities, and were hampered by having two different versions of the manuscript floating around with different page numbers and figures.
This week: we’ll take up where we left off, focusing on Section 4 (“Hierarchical Dirichlet Processes”) with discussion led by Piyush. We’ll agree to show up with the same (“official journal”) version of the manuscript, available: here.
Time: 4:00 PM, Thursday, Oct 4.
Location: SEA 5.106
Please email pillow AT mail.utexas.edu if you’d like to be added to the announcement list.
After a nearly 1-year hiatus, we’ve restarted our reading group on non-parametric (NP) Bayesian methods, focused on models for discrete data based on generalizations of the Dirichlet and other stick-breaking processes.
Thursday (9/20) was our first meeting, and Karin led a discussion of:
Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor
processes. Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the
Association for Computational Linguistics. 985-992
In the first meeting, we made it only as far as describing the Pitman-Yor (PY) process, a stochastic process whose samples are random probability distributions, and two methods for sampling from it:
- Chinese Restaurant sampling (aka “Blackwell-MacQueen urn scheme”), which directly provides samples from the random distribution $G$, with $G$ marginalized out.
- Stick-breaking, which samples the distribution $G$ explicitly, using independent draws of Beta random variables to obtain stick weights $\pi_k$.
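Both schemes are short enough to sketch in code. The version below is a minimal illustration, assuming a discount parameter `d` in [0, 1) and a concentration parameter `alpha` > 0 (names chosen for this example); it tracks only table sizes and stick weights, not the atoms drawn from the base measure.

```python
import random


def py_crp(n, d, alpha, rng):
    """Seat n customers by the two-parameter CRP; return the table sizes."""
    sizes = []
    for i in range(n):  # i customers are already seated
        k = len(sizes)
        # New table has weight alpha + d*k; table t has weight sizes[t] - d.
        # The weights sum to alpha + i.
        u = rng.random() * (alpha + i)
        new_table_weight = alpha + d * k
        if k == 0 or u < new_table_weight:
            sizes.append(1)
            continue
        u -= new_table_weight
        for t in range(k):
            u -= sizes[t] - d
            if u < 0:
                sizes[t] += 1
                break
        else:
            sizes[-1] += 1  # guard against floating-point round-off
    return sizes


def py_stick_weights(num_sticks, d, alpha, rng):
    """First num_sticks stick-breaking weights, plus the unbroken remainder."""
    weights, remaining = [], 1.0
    for k in range(1, num_sticks + 1):
        # Independent but not identically distributed: V_k ~ Beta(1 - d, alpha + k*d).
        v = rng.betavariate(1.0 - d, alpha + k * d)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining
```

Setting `d = 0` recovers the Dirichlet process, where the Beta draws become iid Beta(1, alpha).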
We briefly discussed the intuition for the hierarchical PY process, which uses a PY process as the base measure for the PY process priors at deeper levels of the hierarchy (applied here to develop an n-gram model for natural language).
Next week: We’ve decided to go a bit further back in time to read:
Teh, Y. W.; Jordan, M. I.; Beal, M. J. & Blei, D. M. (2006). Hierarchical dirichlet processes. Journal of the American Statistical Association 101:1566-1581.
Time: Thursday (9/27), 4:00pm.
Location: Pillow lab
note: if you’d like to be added to the email announcement list for this group, please send email to pillow AT mail.utexas.edu.
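Let $\pi = (\pi_1, \pi_2, \ldots)$ be a well-defined infinite discrete probability distribution (e.g., a draw from a Dirichlet process (DP)). We are interested in evaluating expectations of the form $E\left[\sum_{i=1}^{\infty} f(\pi_i)\right]$ for some function $f$ (we are especially interested in $f(x) = -x \log x$, which gives us Shannon’s entropy). Following the references below, we can rewrite it as
$$E\left[\sum_{i=1}^{\infty} f(\pi_i)\right] = E\left[\frac{f(\tilde\pi_1)}{\tilde\pi_1}\right]$$
where $\tilde\pi_1$ is a random variable that takes the value $\pi_i$ with probability $\pi_i$. This random variable is better known as the first size-biased sample. It is defined by $P(\tilde\pi_1 = \pi_i \mid \pi) = \pi_i$. In other words, it picks one of the probabilities among $\{\pi_i\}$, choosing $\pi_i$ with probability $\pi_i$. The identity follows by conditioning on $\pi$: $E\left[f(\tilde\pi_1)/\tilde\pi_1 \mid \pi\right] = \sum_i \pi_i \, f(\pi_i)/\pi_i = \sum_i f(\pi_i)$.
For the Pitman-Yor process (PY) with discount parameter $d$ and concentration parameter $\alpha$ (the Dirichlet process is the special case $d = 0$), the size-biased samples are naturally obtained by the stick-breaking construction. Given a sequence of independent random variables $V_n$ distributed as $\mathrm{Beta}(1 - d, \alpha + n d)$, if we define $\pi_i = V_i \prod_{k=1}^{i-1} (1 - V_k)$, then the sequence $(\pi_i)$ is invariant under size-biased permutation (see the references below), and its elements form a sequence of size-biased samples. In our case, we only need the first size-biased sample, which is simply distributed as $\mathrm{Beta}(1 - d, \alpha + d)$.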
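As a sanity check, the size-biased shortcut can be compared numerically against a brute-force average of the entropy over truncated stick-breaking draws. The sketch below assumes discount `d` and concentration `alpha`, with the first size-biased weight distributed as Beta(1 - d, alpha + d); the function names and truncation level are choices made for this illustration.

```python
import math
import random


def entropy_of_truncated_draw(d, alpha, num_sticks, rng):
    """Entropy (in nats) of one stick-breaking draw, truncated at num_sticks."""
    remaining, h = 1.0, 0.0
    for k in range(1, num_sticks + 1):
        v = rng.betavariate(1.0 - d, alpha + k * d)
        p = remaining * v
        if p > 0.0:
            h -= p * math.log(p)
        remaining *= 1.0 - v
    return h


def mean_entropy_direct(d, alpha, num_sticks, num_draws, rng):
    """Brute force: average the entropy over many truncated draws."""
    total = sum(entropy_of_truncated_draw(d, alpha, num_sticks, rng)
                for _ in range(num_draws))
    return total / num_draws


def mean_entropy_size_biased(d, alpha, num_draws, rng):
    """Shortcut: with f(x) = -x log x, f(x)/x = -log x, so average -log of
    the first size-biased weight."""
    total = 0.0
    for _ in range(num_draws):
        x = rng.betavariate(1.0 - d, alpha + d)
        total -= math.log(max(x, 1e-300))  # floor avoids log(0) on a degenerate draw
    return total / num_draws
```

For the DP with `alpha = 1` the first size-biased weight is uniform on (0, 1), so the expected entropy is $E[-\log U] = 1$ nat, and both estimators should land near that value.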
- Jim Pitman, Marc Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, Vol. 25, No. 2. (April 1997), pp. 855-900, doi:10.1214/aop/1024404422
- Mihael Perman, Jim Pitman, Marc Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, Vol. 92, No. 1. (21 March 1992), pp. 21-39, doi:10.1007/BF01205234
Continuing from last week (Generative model diagram for stick breaking), we established that the Chinese restaurant process (CRP) is exchangeable, and that the underlying random measure given by de Finetti’s theorem is the Dirichlet process (DP); that is, the CRP is what you get when the DP is marginalized out. The simple form of the CRP’s conditional distributions provides an easy way of sampling. Then, we discussed the form of the posterior of the DP, whose expectation (marginal) coincides with the CRP. Next week, Kenneth will lead the discussion on applying the theory we have learned so far to practical clustering algorithms.
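Exchangeability of the CRP is easy to check by brute force on a small example: the probability of a given grouping of customers should not depend on the order in which they arrive. The sketch below uses exact rational arithmetic; the function name and the example grouping are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def seating_probability(order, table_of, alpha):
    """Probability that customers arriving in `order` end up grouped as in `table_of`."""
    prob = Fraction(1)
    counts = {}
    for i, customer in enumerate(order):  # i customers are already seated
        table = table_of[customer]
        seated = counts.get(table, 0)
        # Join an occupied table w.p. seated/(alpha + i); open a new one w.p. alpha/(alpha + i).
        numerator = seated if seated > 0 else alpha
        prob *= Fraction(numerator) / (alpha + i)
        counts[table] = seated + 1
    return prob


# Four customers: 0, 1 and 3 share one table, customer 2 sits alone.
table_of = {0: "a", 1: "a", 2: "b", 3: "a"}
probs = {seating_probability(order, table_of, 1) for order in permutations(range(4))}
```

Here `probs` collapses to a single value, because the product depends only on the table sizes and not on the arrival order.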
Continuing from last week, we discussed the formulation of generative clustering (mixture model) with a fixed number of clusters K, using the Dirichlet distribution as a prior for the cluster size distribution, following Jordan’s slides. The definition of the Dirichlet process (DP) and its existence were briefly shown via the Kolmogorov extension theorem. Following (Sethuraman, 1994), we discussed the stick-breaking construction of the DP. Stick breaking provides the size-biased permutation of the Poisson-Dirichlet distribution obtained by the Kingman limit (Kingman, 1975). The following fun facts about (extended) Dirichlet distribution are from (Sethuraman, 1994).
Next week, we will continue the discussion of the DP as a prior for nonparametric Bayesian clustering, the posterior of the DP, and how to do inference with the DP. (Jordan slide #45)
Possible further exploration:
- Sampling from Poisson-Dirichlet distribution (Donnelly-Tavaré-Griffiths sampling?)
- Proof of Lemma 3.2 from Sethuraman 1994
We’ve started a reading group to come to grips with some of the recent developments in non-parametric (NP) Bayesian modeling, in particular, hierarchical Bayesian models for discrete data. The defining characteristic of NP models is that the number of parameters scales with the amount of data (leading to an infinite number of parameters in the limit of infinite data). Although these have sparked a mini-revolution in cognitive psychology (e.g., Tenenbaum, Griffiths & Kemp 2006), they do not appear to have found much application to statistical analysis of neural data (with the exception of spike sorting; see, e.g., Wood & Black 2008).
Our first assignment is to go through the slides from Michael Jordan’s 2005 NIPS tutorial (slides.ps). Last week we began, and made it through slide #23, covering the basic ideas of non-parametric models, exchangeability, De Finetti’s theorem, conjugate priors, Gibbs sampling, graphical models, Dirichlet & Beta distributions.
A few issues set aside for further exploration:
- proof of De Finetti’s theorem (Evan)
- relationship between CRP and stick-breaking (JP)
- slide 13: “A short calculation shows…” (Joe)
- proof that # of occupied tables is O(log n). (Memming)
- aggregation property (Ken)
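On the number of occupied tables: in the CRP, the customer arriving when i others are already seated opens a new table with probability alpha/(alpha + i), regardless of how the others are seated. So the expected number of tables after n customers is the sum of these probabilities, which grows like alpha log n. A quick sketch for checking this numerically (function names are made up for this example):

```python
import math
import random


def expected_num_tables(n, alpha):
    """Exact expected number of occupied tables after n customers."""
    return sum(alpha / (alpha + i) for i in range(n))


def simulate_num_tables(n, alpha, rng):
    """One simulated table count: each arrival opens a new table w.p. alpha/(alpha + i)."""
    return sum(1 for i in range(n) if rng.random() * (alpha + i) < alpha)


def log_bounds(n, alpha):
    """Integral bounds on the expectation, showing the O(log n) growth."""
    lower = alpha * math.log(1.0 + n / alpha)
    upper = 1.0 + alpha * math.log(1.0 + (n - 1) / alpha)
    return lower, upper
```

The bounds come from comparing the sum with the integral of alpha/(alpha + x), so the expectation is sandwiched between two logarithms in n.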