October 11th NP Bayes Meetup

In today’s NP Bayes discussion group we returned to the 2006 Hierarchical Dirichlet Process (HDP) paper by Teh et al. to discuss sampling-based inference. We spent most of our time sorting through the notational soup needed to specify the HDP variables and their relationships to one another. This led to a brief discussion of implementation issues, and finally a description of the three Gibbs sampling techniques presented in the paper.

Incorporating dynamics into statistical models

This week in lab meeting we discussed “Improved predictions of lynx trappings using a biological model” by Cavan Reilly and Angelique Zeringue, available for download on Andrew Gelman’s blog.

Many popular statistical timeseries models are designed to be both computationally tractable and broadly applicable. In contrast, Reilly and Zeringue illustrate in a case study that incorporating a simple model of a timeseries’ underlying dynamics can yield a significantly better model – although at the cost of generality and computational tractability.

They consider a classic (though new to me!) dataset in timeseries analysis: recordings of the numbers of Canadian lynx trapped yearly from 1821 to 1934. Rather than directly modeling lynx trappings, they model the latent lynx population along with the population of its sole food source, the snowshoe hare. They assume the population dynamics are governed by the classic Lotka-Volterra equations, place a distribution on the proportion of the lynx population captured in a given year, and choose priors on several parameters based on ecological literature.
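For reference, the textbook Lotka-Volterra equations for the hare population H and lynx population L are

\displaystyle \frac{dH}{dt} = \alpha H - \beta H L, \qquad \frac{dL}{dt} = \delta H L - \gamma L,

where \alpha is the hares’ intrinsic growth rate, \beta the predation rate, \delta the rate at which predation is converted into lynx population growth, and \gamma the lynx death rate. (Reilly and Zeringue’s exact parameterization may differ slightly from this form.)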

Unfortunately, computing with their model seems to have been a bit of a chore. While evaluating the likelihood of a given parameter setting is straightforward enough (it requires just an integration of the Lotka-Volterra equations), they were unable to get their Monte Carlo simulations to converge, and instead used simulated annealing to locate local modes in parameter space.
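As a rough sketch of that pipeline (in Python, and emphatically not the authors’ code), one could score the observed trappings against an integrated Lotka-Volterra trajectory and hand the resulting objective to a simulated-annealing routine. The Poisson observation model, the parameter bounds, and the lynx.txt data file below are illustrative assumptions, not the paper’s actual choices:

import numpy as np
from scipy.integrate import odeint
from scipy.optimize import dual_annealing
from scipy.stats import poisson

years = np.arange(1821, 1935)
trappings = np.loadtxt("lynx.txt")  # hypothetical file of yearly trapping counts

def lotka_volterra(pop, t, alpha, beta, delta, gamma):
    H, L = pop
    return [alpha * H - beta * H * L, delta * H * L - gamma * L]

def neg_log_lik(params):
    alpha, beta, delta, gamma, H0, L0, p = params
    traj = odeint(lotka_volterra, [H0, L0], years - years[0],
                  args=(alpha, beta, delta, gamma))
    lam = np.clip(p * traj[:, 1], 1e-6, None)  # expected trappings each year
    return -poisson.logpmf(trappings, lam).sum()

# illustrative box constraints on (alpha, beta, delta, gamma, H0, L0, p)
bounds = [(0.1, 2.0), (1e-4, 1e-1), (1e-5, 1e-2), (0.1, 2.0),
          (1e3, 1e5), (1e2, 1e4), (1e-3, 0.5)]
mode = dual_annealing(neg_log_lik, bounds, seed=0)
print(mode.x, mode.fun)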

What’s surprising is that despite these imperfections, the model predictions from the local mode they give in the paper (augmented by an AR(1) process model) fit the data well. They outperform their declared contender, the SETAR model, despite its having many more parameters and being fit to more data.

In the end we were left a bit confused by some of their choices, both in the modeling and in the computation. Overall it made for a lively, if sometimes puzzled, discussion.

Comp Neuro JC: “Implicit Encoding of Prior Probabilities in Optimal Neural Populations”

For Wednesday’s Computational & Theoretical Neuroscience Journal Club I presented a paper by Deep Ganguli and Eero Simoncelli on priors in optimal population coding.

The paper considers a population of neurons which encode a single, scalar stimulus variable, s. Given that each stimulus value s has some probability p(s) of occurring in the environment, what is the best possible population code?

The paper frames this question as a mathematical optimization problem, quantifying the notion of “best population” using Fisher Information as a measure of the amount of information the population carries about a given stimulus. Through some careful assumptions and approximations, and by parameterizing the solutions through a very clever “warping” transform of a simple population, they’re able to obtain an optimization program that’s analytically solvable. The end result is a set of predictions for a population’s density (neurons per unit stimulus, roughly) and gain (mean spike rate) as a function of the prior, p(s) – an implicit encoding of the prior in the population. Comparisons of measured prior distributions of spatial frequency and orientation (in natural images) with predictions based on experimentally-recorded densities yield good matches.
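As a toy illustration of the warping idea (a sketch only, not the paper’s exact parameterization), one can take evenly spaced bumps on the unit interval and pull them back through the CDF of an assumed Gaussian prior, so that preferred stimuli land at the quantiles of p(s) and cell density tracks the prior. The gain is held constant here, whereas the paper derives it from the objective as well:

import numpy as np
from scipy.stats import norm

prior = norm(loc=0.0, scale=1.0)  # assumed stimulus prior p(s); arbitrary choice
n_neurons, width, gain = 20, 0.05, 10.0

# "simple" homogeneous population: bump centers evenly spaced on [0, 1]
u_centers = (np.arange(n_neurons) + 0.5) / n_neurons

def rate(s, u_c):
    # A fixed-width bump in the warped coordinate F(s); back in stimulus space
    # the bumps are denser and narrower wherever p(s) is large.
    return gain * np.exp(-0.5 * ((prior.cdf(s) - u_c) / width) ** 2)

s = np.linspace(-3, 3, 500)
population = np.array([rate(s, u) for u in u_centers])  # shape (n_neurons, 500)
preferred = prior.ppf(u_centers)  # preferred stimuli sit at quantiles of the prior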

NP Bayes Reading Group: 4th meeting

Last Friday, while everyone else was toweling dry and/or knocking small crustaceans out of their ears, I presented a proof of de Finetti’s theorem given in (Heath and Sudderth, 1976).

When presented with an infinite sequence of coinflips, X_1, X_2, \dots, we as frequentists presume that each flip is drawn independently from the same distribution. That is, X_k \sim \text{Bernoulli}(\theta), for some \theta, and the joint probability of our data, as we receive it, factorizes as p(x_1, \dots,x_N) = \prod_{k=1}^N p(x_k). Although we don’t know what \theta is when the coins start falling, \theta is not random. Our uncertainty is due to finite sample size; in fact, \lim_{N\rightarrow \infty} \frac{1}{N}\sum_{k=1}^N X_k = \theta.
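A quick simulation (with an arbitrarily chosen \theta) shows the running average settling onto the true parameter:

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                                   # arbitrary "true" coin bias
flips = rng.binomial(1, theta, size=100000)   # X_1, ..., X_N
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)
print(running_mean[[99, 9999, 99999]])        # approaches 0.3 as N grows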

To Bayesians, on the other hand, \theta is a random variable, and its distribution p(\theta) reflects our prior uncertainty about its value. To obtain our joint probability we marginalize over \theta and end up with p(x_1, \dots,x_N) = \int \prod_{k=1}^N p(x_k | \theta) p(\theta)d\theta. Now the sequence X_1, X_2,\dots is not independent, but it is infinitely-exchangeable: the joint probability is unchanged under any reordering of the X_k.
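A concrete example: take p(\theta) to be a \text{Beta}(a, b) prior. Writing k = \sum_{i=1}^N x_i for the number of heads, the marginal becomes

\displaystyle p(x_1, \dots, x_N) = \int_0^1 \theta^{k}(1-\theta)^{N-k}\, \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}\, d\theta = \frac{B(a+k,\, b+N-k)}{B(a,b)},

which depends on the data only through k, so any reordering of the x_i leaves the joint probability unchanged: the sequence is exchangeable but not independent.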

So, mixing conditionally-independent distributions with a prior over \theta results in an exchangeable sequence of random variables. The de Finetti theorem states that the converse is also true:

Theorem 1 (de Finetti) A sequence of Bernoulli random variables {X_1, X_2, \dots} is infinitely-exchangeable if and only if there exists a random variable {\theta} with distribution function {F(\theta)} such that the joint probability has the form

\displaystyle  p(x_1, \dots, x_n) = \int \prod_{i=1}^{n} p(x_i|\theta) dF(\theta) \ \ \ \ \ (1)

This is seen as validating the Bayesian point of view: it implies that whenever the data are (infinitely) exchangeable, there must exist a parameter \theta with a prior, and a likelihood p(x|\theta), with respect to which the data are conditionally independent.

Exchangeability crops up frequently in our DP discussions (for instance, the DP and CRP are exchangeable), which suggests that, although we only discussed and proved the theorem in the Bernoulli form above, it also holds in much more general settings.

Next week, Kenneth will lead the discussion on applying the theory we’ve learned so far to practical clustering algorithms (he had a pass this week).