This aggression will not stand

# Lab meeting 6/27/2011

This week I followed up on the previous week’s meeting about state-space models with a tutorial on Kalman filtering / smoothing.  We started with three Gaussian “fun facts” about linear transformations of Gaussian random variables and products of Gaussian densities.  Then we derived the Kalman filtering equations, the EM algorithm, and discussed a simple implementation of Kalman smoothing using sparse matrices and the “backslash” operator in matlab.

Here’s how to do Kalman smoothing in one line of matlab:
```matlab
Xmap = (Qinv + speye(nsamps) / varY) \ (Y / varY + Qinv * muX);
```

where the latent variable X has prior mean muX and inverse covariance Qinv, and Y | X is Gaussian with mean X and variance varY * I.  Note Qinv is tri-diagonal and can be formed with a single call to “spdiags”.
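
Here’s the same sparse-solve trick as a Python/NumPy sketch, for anyone allergic to matlab (toy sizes and a hypothetical random-walk prior, just for illustration):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy sizes and parameters (hypothetical, for illustration only)
nsamps, varY, varQ = 200, 0.5, 0.1
muX = np.zeros(nsamps)                 # prior mean of the latent X

# Tridiagonal prior inverse covariance for a random-walk prior on X,
# the analogue of building Qinv with a single call to spdiags
main = np.full(nsamps, 2.0 / varQ)
main[0] = main[-1] = 1.0 / varQ
off = np.full(nsamps - 1, -1.0 / varQ)
Qinv = sp.diags([off, main, off], [-1, 0, 1], format="csc")

# Simulate a latent random walk and observations Y | X ~ N(X, varY * I)
rng = np.random.default_rng(0)
x_true = np.cumsum(rng.normal(0.0, np.sqrt(varQ), nsamps))
y = x_true + rng.normal(0.0, np.sqrt(varY), nsamps)

# The one-liner: the MAP estimate solves a single sparse linear system
xmap = spla.spsolve(Qinv + sp.eye(nsamps, format="csc") / varY,
                    y / varY + Qinv @ muX)
```

Since everything here is Gaussian, the sparse solve returns the exact posterior mean, and the tridiagonal structure keeps the cost linear in nsamps.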

# Lab Boat Trip (June 25)

Following an especially harrowing session of Dirichlet Process reading group, Memming & Joe organized a lab boat outing on Lake Travis, providing a much needed break from infinite-dimensional distributions…

# Comp Neuro JC on “Probabilistic Neural Representations”

Wednesday (June 22), I presented the 3rd segment in a 4-part series on “Probabilistic Representations in the Brain” in the Computational & Theoretical Neuroscience Journal Club.  This summer, Comp JC has been re-configured to allow each lab to present a bloc of papers on a single topic. Our lab (which got stuck going first) decided to focus on a recent controversy over representations of uncertainty in the brain, namely: do neural responses represent parameters of or samples from probability distributions?  (I’ll try to unpack this distinction in a moment). These competing theories generated a lively and entertaining debate at the  Cosyne 2010 workshops, and we thought it would be fun to delve into some of the primary literature.

The two main competitors are:

1. “Probabilistic Population Codes” (PPC) – advocated by Ma, Beck, Pouget, Latham and colleagues and (more recently, in a related but not identical form), Jazayeri, Movshon, Graf and Kohn.
basic idea:  the log-probability distribution over stimuli is a linear combination of “kernels” (i.e., things that look kinda like tuning curves) weighted by neural spike counts. Each neuron has its own kernel, so the vector of population activity gives rise to a weighted sum of kernels that can have variable width, peak location, etc.  This log-linear representation of probabilities sits well with “Poisson-like” variability observed in cortex, and makes it easy to perform Bayesian inference (e.g., combine information from two different populations) using purely linear operations.
key paper:
• Ma et al, Bayesian inference with probabilistic population codes. Nature Neuroscience (2006)
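
To make the log-linear idea concrete, here’s a toy PPC decoder for independent Poisson neurons with Gaussian tuning curves (all numbers hypothetical, not from the paper); the point is that the log posterior over stimuli is linear in the spike counts:

```python
import numpy as np

# Toy PPC decoder: for independent Poisson neurons with tuning curves f_i(s),
#   log P(s | r) = sum_i r_i log f_i(s) - sum_i f_i(s) + const,
# i.e. a weighted sum of kernels log f_i, with spike counts r_i as weights.
s_grid = np.linspace(-10, 10, 201)        # candidate stimulus values
centers = np.linspace(-10, 10, 25)        # preferred stimuli (one per neuron)
gain, width = 10.0, 2.0                   # hypothetical tuning parameters
f = gain * np.exp(-0.5 * ((s_grid[None, :] - centers[:, None]) / width) ** 2)

# Simulate one population response to a true stimulus
rng = np.random.default_rng(1)
s_true = 2.0
rates = gain * np.exp(-0.5 * ((s_true - centers) / width) ** 2)
r = rng.poisson(rates)                    # observed spike counts

# Decoding is purely linear in r (up to the fixed -sum_i f_i(s) term)
logpost = r @ np.log(f) - f.sum(axis=0)
logpost -= logpost.max()                  # for numerical stability
post = np.exp(logpost)
post /= post.sum()
s_hat = s_grid[np.argmax(post)]           # posterior mode, near s_true
```

Combining two populations viewing the same stimulus just adds their spike-count vectors, which is the linear cue-combination property the theory emphasizes.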

2. “Sampling Hypothesis” – proposed by Fiser, Berkes, Orban & Lengyel.
basic idea: Holds that neurons represent stimulus features, i.e., “causes” underlying sensory stimuli, which the brain would like to extract. Each neuron represents a particular feature, and higher spiking corresponds to more of that feature in a particular image. In this scheme, probabilities are represented by the variability in neural responses themselves: neurons sample their spike count from the probability distribution over the presence of the corresponding feature. So for example, a neuron that emits 75 spikes in every time bin has high certainty that the corresponding feature is present; a neuron that emits 4 spikes in every time bin carries high certainty that the corresponding feature is not present in the image; a neuron with variable spike count ranging between 0 and 100 spikes in each bin represents a high level of uncertainty about the presence or absence of the corresponding feature. This scheme is better suited to representing high-dimensional probability distributions, and makes interesting predictions about learning and spontaneous activity.
key papers:

This week I presented the two (Fiser and Berkes) papers on the sampling hypothesis (slides: keynote, pdf). I have a few niggling complaints, which I may try to outline in a later post, but overall I think it’s a pretty cool idea and a very nice pair of papers. The idea that we should think about spontaneous activity as “sampling from the prior” seems interesting and original.
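
The certainty-as-variability story is easy to caricature in code. Here’s a toy sketch matching the three example neurons above (the numbers are made up, of course):

```python
import numpy as np

# Sampling hypothesis, cartoon version: each bin's spike count is a sample
# from the distribution over the neuron's feature, so across-bin variability
# (not firing rate) is what encodes uncertainty.
nbins = 5000
rng = np.random.default_rng(4)

present = np.full(nbins, 75)             # "surely present": no variability
absent = np.full(nbins, 4)               # "surely absent": no variability
uncertain = rng.choice([0, 100], nbins)  # maximally unsure: bimodal samples

# mean tracks the estimate; std tracks the uncertainty
means = {k: v.mean() for k, v in
         [("present", present), ("absent", absent), ("uncertain", uncertain)]}
stds = {k: v.std() for k, v in
        [("present", present), ("absent", absent), ("uncertain", uncertain)]}
```

Note that the uncertain neuron has an intermediate mean rate but enormous variance, whereas both certain neurons are (in this cartoon) perfectly repeatable.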

Who will ultimately win out?  It’s a contest between a group of wild and woolly Magyars (“Hungarians”, in the parlance of our times) and an international coalition of would-be cynics led by an irascible Frenchman (humanized only by a laconic Dutchman with philanthropic bona fides). Since neither group enjoys a reputation for martial triumph, this conflict may play out for a while.  But our 4-part series will wrap up next week with a paper from Graf et al (presented by Kenneth) that puts the PPC theory to the test with neural data from visual cortex.

# Lab meeting 6/20/2011

Today I presented a paper from Liam’s group:  “A new look at state-space models for neural data”, Paninski et al,  JCNS 2009

The paper presents a high-level overview of state-space models for neural data, with an emphasis on statistical inference methods.  The basic setup of these models is the following:

• Latent variable $Q$  defined by dynamics distribution:  $P(q_{t+1}|q_t)$
• Observed variable $Y$ defined by observation distribution: $P(y_t | q_t)$.

These two ingredients ensure that the joint probability of latents and observed variables is
$P(Q,Y) = P(q_1 ) P(y_1|q_1) \prod_{t=2}^T P(y_t | q_t) P(q_{t}|q_{t-1})$.
A variety of applications are illustrated (e.g., $Q$ = common input noise; $Y$ = multi-neuron spike trains).

The two problems we’re interested in solving, in general, are:
(1) Filtering / Smoothing:  inferring $Q$ from noisy observations $Y$, given the model parameters $\theta$.
(2) Parameter Fitting: inferring $\theta$ from observations $Y$.

The “standard” approach to these problems involves: (1) recursive approximate inference methods that involve updating a Gaussian approximation to $P(q_t|Y)$ using its first two moments; and (2) Expectation-Maximization (EM) for inferring $\theta$.  By contrast, this paper emphasizes: (1) exact maximization for $Q$, which is tractable in $O(T)$ via Newton’s Method, due to the banded nature of the Hessian; and (2) direct inference for $\theta$ using the Laplace approximation to $P(Y|\theta)$.  When the dynamics are linear and the noise is Gaussian, the two methods are exactly the same (since a Gaussian’s maximum is the same as its mean; the forward and backward recursions in Kalman Filtering/Smoothing are the same set of operations needed by Newton’s method). But for non-Gaussian noise or non-linear dynamics, the latter method may (the paper argues) provide much more accurate answers with approximately the same computational cost.

Key ideas of the paper are:

• exact maximization of a log-concave posterior
• $O(T)$ computational cost, due to sparse (tridiagonal or banded) Hessian.
• the Laplace approximation (Gaussian approximation to the posterior using its maximum and second-derivative matrix), which is (more likely to be) justified for log-concave posteriors
• log-boundary method for constrained problems (which preserves sparsity)
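
A minimal sketch of the exact-maximization idea, under assumed toy dynamics (Gaussian random-walk latents, Poisson observations with an exponential nonlinearity; none of these numbers come from the paper):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Toy model: latent q is a Gaussian random walk with tridiagonal inverse
# covariance Qinv; observations y_t | q_t ~ Poisson(exp(q_t)).  The log
# posterior is log-concave and its Hessian is tridiagonal, so each Newton
# step is a sparse banded solve: O(T) per iteration.
T, varQ = 300, 0.01
main = np.full(T, 2.0 / varQ)
main[0] = main[-1] = 1.0 / varQ
off = np.full(T - 1, -1.0 / varQ)
Qinv = sp.diags([off, main, off], [-1, 0, 1], format="csc")

rng = np.random.default_rng(2)
q_true = np.cumsum(rng.normal(0.0, np.sqrt(varQ), T))
y = rng.poisson(np.exp(q_true))

def logpost(q):  # log P(q, y) up to a constant
    return y @ q - np.exp(q).sum() - 0.5 * q @ (Qinv @ q)

q = np.zeros(T)
for _ in range(100):
    grad = y - np.exp(q) - Qinv @ q                # gradient of log posterior
    H = Qinv + sp.diags(np.exp(q), format="csc")   # minus Hessian: tridiagonal
    step = spla.spsolve(H, grad)                   # the O(T) banded solve
    t = 1.0
    while logpost(q + t * step) < logpost(q):      # crude backtracking
        t /= 2.0
        if t < 1e-10:
            break
    q = q + t * step
    if np.max(np.abs(t * step)) < 1e-8:            # converged
        break
# q now holds the MAP path; the Laplace approximation would then use
# N(q, H^{-1}) with this same sparse Hessian.
```

In the linear-Gaussian case this loop converges in one step and reproduces the Kalman smoother; with Poisson observations it stays O(T) but avoids the moment-matching approximations of the standard recursive approach.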

Next week: we’ll do a basic tutorial on Kalman Filtering / Smoothing (and perhaps, EM).

# NP Bayes Reading Group: 1st meeting

We’ve started a reading group to come to grips with some of the recent developments in non-parametric (NP) Bayesian modeling, in particular, hierarchical Bayesian models for discrete data.  The defining characteristic of NP models is that the number of parameters scales with the amount of data (leading to an infinite number of parameters in the limit of infinite data).  Although these models have sparked a mini-revolution in cognitive psychology (e.g., Tenenbaum, Griffiths & Kemp 2006), they do not appear to have found much application to statistical analysis of neural data (with the exception of spike sorting — see, e.g., Wood & Black 2008).

Our first assignment is to go through the slides from Michael Jordan’s 2005 NIPS tutorial (slides.ps).  Last week we began, and made it through slide #23, covering the basic ideas of non-parametric models, exchangeability, De Finetti’s theorem, conjugate priors, Gibbs sampling, graphical models, Dirichlet & Beta distributions.

A few issues set aside for further exploration:

• proof of De Finetti’s theorem (Evan)
• relationship between CRP and stick-breaking (JP)
• slide 13: “A short calculation shows…” (Joe)
• proof that # of occupied tables is O(log n).  (Memming)
• aggregation property (Ken)
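
One of these is easy to check empirically before the proof: under an assumed concentration parameter alpha, a quick CRP simulation shows the number of occupied tables growing like alpha * log(n) (a toy sketch, not from the slides):

```python
import numpy as np

# Chinese Restaurant Process: customer i starts a new table with probability
# alpha / (i + alpha), else joins an existing table with probability
# proportional to its occupancy.  E[# tables] = sum_{i=0}^{n-1} alpha/(i+alpha),
# which is O(log n); for n = 2000, alpha = 1 this is the harmonic
# number H_2000 = ln(2000) + 0.577... ≈ 8.2.
def crp_tables(n, alpha, rng):
    counts = []                                  # occupancy of each table
    for i in range(n):
        p = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(p), p=p / p.sum())    # last slot = new table
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(3)
n, alpha = 2000, 1.0
ntables = np.mean([crp_tables(n, alpha, rng) for _ in range(20)])
```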

Next meeting: today 3pm (Jun 17, 2011).  Memming to lead…

# Lab Meeting 3/30/11 (Wed)

Mijung will present the following paper this Wednesday during the lab meeting:

Sequential Optimal Design of Neurophysiology Experiments
Jeremy Lewi, Robert Butera, Liam Paninski
Neural Computation 21, 619-687 (2009) (pdf)

“Since this is a long paper (69 pages in total; I guess this is almost the same as the author’s PhD thesis), I will aim to summarize the math part and look at the simulation results, which will be chapters 1–5.”