# Discussion of Fox & Friends, “Bayesian Structure Learning for Stationary Time Series”

[latexpage]

Last week in lab meeting I discussed a paper from Emily Fox’s group [TFF15], Bayesian Structure Learning for Stationary Time Series.

If you’ve read Carvalho and Scott’s paper on Bayesian inference for graphical models [CS09], then you’ll find [TFF15] a natural extension to [CS09] and reading the latter is surely the best way to get your head wrapped around the former.

From my view [TFF15] makes 2 main contributions:
1) Formulating the graph inference problem in terms of time series by borrowing from [CS09]
2) Formulating the problem in the Fourier domain, making the problem tractable.

On the way to making the problem properly specified, [TFF15] introduced a new conjugate prior on complex normal distributions defined on graphs, the hyper- complex inverse Wishart distribution.

To give you a little background, [CS09] used an algorithm (FINCS, originally developed in [CS08]) for inferring a undirected graph that defines conditional dependence between random variables when

1. variables are Gaussian distributed
2. observations are independent
3. the admissible graphs are “decomposable,” which mean that the graph can be unambiguously described by a bunch of cliques and their intersecting nodes (called “separators”).

Solving the problem in these terms provides some convenient simplifications.

The first simplification, owing to the variables being Gaussian, is a well-known result by Dempster [Demp72],

Suppose $X = (X_1,\dots,X_p)^T\sim \mathcal{N}(0,\Sigma)$, and $Z_{ij}\equiv \{\text{all } X_k, k\in (1,\dots,p)| k\neq i,j\}$. Then $[\Sigma]_{ij}^{-1} = 0$ implies $P(X_i,X_j|Z_{ij})=P(X_i|Z_{ij})P(X_j|Z_{ij})$.

In other words, $X_i$ and $X_j$ are independent, conditionally on all other variables, when the inverse covariance matrix has a 0 in the $i,j^\text{th}$ element. Thus, the problem of evaluating the likelihood of a graph becomes the problem of evaluating the likelihood of a covariance that satisfies the corresponding pattern of 0’s in $\Sigma^{-1}$.

The simplification relies upon the restriction of searching over decomposable graphs on more way than one. Not only does FINCS relyupon this assumption but decomposability provides for a convenient factorization of the prior. Conditioned on the graph $G$, the prior can be decomposed in terms of cliques $\mathcal{C}$ and separators $\mathcal{S}$: $p(\Sigma|G) = \frac{\prod_{C\in\mathcal{C}}p(\Sigma_C|\delta,W_C)}{\prod_{S\in\mathcal{S}}p(\Sigma_S|delta,W_S)},$

where $(b, D)$ are the parameters of the hyper- inverse Wishart distribution; the conjugate prior for the covariance parameters of a Gaussian distribution that obeys the conditional independence laws defined by the graph. This dramatically reduces the complexity since operations need only be performed on covariance matrices that are the size of the cliques and separators, and not on the full dimensionality of the observations.

The snag that kept [TFF15] from using [CS09] more or less out of the box is that time series do not have independent observations, except in the trivial case (negating any simplification afforded by assumption (2), in the above list).  Thus, the likelihood over times series does not factorize over observations, thereby requiring us to contend with potentially very large covariance matrices. The key insight that [TFF15] made is that moving to the Fourier domain allowed them to make use of the Whittle approximation, which states that Fourier coefficients of stationary times series are asymptotically independent. This is a clever trick that allowed [TFF15] to operate as if the observations $d$ (now indexed by frequency rather than time) were independent. The one wrinkle is that the observations were now complex.

The complex likelihood required [TFF15] to derive a new hyper-Markov conjugate prior over the covariance matrices (or rather, spectral density matrics $S$), following the work of [DaLa93]. The result was a complex generalization of the hyper- inverse Wishart distribution, aptly named, the “hyper- complex inverse Wishart” (the “hyper-” part denotes that the prior obeys conditional independence laws that are specified by the graph).

After specifying the prior over the spectral densities, [TFF15] specify a hyperprior over the graph, which determines the number of edges $k$.  They also go further in placing a Beta $(a,b)$ prior over the probability of an edge.  The full graphical model is schematized in the left panel below.  It turns out you can marginalize out both the probability of edges $r$ and the spectral density of cliques and separators so that the final model is greatly simplified (right panel above) while still having the flexibility of a proper Bayesian formulation.

You can check out the analysis results for yourselves but I think that they look pretty awesome and make a great argument in favor of their approach.

Now, I’m not an expert on graph theory but I think that the most dubious of the assumptions that [TFF15] inherited from [CS09] is the requirement that all admissible graphs be decomposable. Its not clear to me how (or if) this is a very strong restriction and I’m curious how the algorithm adapts when the model is misspecified. When does it matter? How could this kind of bias influence scientific conclusions drawn from the inferred graphs? I’m also interested in how one might leverage these techniques to infer directed graphs.

If others have intuition about these points then I’m curious to know.

References:

[TFF15] Tank A, Foti NJ, and Fox E. Bayesian Structure Learning for Stationary Time Series. Uncertainty in Artificial Intelligence (UAI), 2015.

[CS09] C. M. Carvalho and J. G. Scott. Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96(3):497?512, 2009

[CS08] C. M. Carvalho and J. G. Scott. Feature-inclusion stochastic search for Gaussian graphical models. J. Computational and Graphical Statistics, 17(4):790-808, 2008

[Demp72] Dempster, A. Covariance selection. Bometics, 28, 157-175

[DaLa93] A. P. Dawid and S. L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist., 21(3):1272–1317, 1993.

# How can single neurons predict behavior?

Pitkow et al., Neuron, 87, 411-423, 2015

A couple of weeks ago I presented Xaq Pitkow et al.’s paper examining the convoluted relationship between choice probabilities (CP), information-limiting correlations (ILC), and suboptimal coding.

# Every Neuron is Special

A couple of weeks ago I presented

A category-free neural population supports evolving demands during decision-making

by David Raposo, Matthew Kaufman and Anne Churchland.  By “categories” they are referring to some population of cells whose responses during an experiment seem to be dominated by one or two of the experimental variables. The authors refer to these types of categories as functional categories.

# Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares
Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014 In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.