Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares
Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014 In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

I’ve done some work on time-frequency representations in the past and often times an investigator expects to find signals that are narrow-band, but possibly amplitude-modulated, and are therefore limited in resolution by the time-frequency uncertainty principle.  However, when you formulate the problem of time-frequency representation as a Bayesian estimation problem, you can get a lot more bang for you buck.

The basic problem is to estimate the Fourier coefficients $x_n$ at time step $n$ for observations $y_n$ described by $\bold{y}_n = \bold{F}_n \bold{x}_n + \bold{v}_n$

where $\bold{F}_n$ is the real-valued DFT matrix and $\bold{v}\sim \mathcal{N}(0,\sigma^2)$.  The usual way of doing business is to let $n$ be subsamples of the observation times and estimate $x_n$ over some window centered on $n$. Smoothness in time is achieved by having a large overlap between neighboring windows.  Ba et. al take a different approach, which obviates the need for large overlap between windows, reducing the size of $\bold{y}_n$ substantially for each time step.  The authors instead model the relationship between neighboring (in time) Fourier coefficients by $\bold{x}_n = \bold{x}_{n-1}+\bold{w}_n$

where $\bold{w}_n$ is drawn from some distribution $p_W(w)$.

Thus, the trick is to bring prior information to bare in the least-squares problem, by specifying the structure of the solution that you expect through $p_W(w)$.

The authors offered two possibilities: $\log p_1(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}w^2_{n,k} + \epsilon^2\right)^{1/2} +c_1$,

and $\log p_2(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}\left(w^2_{n,k} + \epsilon^2\right)^{1/2}\right)^{1/2} +c_2$.

Now, I admit that these priors look kind of funny, but there is a good reason why these were the priors specified in this paper.  These priors are “epsilon-perturbed” $\ell_\nu$ norms, given by $L(\bold{x}) = -\frac{1}{2}\sum^M_{i=1}(\epsilon^2+x_i^2)^{\nu/2}$.

This”epsilon-perturbed” version of the $\ell_\nu$ norms have rounded out corners in their level curves, like in the following picture from the authors’ earlier paper: It turns out that distributions with this form come from the normal/independent family, commonly used for robust regression and which lend themselves to EM by IRLS!

Thus $\log p_1$ is approximately an $\ell_1$ prior over frequencies (indexed by $k$), and a $\ell_2$ prior over changes in $x_n$ across time (indexed by $n$).  The latter can also be interpreted as a discrete version of the total variation norm.  The result is that $\log p_1$ biases the solutions toward things that are sparse in frequency but smooth in time.  Alternatively, $\log p_2$ is like a $\ell_1$ norm over frequencies, and a $\ell_1$ norm on transitions, also called a “fused LASSO”.  This prior will bias solutions toward things that are sparse infrequency and change abruptly in time.

The full MAP estimate of $\bold{x}_n$ can be found by IRLS, where the algorithm ends up being repeated application of the Kalman filtering equations forward in time followed by the Kalman smoother equations backward in time until convergence. The results for a simulated example of a 10Hz and 11Hz component, where one has amplitude that is ramping up exponentially fast, and the other is pulsing, is shown in the Figure at the top of the page.  An example using real EEG data looks like this: The authors didn’t really provide any validation other than the eye-ball test, and it wasn’t clear how to evaluate performance of the method for real data sets.  However, the results do look pretty good for the test cases they illustrated. I also would have liked to get more details about choosing window width and frequency resolution for data analysis.

I think that the application of the IRLS algorithm for solving this problem is really clever and there are a lot of possibilities for generalizing to different priors that, for example, accommodate frequency modulation. I hope to see some follow-up papers were they use the method to do some fun data analysis and maybe flesh out these details a bit more.