Bayesian time-frequency representations

This week in lab meeting I presented

Robust spectrotemporal decomposition by iteratively reweighted least squares
Demba Ba, Behtash Babadi, Patrick L Purdon, and Emery N Brown. PNAS, 2014

BaFig1

In this paper, the authors proposed an algorithm for fast, robust estimation of a time-frequency representation.  The robustness properties were achieved by applying a group-sparsity prior across frequencies and a “fused LASSO” prior over time. However, the real innovation that they were able leverage was from an earlier paper, in which the authors proved that the MAP estimation problem could be solved by iteratively re-weighted least squares (IRLS), which turns out to be a version of the EM algorithm.

I’ve done some work on time-frequency representations in the past and often times an investigator expects to find signals that are narrow-band, but possibly amplitude-modulated, and are therefore limited in resolution by the time-frequency uncertainty principle.  However, when you formulate the problem of time-frequency representation as a Bayesian estimation problem, you can get a lot more bang for you buck.

The basic problem is to estimate the Fourier coefficients x_n at time step n for observations y_n described by

\bold{y}_n = \bold{F}_n \bold{x}_n + \bold{v}_n

where \bold{F}_n is the real-valued DFT matrix and \bold{v}\sim \mathcal{N}(0,\sigma^2).  The usual way of doing business is to let n be subsamples of the observation times and estimate x_n over some window centered on n. Smoothness in time is achieved by having a large overlap between neighboring windows.  Ba et. al take a different approach, which obviates the need for large overlap between windows, reducing the size of \bold{y}_n substantially for each time step.  The authors instead model the relationship between neighboring (in time) Fourier coefficients by

\bold{x}_n = \bold{x}_{n-1}+\bold{w}_n

where \bold{w}_n is drawn from some distribution p_W(w).

Thus, the trick is to bring prior information to bare in the least-squares problem, by specifying the structure of the solution that you expect through p_W(w).

The authors offered two possibilities:

 \log p_1(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}w^2_{n,k} + \epsilon^2\right)^{1/2} +c_1,

and

\log p_2(\bold{w}_n,\bold{w}_{n-1},\dots) = -\alpha\sum^K_{k=1}\left(\sum^N_{n=1}\left(w^2_{n,k} + \epsilon^2\right)^{1/2}\right)^{1/2} +c_2.

Now, I admit that these priors look kind of funny, but there is a good reason why these were the priors specified in this paper.  These priors are “epsilon-perturbed” \ell_\nu norms, given by

L(\bold{x}) = -\frac{1}{2}\sum^M_{i=1}(\epsilon^2+x_i^2)^{\nu/2}.

This”epsilon-perturbed” version of the \ell_\nu norms have rounded out corners in their level curves, like in the following picture from the authors’ earlier paper:

BaFig2

It turns out that distributions with this form come from the normal/independent family, commonly used for robust regression and which lend themselves to EM by IRLS!

Thus \log p_1 is approximately an \ell_1 prior over frequencies (indexed by k), and a \ell_2 prior over changes in x_n across time (indexed by n).  The latter can also be interpreted as a discrete version of the total variation norm.  The result is that \log p_1 biases the solutions toward things that are sparse in frequency but smooth in time.  Alternatively, \log p_2 is like a \ell_1 norm over frequencies, and a \ell_1 norm on transitions, also called a “fused LASSO”.  This prior will bias solutions toward things that are sparse infrequency and change abruptly in time.

The full MAP estimate of \bold{x}_n can be found by IRLS, where the algorithm ends up being repeated application of the Kalman filtering equations forward in time followed by the Kalman smoother equations backward in time until convergence. The results for a simulated example of a 10Hz and 11Hz component, where one has amplitude that is ramping up exponentially fast, and the other is pulsing, is shown in the Figure at the top of the page.  An example using real EEG data looks like this:

BaFig3

The authors didn’t really provide any validation other than the eye-ball test, and it wasn’t clear how to evaluate performance of the method for real data sets.  However, the results do look pretty good for the test cases they illustrated. I also would have liked to get more details about choosing window width and frequency resolution for data analysis.

I think that the application of the IRLS algorithm for solving this problem is really clever and there are a lot of possibilities for generalizing to different priors that, for example, accommodate frequency modulation. I hope to see some follow-up papers were they use the method to do some fun data analysis and maybe flesh out these details a bit more.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s