Fast Marginal Likelihood Maximisation for Sparse Bayesian Models

In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models” by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for maximum marginal likelihood (or “evidence optimization”) learning in a sparse linear model, also known as the relevance vector machine (RVM).

The basic setup is:

\quad y = \phi(x)^T w + \epsilon, \quad  \epsilon \sim \mathcal{N}(0,\sigma^2)

and an independent zero-mean Gaussian prior on each element of w:

w_i \sim \mathcal{N}(0,1/\alpha_i),

where \alpha_i denotes the prior precision of each weight.
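For reference, stacking the N data points gives y = \Phi w + \epsilon, where \Phi = [\phi_1, \ldots, \phi_M] is the N \times M design matrix whose column \phi_i holds basis function i evaluated at every input, and A = \mbox{diag}(\alpha_1, \ldots, \alpha_M). The posterior over w given the data and hyperparameters is then Gaussian, with the covariance \Sigma and mean \mu that reappear below:

\Sigma = (A + \sigma^{-2}\Phi^T\Phi)^{-1}, \quad \mu = \sigma^{-2}\Sigma\Phi^T y.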

The basic idea is to set the hyperparameters \{\alpha_i\} by maximizing the marginal likelihood P(y|X,\{\alpha_i\}). The resulting estimate of w under the prior P(w | \{\alpha_i\}) is sparse, because many of the \alpha_i are driven to infinity, which shrinks the prior variance of the corresponding weights to zero and prunes those weights from the model.
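Concretely, marginalizing over w under this prior gives a zero-mean Gaussian marginal likelihood, so the objective being maximized is

L(\alpha) = \log P(y|X,\{\alpha_i\}) = -\frac{1}{2}\left[ N\log 2\pi + \log|C| + y^T C^{-1} y \right], \quad C = \sigma^2 I + \Phi A^{-1}\Phi^T.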

The OLD way to optimize this model (“ARD”, as used in the original RVM) is to update the hyperparameters \alpha_i and \sigma^2 iteratively, starting with all M basis functions in the model. The drawback is that the initial \alpha vector is huge, so every matrix operation is expensive for large-scale data. When an \alpha_i reaches infinity, the corresponding basis function is abandoned and the \alpha vector has fewer and fewer elements. It is thus a purely reductive process: once abandoned, an \alpha_i is never re-estimated.
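For reference, the re-estimation updates used in the original RVM (Tipping, 2001) are

\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \quad \gamma_i = 1 - \alpha_i\Sigma_{ii}, \quad (\sigma^2)^{new} = \frac{\|y - \Phi\mu\|^2}{N - \sum_i \gamma_i},

and each iteration recomputes \Sigma and \mu over all basis functions that have not yet been pruned, which is what makes the method slow when M is large.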

In contrast, THIS paper proposes an incremental adding-and-modifying process. It initializes with an (almost) empty model and sequentially adds basis functions (i.e., sets individual \alpha_i‘s to finite values) to increase the marginal likelihood. Within the same framework, it can also increase the objective by deleting basis functions that subsequently become redundant.

The key to this approach is to separate the dependence on each \alpha_i from the dependence on all the other hyperparameters, labelled \alpha_{-i}, so that L(\alpha) = L(\alpha_{-i}) + \ell(\alpha_i). The authors analyze \ell(\alpha_i) to show that L(\alpha) has a unique maximum with respect to \alpha_i, attained at
\alpha_i=\frac{s_i^2}{q_i^2-s_i},\mbox{ if }q_i^2>s_i
\alpha_i=\infty,\mbox{ if }q_i^2\le s_i
where s_i and q_i are per-basis statistics (defined below the decision rules). Let \theta_i=q_i^2-s_i; then:
If \theta_i > 0 and \alpha_i < \infty, re-estimate \alpha_i;
If \theta_i > 0 and \alpha_i = \infty, add \phi_i to the model with the updated \alpha_i;
If \theta_i \le 0 and \alpha_i < \infty, delete \phi_i from the model and set \alpha_i = \infty.
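Here s_i and q_i are the “sparsity factor” and “quality factor” of \phi_i, computed as if \phi_i were excluded from the model. In terms of the full-model quantities S_i = \phi_i^T C^{-1}\phi_i and Q_i = \phi_i^T C^{-1} y, they are

s_i = \frac{\alpha_i S_i}{\alpha_i - S_i}, \quad q_i = \frac{\alpha_i Q_i}{\alpha_i - S_i},

which reduce to s_i = S_i and q_i = Q_i when \alpha_i = \infty (i.e., when \phi_i is not currently in the model). Roughly, s_i measures how much \phi_i overlaps with the basis functions already included, and q_i measures how well it aligns with the remaining error in y.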

After each \alpha_i update, the algorithm recomputes \Sigma and \mu (the covariance and mean of the posterior over w) together with all related statistics, and the process continues until convergence.
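To make the loop concrete, here is a minimal NumPy sketch of the sequential procedure. It keeps the noise variance fixed (the paper also re-estimates \sigma^2), selects the candidate basis function at random rather than by the largest marginal-likelihood increase, and recomputes \Sigma and \mu from scratch instead of using the paper’s efficient incremental updates; the names (fast_rvm, Phi, sigma2, n_iter) are my own, not from the paper.

import numpy as np

def fast_rvm(Phi, y, sigma2=0.01, n_iter=500, seed=0):
    """Sketch of sequential sparse Bayesian learning with a fixed noise variance."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    beta = 1.0 / sigma2                          # noise precision
    alpha = np.full(M, np.inf)                   # all basis functions start excluded

    # Initialise with the basis function whose normalized projection onto y is largest.
    col_sq = np.einsum('nm,nm->m', Phi, Phi)     # squared column norms ||phi_i||^2
    proj = (Phi.T @ y) ** 2 / col_sq
    i0 = int(np.argmax(proj))
    alpha[i0] = col_sq[i0] / max(proj[i0] - sigma2, 1e-12)

    for _ in range(n_iter):
        in_model = np.isfinite(alpha)
        active = np.flatnonzero(in_model)
        Phi_a = Phi[:, active]

        # Posterior over the currently active weights.
        Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * Phi_a.T @ Phi_a)
        mu = beta * Sigma @ Phi_a.T @ y

        # Full-model statistics S_i, Q_i for every candidate (Woodbury form of C^{-1}).
        G = Phi.T @ Phi_a                        # cross-products with the active set
        S = beta * col_sq - beta**2 * np.einsum('ma,ab,mb->m', G, Sigma, G)
        Q = beta * Phi.T @ (y - Phi_a @ mu)

        # Exclude-i statistics s_i, q_i (equal to S_i, Q_i when alpha_i is infinite).
        s, q = S.copy(), Q.copy()
        d = alpha[in_model] - S[in_model]
        s[in_model] = alpha[in_model] * S[in_model] / d
        q[in_model] = alpha[in_model] * Q[in_model] / d

        # Pick one candidate at random and apply the add / re-estimate / delete rules.
        theta = q**2 - s
        i = rng.integers(M)
        if theta[i] > 0:
            alpha[i] = s[i]**2 / theta[i]
        elif in_model[i] and in_model.sum() > 1:
            alpha[i] = np.inf

    # Posterior mean over the retained weights.
    active = np.flatnonzero(np.isfinite(alpha))
    Phi_a = Phi[:, active]
    Sigma = np.linalg.inv(np.diag(alpha[active]) + beta * Phi_a.T @ Phi_a)
    mu = beta * Sigma @ Phi_a.T @ y
    return active, mu, alpha

# Example usage on synthetic data (purely illustrative):
# rng = np.random.default_rng(1)
# Phi = rng.standard_normal((100, 50))
# w_true = np.zeros(50); w_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
# y = Phi @ w_true + 0.1 * rng.standard_normal(100)
# active, mu, alpha = fast_rvm(Phi, y, sigma2=0.01)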

The paper shows that this algorithm achieves MSE performance similar to the old method while finding an even sparser estimate of w in much less time. It is quite clever to exploit the separable structure of the marginal likelihood to optimize the \alpha_i sequentially, and the same trick could be borrowed for similar problems with an independent \alpha per parameter.
