# Fast Marginal Likelihood Maximisation for Sparse Bayesian Models

In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models,” by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for maximum marginal likelihood estimation (also called “evidence optimization”) in a sparse linear model, also known as the Relevance Vector Machine (RVM).

The basic setup is: $\quad y = \phi(x)^T w + \epsilon, \quad$ $\epsilon \sim \mathcal{N}(0,\sigma^2)$

and an independent zero-mean Gaussian prior on each element of $w$: $w_i \sim \mathcal{N}(0,1/\alpha_i)$,

where $\alpha_i$ denotes the prior precision of each weight.
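For fixed $\{\alpha_i\}$ and $\sigma^2$, the posterior over $w$ is Gaussian, with covariance $\Sigma = (\sigma^{-2}\Phi^T\Phi + A)^{-1}$ and mean $\mu = \sigma^{-2}\Sigma\Phi^T y$, where $A = \mathrm{diag}(\alpha_1,\dots,\alpha_M)$ and $\Phi$ is the design matrix. A minimal NumPy sketch (function and variable names are my own):

```python
import numpy as np

def posterior_moments(Phi, y, alpha, sigma2):
    """Mean and covariance of P(w | y, alpha, sigma2) for the linear-Gaussian model:
    Sigma = (Phi^T Phi / sigma2 + diag(alpha))^{-1},  mu = Sigma Phi^T y / sigma2."""
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ y / sigma2
    return mu, Sigma
```

As $\alpha_i \to \infty$, the corresponding row and column of $\Sigma$ and the entry of $\mu$ shrink to zero, which is exactly the pruning mechanism described below.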

The basic idea is to set the hyperparameters $\{\alpha_i\}$ by maximizing the marginal likelihood $P(Y|X,\{\alpha_i\})$. The resulting estimate for $w$ under the prior $P(w | \{\alpha_i\})$ is sparse because many of the $\alpha_i$ go to infinity, meaning the prior variance of the corresponding weights is set to zero and those weights are pruned from the model.
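The objective being maximized is the log marginal likelihood $L(\alpha) = -\tfrac12\left[N\log 2\pi + \log|C| + y^T C^{-1} y\right]$ with $C = \sigma^2 I + \Phi A^{-1}\Phi^T$ and $A = \mathrm{diag}(\alpha)$. A direct (non-fast) evaluation for reference, with names of my own choosing:

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, sigma2):
    """log P(y | X, alpha, sigma2): Gaussian with covariance
    C = sigma2*I + Phi diag(1/alpha) Phi^T."""
    N = len(y)
    C = sigma2 * np.eye(N) + (Phi * (1.0 / alpha)) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))
```

This direct evaluation costs $O(N^3)$ per call; the point of the paper is to maximize it without ever forming $C$ for the full basis set.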

The OLD way to optimize this model (“ARD”) updates the hyperparameters $\alpha_i$ and $\sigma^2$ iteratively, starting with all basis functions in the model. The drawback is that the initial $\alpha$ vector is huge, so every matrix operation is expensive for large-scale data. When an $\alpha_i$ reaches infinity, its basis function is abandoned and the $\alpha$ vector has fewer and fewer elements. It is thus a purely destructive process: an abandoned $\alpha_i$ is never re-estimated.

In contrast, THIS paper proposes a constructive process of adding and modifying. It initializes with an empty model and sequentially adds basis functions (i.e., sets some $\alpha_i$ to a finite value) to increase the marginal likelihood. Within the same framework, it can also increase the objective by deleting basis functions that subsequently become redundant.

The key to this approach is to separate each $\alpha_i$ from all the other hyperparameters, denoted $\alpha_{-i}$. The authors show that the marginal likelihood decomposes as $L(\alpha) = L(\alpha_{-i}) + l(\alpha_i)$, and that $l(\alpha_i)$, and hence $L(\alpha)$, has a unique maximum with respect to $\alpha_i$, attained at $\alpha_i=\frac{s_i^2}{q_i^2-s_i}$ if $q_i^2>s_i$, and at $\alpha_i=\infty$ if $q_i^2\le s_i$,
where $s_i$ (the “sparsity factor”) measures how much $\phi_i$ overlaps with the basis functions already in the model, and $q_i$ (the “quality factor”) measures how well $\phi_i$ aligns with $y$. Let $\theta_i=q_i^2-s_i$; then:
- If $\theta_i>0$ and $\alpha_i<\infty$, re-estimate $\alpha_i$;
- If $\theta_i>0$ and $\alpha_i=\infty$, add $\phi_i$ to the model with the updated $\alpha_i$;
- If $\theta_i\le 0$ and $\alpha_i<\infty$, delete $\phi_i$ from the model and set $\alpha_i=\infty$.
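In the paper, $s_i$ and $q_i$ are maintained cheaply via rank-one updates; as a reference, they can also be computed directly from their definitions $s_i=\phi_i^T C_{-i}^{-1}\phi_i$ and $q_i=\phi_i^T C_{-i}^{-1} y$, where $C_{-i}$ is the covariance of $y$ under the model that excludes $\phi_i$. A minimal sketch (function and variable names are mine):

```python
import numpy as np

def sq_factors(phi_i, Phi_in, alpha_in, y, sigma2):
    """Sparsity factor s_i = phi_i^T C^{-1} phi_i and quality factor
    q_i = phi_i^T C^{-1} y, where C is built from the in-model bases
    *excluding* phi_i (pass only the other in-model columns)."""
    N = len(y)
    C = sigma2 * np.eye(N)               # empty model: C = sigma2 * I
    if Phi_in.size:
        C += (Phi_in * (1.0 / alpha_in)) @ Phi_in.T
    Cinv_phi = np.linalg.solve(C, phi_i)
    return phi_i @ Cinv_phi, y @ Cinv_phi
```

For a basis function not yet in the model, $C_{-i}$ is just the current model covariance, which is why adding and re-estimating can share the same statistics.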

After each update of $\alpha_i$, the algorithm recomputes $\Sigma$ and $\mu$ (the posterior covariance and mean of $w$) and all related statistics, and continues until convergence.
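Putting the pieces together, the loop can be sketched end-to-end as follows. This is a naive reference version, assuming a fixed $\sigma^2$ and recomputing $C$ from scratch at every step at $O(N^3)$ cost, whereas the paper's contribution is precisely to maintain these statistics incrementally; all names are mine:

```python
import numpy as np

def sequential_sbl(Phi, y, sigma2, max_passes=50):
    """Naive sketch of the constructive sparse Bayesian learning loop.
    Returns the precision vector alpha; np.inf marks pruned basis functions."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)               # start from an empty model
    for _ in range(max_passes):
        changed = False
        for i in range(M):
            in_model = np.isfinite(alpha)
            C = sigma2 * np.eye(N)           # C = sigma2*I + Phi_in A^{-1} Phi_in^T
            if in_model.any():
                P = Phi[:, in_model]
                C += (P * (1.0 / alpha[in_model])) @ P.T
            Cinv_phi = np.linalg.solve(C, Phi[:, i])
            S = Phi[:, i] @ Cinv_phi         # computed with phi_i included (if it is)
            Q = y @ Cinv_phi
            if np.isfinite(alpha[i]):        # back out phi_i's own contribution
                s = alpha[i] * S / (alpha[i] - S)
                q = alpha[i] * Q / (alpha[i] - S)
            else:
                s, q = S, Q
            theta = q * q - s
            new_alpha = s * s / theta if theta > 0 else np.inf
            if not np.isclose(new_alpha, alpha[i]):
                changed = True               # a re-estimate, add, or delete happened
            alpha[i] = new_alpha
        if not changed:                      # full pass with no change: converged
            break
    return alpha
```

On a toy problem where $y$ is generated from a single column of $\Phi$, this sketch adds that column and prunes the rest, illustrating the add/re-estimate/delete cases in one loop.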

The paper shows that this algorithm achieves similar MSE and finds an even sparser estimate of $w$ in less time than the old method. It is quite clever to exploit the separability of the marginal likelihood to optimize the $\alpha_i$ sequentially, and the same trick could be borrowed for similar problems with independent hyperparameters on each weight.