In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models, by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for performing maximum marginal likelihood (or “evidence optimization”) in a sparse linear model (also known as the Relevance vector machine).

The basic setup is:

and an independent zero-mean Gaussian prior on each element of :

,

where denotes the prior precision of each weight.

The basic idea is to set the hyperparameters by maximizing marginal likelihood . The resulting estimate for under the prior will be sparse because many of the will go to infinity (meaning prior variance for the corresponding weights is set to zero).

The OLD way to optimize this model (“ARD”) is to update hyperparameters and iteratively, but the draw back is the initial vector would be a huge, and every matrix operation would be time consuming with large scale data. When an reaches infinity, it’s abandoned and vector has fewer and fewer elements. Thus it’s a reduction process and any abandoned would not be re-estimated.

In contrast, THIS paper proposed an adding and modifying process. It initializes with an empty model and sequentially adds basis functions (i.e., sets ‘s to something finite) to increase the marginal likelihood. Within the same framework, it can increase the objective function by deleting functions which subsequently become redundant.

The key to this approach is to separate each from all other hyperparameters, labelled ““. The authors analyze to show that has a unique maximum with respect to . Then they update given:

where and are statistics needed for computing . Let , then

If and , **re-estimate** ;

If and , **add** to the model with updated ;

If and , **delete** from the model and set ;

After updating , it recomputes and (covariance and mean of posterior of ) and all related statistics then continue until convergence.

The paper shows that this algorithm achieves similar MSE performance and finds an even sparser estimate for in less time than the old method. It’s quite clever to use the separable property of marginal likelihood to optimize the sequentially. It could also be borrowed to solve similar problems with independent vector.