In this week’s lab meeting, we read “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models” by Tipping & Faul (2003). This paper proposes a highly accelerated algorithm for maximum marginal likelihood estimation (or “evidence optimization”) in a sparse Bayesian linear model, also known as the relevance vector machine (RVM).
The basic setup is a linear model
$y = \Phi w + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I),$
and an independent zero-mean Gaussian prior on each element of $w$:
$w_i \sim \mathcal{N}(0, \alpha_i^{-1}),$
where $\alpha_i$ denotes the prior precision of each weight.
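To make the setup concrete, here is a minimal sketch (in Python/NumPy) of simulating data from this kind of model; the dimensions, basis matrix, true weights, and noise level below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 100, 50                    # number of samples and basis functions (illustrative)
Phi = rng.normal(size=(N, M))     # design/basis matrix (assumed, not from the paper)

# A sparse "true" weight vector: most entries are exactly zero
w_true = np.zeros(M)
w_true[[3, 17, 31]] = [2.0, -1.5, 0.8]

sigma2 = 0.1                      # noise variance
y = Phi @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)
```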
The basic idea is to set the hyperparameters $\alpha = (\alpha_1, \ldots, \alpha_M)$ by maximizing the marginal likelihood
$p(y \mid \alpha, \sigma^2)$. The resulting estimate for $w$ under the prior $p(w \mid \alpha)$ will be sparse because many of the $\alpha_i$ will go to infinity (meaning the prior variance for the corresponding weights is set to zero).
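For reference, the objective being maximized is the log marginal likelihood obtained by integrating out $w$; writing $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_M)$, it takes the standard form (stated here from memory of the paper rather than quoted):

$\mathcal{L}(\alpha) = \log p(y \mid \alpha, \sigma^2) = -\tfrac{1}{2}\big[\,N \log 2\pi + \log |C| + y^\top C^{-1} y\,\big], \qquad C = \sigma^2 I + \Phi A^{-1} \Phi^\top.$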
The OLD way to optimize this model (“ARD”) is to update the hyperparameters $\alpha$ and the noise variance $\sigma^2$ iteratively, but the drawback is that the initial $\alpha$ vector is huge (one element per basis function), so every matrix operation is time-consuming with large-scale data. When an $\alpha_i$ reaches infinity, its basis function is abandoned and the $\alpha$ vector has fewer and fewer elements. Thus it is a purely reductive process, and any abandoned $\alpha_i$ is never re-estimated.
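As a rough sketch of this older approach (not the paper’s code; this uses the MacKay-style fixed-point updates from Tipping’s earlier RVM work, with the noise variance held fixed for simplicity):

```python
import numpy as np

def ard_fixed_point(Phi, y, sigma2, n_iter=200, alpha_init=1.0, prune_at=1e12):
    """Classic ARD: start with ALL basis functions and iteratively update every alpha_i.
    Basis functions whose alpha_i blows up are pruned and never revisited."""
    N, M = Phi.shape
    alpha = np.full(M, alpha_init)
    active = np.arange(M)                          # indices still in the model
    for _ in range(n_iter):
        Phi_a = Phi[:, active]
        A = np.diag(alpha[active])
        # Posterior over active weights: Sigma = (A + Phi^T Phi / sigma2)^-1, mu = Sigma Phi^T y / sigma2
        Sigma = np.linalg.inv(A + Phi_a.T @ Phi_a / sigma2)
        mu = Sigma @ Phi_a.T @ y / sigma2
        # MacKay-style update: alpha_i <- gamma_i / mu_i^2, with gamma_i = 1 - alpha_i * Sigma_ii
        gamma = 1.0 - alpha[active] * np.diag(Sigma)
        alpha[active] = gamma / (mu ** 2 + 1e-12)
        # Prune basis functions whose precision has effectively diverged
        active = active[alpha[active] < prune_at]
    return alpha, active
```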
In contrast, THIS paper proposes an incremental adding-and-modifying process. It initializes with an empty model and sequentially adds basis functions (i.e., sets $\alpha_i$’s to something finite) to increase the marginal likelihood. Within the same framework, it can also increase the objective function by deleting basis functions that subsequently become redundant.
The key to this approach is to separate each $\alpha_i$ from all the other hyperparameters, labelled “$\alpha_{-i}$”. The authors analyze $\ell(\alpha_i)$, the part of the log marginal likelihood that depends on $\alpha_i$ alone, to show that $\mathcal{L}(\alpha)$ has a unique maximum with respect to $\alpha_i$. Then they update $\alpha_i$ given:

$\alpha_i = \dfrac{s_i^2}{q_i^2 - s_i} \;\; \text{if } q_i^2 > s_i, \qquad \alpha_i = \infty \;\; \text{otherwise},$

where $s_i$ and $q_i$ are statistics needed for computing $\ell(\alpha_i)$.
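If I recall the paper’s definitions correctly, $s_i$ (the “sparsity factor”) and $q_i$ (the “quality factor”) are given by

$s_i = \phi_i^\top C_{-i}^{-1} \phi_i, \qquad q_i = \phi_i^\top C_{-i}^{-1} y, \qquad C_{-i} = \sigma^2 I + \sum_{m \neq i} \alpha_m^{-1} \phi_m \phi_m^\top,$

where $\phi_i$ is the $i$-th basis function (column of $\Phi$) and $C_{-i}$ is the marginal covariance of $y$ with basis function $i$ excluded.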
. Let
, then
If and
, re-estimate
;
If and
, add
to the model with updated
;
If and
, delete
from the model and set
;
After updating $\alpha_i$, the algorithm recomputes $\Sigma$ and $\mu$ (the covariance and mean of the posterior over $w$) and all related statistics, then continues until convergence.
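Putting these pieces together, here is a minimal and deliberately naive Python sketch of the sequential algorithm. It recomputes $\Sigma$, $\mu$, $s_i$, and $q_i$ from scratch each iteration and keeps the noise variance fixed, whereas the paper maintains these statistics with efficient incremental updates; candidate selection here is a simple cyclic sweep rather than the paper’s greedy choice, so treat it as an illustration of the update rules, not the authors’ implementation.

```python
import numpy as np

def fast_marginal_likelihood(Phi, y, sigma2, n_iter=500):
    """Naive sketch of sequential sparse Bayesian learning (fixed noise variance).
    alpha_i = inf means basis function i is currently out of the model."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)

    # Initialize with the single basis function best aligned with y (a common choice)
    phi_norm2 = np.sum(Phi ** 2, axis=0)
    j = int(np.argmax((Phi.T @ y) ** 2 / phi_norm2))
    proj = (Phi[:, j] @ y) ** 2 / phi_norm2[j]
    alpha[j] = phi_norm2[j] / max(proj - sigma2, 1e-8)

    for it in range(n_iter):
        i = it % M                                   # visit candidates cyclically (simplification)
        active = np.isfinite(alpha)
        Phi_a = Phi[:, active]
        # Posterior over the currently active weights
        Sigma = np.linalg.inv(np.diag(alpha[active]) + Phi_a.T @ Phi_a / sigma2)
        mu = Sigma @ Phi_a.T @ y / sigma2
        # "Full" statistics S_i = phi_i^T C^-1 phi_i and Q_i = phi_i^T C^-1 y, via Woodbury
        phi_i = Phi[:, i]
        C_inv_phi = phi_i / sigma2 - (Phi_a / sigma2) @ (Sigma @ (Phi_a.T @ phi_i / sigma2))
        S_i = phi_i @ C_inv_phi
        Q_i = C_inv_phi @ y
        # Convert to leave-one-out sparsity/quality factors s_i, q_i when phi_i is in the model
        if np.isfinite(alpha[i]):
            s_i = alpha[i] * S_i / (alpha[i] - S_i)
            q_i = alpha[i] * Q_i / (alpha[i] - S_i)
        else:
            s_i, q_i = S_i, Q_i
        theta_i = q_i ** 2 - s_i
        if theta_i > 0:
            alpha[i] = s_i ** 2 / theta_i            # add (if inactive) or re-estimate (if active)
        elif np.isfinite(alpha[i]):
            alpha[i] = np.inf                        # delete a redundant basis function

    active = np.isfinite(alpha)
    Sigma = np.linalg.inv(np.diag(alpha[active]) + Phi[:, active].T @ Phi[:, active] / sigma2)
    mu = Sigma @ Phi[:, active].T @ y / sigma2
    return alpha, active, mu

# Usage on the synthetic data sketched earlier (assuming Phi, y, sigma2 are defined):
# alpha, active, mu = fast_marginal_likelihood(Phi, y, sigma2)
```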
The paper shows that this algorithm achieves similar MSE performance and finds an even sparser estimate for $w$ in less time than the old method. It’s quite clever to use the separable structure of the marginal likelihood to optimize the $\alpha_i$’s sequentially. The same idea could also be borrowed to solve similar problems with an independent prior on each element of the weight vector.