# Fast Kronecker trick for Gaussian Process regression with “expressive” kernels

On May 11th, I presented the following paper in lab meeting:

Fast Kernel Learning for Multidimensional Pattern Extrapolation
Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai and John P. Cunningham

This paper presents a method for scaling up structured spectral mixture (SM) kernels (from Wilson et al., 2013) for Gaussian Process regression to multi-dimensional settings in which many (but not all) of the input points live on a grid (e.g., a 2D image in which some pixels are missing). The spectral mixture is a very expressive kernel that enhances representation learning, but it is difficult to apply to large-scale data.

To address this problem, the authors developed the spectral mixture product (SMP) kernel, which is simply a product of one-dimensional SM kernels, one for each input dimension. This allows the full kernel matrix for points defined on a $P$-dimensional grid to be represented as a Kronecker product of kernel matrices along each dimension: $K = k_1 \otimes \cdots \otimes k_P$. The quadratic term and the log-determinant term in the GP log-likelihood can then be evaluated very efficiently using eigendecompositions of these component matrices, namely:

$$(K+\sigma^2 I)^{-1}y = Q(V+\sigma^2 I)^{-1}Q^\top y, \qquad \log|K+\sigma^2 I| = \sum_{i=1}^N \log(V_{ii}+\sigma^2),$$

where $K = QVQ^\top$ with $Q=Q^1\otimes Q^2\otimes \cdots \otimes Q^P$ (eigenvectors) and $V=V^1\otimes V^2\otimes \cdots \otimes V^P$ (a diagonal matrix of eigenvalues), and each small eigendecomposition $k_p = Q^p V^p (Q^p)^\top$ is cheap to compute.
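As a concrete illustration of the Kronecker trick, here is a small NumPy sketch (my own toy example, not code from the paper): eigendecompositions of two small per-dimension kernel matrices are combined to evaluate both log-likelihood terms without ever forming the full $N \times N$ matrix. All names and sizes are illustrative.

```python
import numpy as np

def rbf(x, lengthscale=1.0):
    # Toy stand-in for a 1D kernel; the paper uses SM kernels per dimension
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x1, x2 = np.linspace(0, 1, 15), np.linspace(0, 1, 20)
K1, K2 = rbf(x1), rbf(x2)          # per-dimension kernel matrices
sigma2 = 0.1                       # observation noise variance
y = rng.standard_normal(15 * 20)   # toy observations on the full grid

# Eigendecompose each small factor: K_p = Q_p diag(V_p) Q_p^T
V1, Q1 = np.linalg.eigh(K1)
V2, Q2 = np.linalg.eigh(K2)

# Eigenvalues of K = K1 (x) K2 are all pairwise products of factor eigenvalues
V = np.kron(V1, V2)

# log|K + sigma^2 I| = sum_i log(lambda_i + sigma^2)
logdet = np.sum(np.log(V + sigma2))

# Quadratic term: (K + sigma^2 I)^{-1} y = Q (V + sigma^2 I)^{-1} Q^T y,
# computed via small matrix multiplies instead of the full Kronecker product
Y = y.reshape(15, 20)
alpha = Q1 @ ((Q1.T @ Y @ Q2) / (V + sigma2).reshape(15, 20)) @ Q2.T
quad = y @ alpha.ravel()

# Check against the naive dense computation (feasible only at this tiny size)
K = np.kron(K1, K2)
assert np.isclose(logdet, np.linalg.slogdet(K + sigma2 * np.eye(300))[1])
assert np.isclose(quad, y @ np.linalg.solve(K + sigma2 * np.eye(300), y))
```

The dense check at the end exists only to verify the identity; at realistic grid sizes one would never materialize `np.kron(K1, K2)`.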

The key contribution of this paper is to show that if inputs are not on a grid — for example, some pixels from a training image are missing — one can insert “imaginary” points to complete the grid, and place infinite measurement noise on these observations so they have no effect on inference. Given a dataset of $M$ observations that are not necessarily on a grid, they form a complete grid using $W$ imaginary observations, $y_W\sim\mathcal{N}(f_W,\epsilon^{-1}I_W)$, $\epsilon\rightarrow 0$. The total observation vector $y = [y_M, y_W]^\top$ has $N = M + W$ entries: $y\sim\mathcal{N}(f, D_N)$, where the noise covariance matrix is $D_N =\mbox{diag}(D_M, \epsilon^{-1}I_W)$ with $D_M = \sigma^2 I_M$. It is simple to prove that the imaginary points have no effect on inference: the moments of the resulting predictive distribution are exactly the same as those of the standard predictive distribution.
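A quick numerical sanity check of this idea (again my own toy example, not the authors' code): pad a small 1D dataset out to a full grid with imaginary observations carrying near-infinite noise, and verify that the GP predictive mean is unchanged.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

rng = np.random.default_rng(1)
x_grid = np.linspace(0, 1, 12)       # the completed grid (N = 12 points)
obs = np.array([0, 2, 3, 5, 8, 11])  # indices of the M real observations
x_m = x_grid[obs]
y_m = np.sin(4 * x_m) + 0.05 * rng.standard_normal(obs.size)
x_star = np.linspace(0, 1, 50)       # test inputs
sigma2 = 0.05                        # real-observation noise variance

# Standard predictive mean from the M real points only
K_mm = rbf(x_m, x_m) + sigma2 * np.eye(obs.size)
mu_m = rbf(x_star, x_m) @ np.linalg.solve(K_mm, y_m)

# Grid-completed version: W imaginary points with noise 1/eps, eps -> 0
eps = 1e-10
y_n = np.zeros(x_grid.size)          # imaginary targets; their value is irrelevant
y_n[obs] = y_m
D = np.full(x_grid.size, 1.0 / eps)  # (near-)infinite noise on imaginary points
D[obs] = sigma2                      # real noise on the observed points
K_nn = rbf(x_grid, x_grid) + np.diag(D)
mu_n = rbf(x_star, x_grid) @ np.linalg.solve(K_nn, y_n)

# The imaginary points leave the predictive mean (numerically) unchanged
assert np.allclose(mu_m, mu_n, atol=1e-6)
```

Here the dense solve is used for clarity; the point of the paper is that the grid-completed $K_N$ has Kronecker structure that fast methods can exploit.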

However, the relevant matrices no longer have Kronecker structure, since the observations do not form a complete grid. The authors get around this by using the conjugate gradient method, an iterative method for solving linear systems, to compute the quadratic term of the GP log-likelihood. They compute the log-determinant term using an approximation.

In particular, the authors use preconditioned conjugate gradients to compute $(K_N + D_N)^{-1} y$ by solving $(K_N + D_N)x = y$ for $x$. For the log-determinant term, they use the approximation

$$\log|K_M + D_M| = \sum_{i=1}^M \log(\lambda_i^M + \sigma^2) \approx \sum_{i=1}^M \log(\tilde{\lambda}_i^M + \sigma^2),$$

where the eigenvalues $\lambda_i^M$ of $K_M$ are approximated by $\tilde{\lambda}_i^M = \frac{M}{N}\lambda_i^N$, with $\lambda_i^N$ the eigenvalues of the full grid matrix $K_N$, which are cheap to compute thanks to the Kronecker structure.

With these implementation tricks, they can scale highly expressive non-parametric kernels, in some cases with hundreds of hyperparameters, to datasets exceeding $N = 10^5$ training instances in minutes to tens of minutes. They obtain many beautiful results, including long-range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery.
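The conjugate-gradient piece can be sketched as follows (my own illustration, with plain rather than preconditioned CG, and hypothetical function names): the system matrix is never formed explicitly; CG only needs fast matrix-vector products, which the Kronecker structure provides.

```python
import numpy as np

def rbf(x, ls=0.3):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def kron_mv(K1, K2, v):
    """Compute (K1 (x) K2) v via small matrix products, never forming N x N."""
    n1, n2 = K1.shape[0], K2.shape[0]
    return (K1 @ v.reshape(n1, n2) @ K2.T).ravel()

def cg(matvec, b, tol=1e-10, maxiter=1000):
    """Plain conjugate gradients for A x = b, with A given only as a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p, rs = r.copy(), r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        a = rs / (p @ Ap)
        x, r = x + a * p, r - a * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p, rs = r + (rs_new / rs) * p, rs_new
    return x

rng = np.random.default_rng(2)
K1, K2 = rbf(np.linspace(0, 1, 10)), rbf(np.linspace(0, 1, 12))
N = 10 * 12
D = np.full(N, 0.1)   # per-point noise; imaginary points would get 1/eps here
y = rng.standard_normal(N)

# Solve (K_N + D_N) x = y using only fast Kronecker matvecs
x = cg(lambda v: kron_mv(K1, K2, v) + D * v, y)

# Agrees with a dense solve on this small example
A = np.kron(K1, K2) + np.diag(D)
assert np.allclose(x, np.linalg.solve(A, y), atol=1e-6)
```

Each CG iteration costs only a few small matrix multiplies, which is what makes the $N = 10^5$ scale reported in the paper feasible.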

The take-home message for me is that smart implementation tricks can allow GPs with expressive covariances to be applied to large-scale problems with non-gridded inputs.