# Deep kernel with recurrent structure

This week in lab meeting, we discussed the following work:

Learning Scalable Deep Kernels with Recurrent Structure
Al-Shedivat, Wilson, Saatchi, Hu, & Xing, JMLR 2017

This paper addressed the problem of learning a regression function that maps sequences to real-valued target vectors. Formally, the sequences of inputs are vectors of measurements $\overline{\mathbf{x}_1}=[\mathbf{x}^1]$$\overline{\mathbf{x}_2}=[\mathbf{x}^1,\mathbf{x}^2]$, …, $\overline{\mathbf{x}_n}=[\mathbf{x}^1,\mathbf{x}^2,\dots,\mathbf{x}^n]$, where $\mathbf{x}^i\in\mathcal{X}$, are indexed by time and would be of growing lengths. Let $\mathbf{y}=\{\mathbf{y}_i\}_{i=1}^n$, $\mathbf{y}_i\in\mathbb{R}^d$, be a collection of the corresponding real-valued target vectors. Assuming that only the most recent $L$ steps of a sequence are predictive of the targets, the sequences can be written as $\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]$, $i=1,...,n$. The goal is to learn a function, $f:\mathcal{X}^L\rightarrow\mathbb{R}^d$, based on the available data. The approach to learning the mapping function is to use Gaussian process.

The Gaussian process (GP) is a Bayesian nonparametric model that generalizes the Gaussian distributions to functions. They assumed the mapping function $f$ is sampled from a GP. Then a standard GP regression would be performed to learn $f$ as well as the posterior predictive distribution over $\mathbf{y}$. The inputs to the GP are $\{\overline{\mathbf{x}_i}\}_{i=1}^n$ and the output function values are  $\{{\mathbf{y}_i}\}_{i=1}^n$. However, standard kernels, e.g. RBF, Matern, or periodic kernel, are not able to take input variables with varying length. Moreover, standard kernels won’t fully utilize the structure in an input sequence.  Therefore, this paper proposed to use recurrent models to convert an input sequence $\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1}, \mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]$ to a latent representation $\mathbf{h}^L_i\in\mathcal{H}$, then map $\mathbf{h}_i^L$ to $\mathbf{y}_i$. Formally, a recurrent model expresses the mapping $f:\mathcal{X}^L\rightarrow \mathbb{R}^d$ as:

$\mathbf{y}_i=\psi(\mathbf{h}_i^L)+\epsilon^t, \mathbf{h}_i^t=\phi(\mathbf{h}^{t-1}_i+\mathbf{x}^{i-L+t})+\delta^t,t=1,...,L$

Combining the recurrent model with GP, the idea of deep kernel with recurrent model is that let $\phi:\mathcal{X}^L\rightarrow\mathcal{H}$ be a deterministic transformation, which is a recurrent model, mapping $\{\overline{\mathbf{x}_i}=[\mathbf{x}^{i-L+1},\mathbf{x}^{i-L+2},\dots,\mathbf{x}^i]\}_{i=1}^n$ to $\{\mathbf{h}^L_i\}_{i=1}^n$. Sequential structure in $\{\overline{\mathbf{x}_i}\}_{i=1}^n$ is integrated into the corresponding latent representations. Then a standard GP regression can be performed with input $\{\mathbf{h}^L_i\}_{i=1}^n$ and output $\{\mathbf{y}_i\}_{i=1}^n$.

If we denote $\tilde{k}:(\mathcal{X}^L)^2\rightarrow \mathbb{R}$ to be the kernel over the sequence space, and $k:\mathcal{H}^2\rightarrow \mathbb{R}$ to be the kernel defined on the latent space $\mathcal{H}$, then

$\tilde{k}(\overline{\mathbf{x}},\overline{\mathbf{x}}')=k(\phi(\overline{\mathbf{x}}),\phi(\overline{\mathbf{x}}'))=k(\mathbf{h}^L,{\mathbf{h}^L}')$

where $k(\mathbf{h}^L,{\mathbf{h}^L}')$ denotes the covariance between $\mathbf{y}$ and $\mathbf{y}'$ in GP.

By now deep kernel with general recurrent structure has been constructed. With respect to recurrent model, there are multiple choices. Recurrent neural networks (RNNs) model recurrent processes by using linear parametric maps followed by nonlinear activations. A major disadvantage of the vanilla RNNs is that their training is nontrivial due to the so-called vanishing gradient problem: the error back-propagated through t time steps diminishes exponentially which makes learning long-term relationships nearly impossible. Thus in their paper, they chose the long short-term memory (LSTM) mechanism as the recurrent model. LSTM places a memory cell into each hidden unit and uses differentiable gating variables. The gating mechanism not only improves the flow of errors through time, but also, allows the the network to decide whether to keep, erase, or overwrite certain memorized information based on the forward flow of inputs and the backward flow of errors. This mechanism adds stability to the network’s memory. Therefore, their final model is a combination of GP and LSTM, named as GP-LSTM.

To solve the model, they needed to infer two sets of parameters: the base kernel hyperparameters, $\theta$, and the parameters of the recurrent neural transformation, denoted $W$. In order to update $\theta$, the algorithm needs to invert the entire kernel matrix $K$ which requires the full dataset. However, once the kernel matrix $K$ is fixed, update of $W$ turns into a weighted sum of independent functions of each data point. One could compute a stochastic update for $W$ on a mini-batch of training points by only using the corresponding sub-matrix of $K$. Hence, they proposed to optimize GPs with recurrent kernels in a semi-stochastic fashion, alternating between updating the kernel hyperparameters, $\theta$, on the full data first, and then updating the weights of the recurrent network, $W$, using stochastic steps. Moreover, they observed that when the stochastic updates of $W$ are small enough, $K$ does not change much between mini-batches, and hence they can perform multiple stochastic steps for $W$ before re-computing the kernel matrix, $K$, and still converge. The final version of the optimization algorithm is given as follows:

To evaluate their model, they applied it to a number of tasks, including system identification, energy forecasting, and self-driving car applications. Quantitatively, the model was assessed on the data ranging in size from hundreds of points to almost a million with various signal-to-noise ratios demonstrating state-of-the-art performance and linear scaling of their approach. Qualitatively, the model was tested on consequential self-driving applications: lane estimation and lead vehicle position prediction. Overall, I think this paper achieved state-of-the-art performance on consequential applications involving sequential data, following straightforward and scalable approaches to building highly flexible Gaussian process.