A theoretical introduction to Gaussian processes¶
Motivation¶
Before training a neural network (deterministic or probabilistic) we have to select its architecture. For example, in the case of a multilayer perceptron (MLP), we need to choose the number of layers (depth) and the number of nodes (neurons) per layer (width)
Important
With this we are defining the neural network as a function with fixed structure
Increasing the width and depth gives the model more flexibility. But in general, we don’t know “how much flexibility” is needed for a particular problem
Note
The architecture is a collection of hyper-parameters
The good (and typical) practice is to find the “best structure” using a validation (holdout) dataset. But what if, instead of testing several architectures, we use a (non-parametric) model with no fixed structure?
This lesson presents the Gaussian Process (GP), a Bayesian non-parametric model that can be seen as a neural network with one hidden layer and potentially infinite width
Note
In general, non-parametric models automatically grow in complexity (width) with data
Note that non-parametric models do have prior distributions and tunable hyper-parameters. The difference is that the distribution of their parameters lives in an infinite-dimensional space. We use non-parametric models by integrating out (marginalizing) this infinite-dimensional distribution
See also
Other non-parametric models not covered in this lesson are the many variants of the
Dirichlet Process
infinite Hidden Markov Model
Indian Buffet Process
See Zoubin Ghahramani’s tutorial for a presentation of these methods
Defining the Gaussian process¶
Consider the probabilistic linear regression problem from previous lessons,
\[y_i = f_\theta(x_i) + \epsilon_i = \sum_{k=1}^M \theta_k \phi_k(x_i) + \epsilon_i,\]
where \(\epsilon_i\) is iid Gaussian noise and \(\{\phi_k\}_{k=1}^M\) is a collection of \(M\) (non-linear) basis functions
We can write this in matrix form as
\[\textbf{y} = \Phi(X)\,\theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, E),\]
where \(E\) is a diagonal (noise covariance) matrix and \(\Phi(X) \in \mathbb{R}^{N \times M}\) is the matrix of basis functions evaluated at the \(N\) inputs
Let’s set a Gaussian prior with zero mean and identity covariance for \(\theta\):
\[p(\theta) = \mathcal{N}(\theta | 0, I)\]
What is the distribution of \(f_\theta(X) = \Phi(X) \theta\) in this case? If \(\Phi\) is a deterministic transformation then the distribution of \(f\) is also Gaussian
By our previous definitions, the mean of \(p(f)\) is
\[\mathbb{E}[f] = \Phi(X)\,\mathbb{E}[\theta] = 0\]
and its covariance is
\[\mathbb{E}[f f^T] = \Phi(X)\,\mathbb{E}[\theta \theta^T]\,\Phi(X)^T = \Phi(X)\,\Phi(X)^T = K,\]
where \(K \in \mathbb{R}^{N\times N}\) is a symmetric and positive-definite matrix called the Gram matrix or Gramian matrix.
The \(ij\)-th element of the Gram matrix is
\[K_{ij} = \vec \phi(x_i)^T \vec \phi(x_j) = \kappa(x_i, x_j),\]
where \(\kappa(\cdot, \cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called the kernel. In general we will forget about \(\{\phi_k(\cdot)\}\) and work only with the kernel.
With all this we can finally write
\[p(f) = \mathcal{N}(f | 0, K),\]
which is a “prior over functions”.
Note
We have dropped the dependence on \(\theta\)
Important
\(f\) is a multivariate random variable (random process) with joint Gaussian distribution: \(f\) is a Gaussian Process
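To make this concrete, here is a minimal NumPy sketch (the Gaussian-bump basis functions, the grid and all variable names are illustrative choices, not part of the lesson) that draws one function in the “weight space” view, by sampling \(\theta\) and computing \(\Phi(X)\theta\), and another in the “function space” view, directly from \(\mathcal{N}(0, K)\) with \(K = \Phi(X)\Phi(X)^T\). Both are draws from the same GP prior.

```python
import numpy as np

# Inputs where the functions are evaluated
x = np.linspace(-3, 3, 100)

# Illustrative (non-linear) basis functions: M Gaussian bumps
centers = np.linspace(-3, 3, 10)
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)  # shape (N, M)

rng = np.random.default_rng(0)

# Weight-space view: theta ~ N(0, I), f = Phi @ theta
theta = rng.standard_normal(Phi.shape[1])
f_weight_space = Phi @ theta

# Function-space view: f ~ N(0, K) with K = Phi @ Phi.T (the Gram matrix)
K = Phi @ Phi.T
jitter = 1e-8 * np.eye(len(x))  # numerical stability for the rank-deficient K
f_function_space = rng.multivariate_normal(np.zeros(len(x)), K + jitter)
```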
Inference with a GP¶
Consider a dataset \(\textbf{x}=(x_1, x_2)\), \(\textbf{y}=(y_1, y_2)\). With this dataset we are interested in predicting (inferring) the value of \(f\) at a new observation \(x^*\), written \(f(x^*)\) or \(f^*\) for short. This corresponds to the posterior \(p(f^*|\textbf{y}, \textbf{x}, x^*)\)
As before we can write the joint (Gaussian) distribution between the dataset and the new sample as
\[\begin{pmatrix} \textbf{y} \\ f^* \end{pmatrix} \sim \mathcal{N}\left(0, \begin{pmatrix} K & K_*^T \\ K_* & K_{**} \end{pmatrix}\right),\]
where
\[K = \begin{pmatrix} \kappa(x_1, x_1) & \kappa(x_1, x_2) \\ \kappa(x_2, x_1) & \kappa(x_2, x_2) \end{pmatrix}\]
is the Gram matrix of the training inputs (a block of the joint covariance matrix) and
\[K_* = \begin{pmatrix} \kappa(x^*, x_1) & \kappa(x^*, x_2) \end{pmatrix}, \quad K_{**} = \kappa(x^*, x^*)\]
The Gaussian distribution is closed under conditioning, i.e. if we have a joint Gaussian distribution then the conditional distribution of one variable given the others is also Gaussian (nice step by step example). Here we use this property to write
\[p(f^*|\textbf{y}, \textbf{x}, x^*) = \mathcal{N}\left(K_* K^{-1} \textbf{y},\; K_{**} - K_* K^{-1} K_*^T\right),\]
which gives us the result we seek.
We can use Gaussian conditioning to predict on several “new data points” at the same time; we only need to compute the sub-Gram matrices within and between the training set and the test set
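As a rough sketch of these conditioning formulas in code, assuming the Gaussian kernel that is introduced in the next section (the helper `rbf_kernel`, the toy dataset and the jitter term are illustrative, not part of the lesson):

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, ell=1.0):
    """Gaussian (squared-exponential) kernel between two sets of scalar inputs."""
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

# Toy training set (two points, as in the example above) and test inputs
x_train = np.array([-1.0, 1.0])
y_train = np.array([0.5, -0.3])
x_test = np.linspace(-3, 3, 50)

# Sub-Gram matrices within/between the training and test sets
K = rbf_kernel(x_train, x_train)            # K
K_star = rbf_kernel(x_test, x_train)        # K_*
K_star_star = rbf_kernel(x_test, x_test)    # K_**

# Gaussian conditioning: posterior mean and covariance of f* given the data
K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(x_train)))  # jitter for stability
posterior_mean = K_star @ K_inv @ y_train
posterior_cov = K_star_star - K_star @ K_inv @ K_star.T
```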
The kernel¶
The GP is mainly defined by its covariance, also known as the Gram matrix,
\[K_{ij} = \kappa(x_i, x_j),\]
where the following relation between the kernel and the basis functions
\[\kappa(x_i, x_j) = \vec \phi(x_i)^T \vec \phi(x_j)\]
is known as the “kernel trick”.
Before, we defined a finite-dimensional \(\vec \phi\) and obtained \(\kappa\). But in general it is more interesting to skip \(\phi\) and design \(\kappa\) directly. We only need \(\kappa\) to be a symmetric and positive-definite function
The broadly used Gaussian kernel complies with these restrictions:
\[\kappa(x_i, x_j) = \sigma^2 \exp\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right),\]
where the hyperparameter \(\sigma\) controls the amplitude and \(\ell\) controls the length-scale of the interactions between samples.
Using a Taylor expansion of the exponential we can show that the (non-linear) basis function of this kernel is
\[\vec \phi(x) = \sigma\, e^{-\frac{x^2}{2\ell^2}} \left(1, \frac{x}{\ell \sqrt{1!}}, \frac{x^2}{\ell^2 \sqrt{2!}}, \frac{x^3}{\ell^3 \sqrt{3!}}, \ldots\right)^T,\]
i.e. the Gaussian kernel induces an infinite-dimensional basis function.
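A quick numerical sanity check of this claim (the truncation length, test points and helper name below are arbitrary illustrative choices, with \(\sigma = 1\)): the inner product of a truncated version of this basis function approaches the Gaussian kernel value.

```python
import numpy as np
from scipy.special import factorial

def truncated_phi(x, ell=1.0, n_terms=20):
    """First n_terms components of the (infinite) basis function of the Gaussian kernel."""
    k = np.arange(n_terms)
    return np.exp(-x**2 / (2 * ell**2)) * x**k / (ell**k * np.sqrt(factorial(k)))

x1, x2, ell = 0.3, -0.8, 1.0
approx = truncated_phi(x1, ell) @ truncated_phi(x2, ell)
exact = np.exp(-(x1 - x2)**2 / (2 * ell**2))
# approx and exact agree to many decimal places for moderate n_terms
```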
Note
A Gaussian process with Gaussian kernel has an implicit infinite dimensional parameter vector \(\theta\)
We are no longer explicitly choosing the structure of the function; by selecting a kernel we are choosing a general “behavior”. For example, the Gaussian kernel encodes the property of locality, i.e. closer data samples should have similar predictions.
Several other valid kernels exist which encode other properties such as trends and periodicity. The following picture from Mackay’s book shows some of them:
The top plots are Gaussian kernels with different lengthscales
The bottom left plot is a periodic kernel
The bottom right plot is a Gaussian kernel plus a linear kernel
Note
Sums and products of valid kernels are also valid kernels
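For example, a possible sketch of the “Gaussian kernel plus linear kernel” combination shown in the bottom-right panel, assuming standard forms for both kernels (all parameter values and names are illustrative):

```python
import numpy as np

def gaussian_kernel(xa, xb, sigma=1.0, ell=1.0):
    """Gaussian (squared-exponential) kernel."""
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

def linear_kernel(xa, xb, sigma_b=1.0, sigma_v=1.0):
    """Linear kernel: constant offset plus product of the inputs."""
    return sigma_b**2 + sigma_v**2 * xa[:, None] * xb[None, :]

# The sum of two valid kernels is itself a valid kernel
x = np.linspace(-3, 3, 100)
K = gaussian_kernel(x, x, sigma=1.0, ell=0.5) + linear_kernel(x, x)

# Draws from this composite prior show local wiggles on top of a linear trend
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```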
“Training” the GP¶
Fitting a GP to a dataset corresponds to finding the best combination of kernel hyperparameters, which can be done by maximizing the marginal likelihood
\[p(\textbf{y}|X) = \int p(\textbf{y}|f, X)\, p(f|X)\, df\]
For regression with iid Gaussian noise the marginal likelihood of \(\textbf{y}\) is also Gaussian,
\[p(\textbf{y}|X) = \mathcal{N}\left(\textbf{y} \,\big|\, 0, K + \sigma_\epsilon^2 I\right),\]
where the hyperparameter \(\sigma_\epsilon^2\) is the variance of the noise
Note
It is equivalent and much easier to maximize the log marginal likelihood:
\[\log p(\textbf{y}|X) = -\frac{1}{2} \textbf{y}^T \left(K + \sigma_\epsilon^2 I\right)^{-1} \textbf{y} - \frac{1}{2} \log \left|K + \sigma_\epsilon^2 I\right| - \frac{N}{2} \log 2\pi,\]
from which we compute derivatives with respect to the hyperparameters and update them through gradient descent
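A minimal sketch of this training procedure, assuming the Gaussian kernel and a small synthetic dataset; here `scipy.optimize.minimize` (L-BFGS) stands in for plain gradient descent, and all names and values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(xa, xb, sigma, ell):
    """Gaussian (squared-exponential) kernel."""
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :])**2 / ell**2)

def neg_log_marginal_likelihood(log_params, x, y):
    """Negative log marginal likelihood of GP regression with Gaussian noise."""
    sigma, ell, sigma_eps = np.exp(log_params)  # work in log space to keep them positive
    Ky = gaussian_kernel(x, x, sigma, ell) + (sigma_eps**2 + 1e-8) * np.eye(len(x))
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))                 # 0.5 * log|Ky|
            + 0.5 * len(x) * np.log(2 * np.pi))

# Small synthetic dataset (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

# "Train" the GP: maximize the log marginal likelihood (minimize its negative)
result = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
                  args=(x, y), method='L-BFGS-B')
sigma_opt, ell_opt, sigma_eps_opt = np.exp(result.x)
```

The Cholesky factorization is used instead of an explicit matrix inverse; this is the usual numerically stable way to evaluate the quadratic term and the log determinant.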
The following picture from Barber’s book shows three examples drawn from the GP prior (Gaussian kernel) on the left and the mean/variance of the GP posterior after training on the right
Relation between GPs and Neural Networks¶
(Neal, 1994) showed that a fully connected neural network with one hidden layer tends to a Gaussian process in the limit of infinite hidden units, as a consequence of the central limit theorem. The following shows a summary of the proof
More recently, several works have explored the relationship between GPs and Deep Neural Networks:
A recursive GP kernel that is related to infinitely-wide multi-layer dense neural networks. A similar relation is explored here. This paper relates the training of GPs and BNNs and shows that a GP kernel can be obtained from VI (Gaussian distributions), hence relating the posteriors
Deep Gaussian processes: Composition (stacks) of GPs. Note that this is not the same as function (kernel) composition. Excellent tutorial by Neil Lawrence that covers sparse and deep GPs. DGPs are less used in practice than BNNs due to their higher cost, but they can be combined as shown here, and more efficient methods to train them have also been proposed. Finally, the following is a pyro demonstration of a two-layer DGP trained with MCMC: https://fehiepsi.github.io/blog/deep-gaussian-process/
Neural Tangent Kernel (NTK): A modification to deep ensemble training so that deep ensembles can be interpreted as a GP predictive posterior in the infinite-width limit. The NTK was introduced in (Jacot, Gabriel and Hongler, 2018) as a description of the training evolution of deep NNs using kernel methods
Deep kernel learning: The inputs of a spectral mixture kernel are transformed with deep neural networks. The solution has a closed form and can replace existing kernels easily. Available as a pyro example
See also
For more on GP please see
Mackay’s book, chapter 45 on Gaussian Processes
Barber’s book, chapter 19 on Gaussian Processes
Rasmussen & Williams, “Gaussian Processes for Machine Learning”