Gaussian Processes in Shogun

By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de. Based on the GP framework of the Google summer of code 2013 project of Roman Votyakov - votjakovr@gmail.com - github.com/votjakovr, and the Google summer of code 2012 project of Jacob Walker - walke434@gmail.com - github.com/puffin444

This notebook is about Bayesian regression and classification models with Gaussian Process (GP) priors in Shogun. After providing a semi-formal introduction, we illustrate how to efficiently train them, use them for predictions, and automatically learn parameters.

In []:
# numpy and matplotlib provide the array and plotting functions used throughout this notebook
from numpy import *
from numpy.random import rand, randn
from matplotlib.pyplot import *

# import all Shogun classes
from modshogun import *

Some Formal Background (Skip if you just want code examples)

This notebook is about Bayesian regression models with Gaussian Process priors. A Gaussian Process (GP) over real valued functions on some domain \(\mathcal{X}\), \(f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}\), written as

\(\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')),\)

defines a distribution over real valued functions with mean value \(m(\mathbf{x})=\mathbb{E}[f(\mathbf{x})]\) and inter-function covariance \(k(\mathbf{x},\mathbf{x}')=\mathbb{E}[(f(\mathbf{x})-m(\mathbf{x}))(f(\mathbf{x}')-m(\mathbf{x}'))]\). This intuitively means that the function value at any point \(\mathbf{x}\), i.e., \(f(\mathbf{x})\), is a random variable with mean \(m(\mathbf{x})\); if you take the average of infinitely many functions from the Gaussian Process and evaluate them at \(\mathbf{x}\), you will get this value. Similarly, the function values at two different points \(\mathbf{x}, \mathbf{x}'\) have covariance \(k(\mathbf{x}, \mathbf{x}')\). The formal definition is that a Gaussian Process is a (possibly infinite) collection of random variables of which any finite subset has a joint Gaussian distribution.
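
As a concrete example, for any two inputs \(\mathbf{x}_1, \mathbf{x}_2\in\mathcal{X}\), the corresponding function values jointly follow a bivariate Gaussian,

\(\begin{pmatrix}f(\mathbf{x}_1)\\ f(\mathbf{x}_2)\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}m(\mathbf{x}_1)\\ m(\mathbf{x}_2)\end{pmatrix},\begin{pmatrix}k(\mathbf{x}_1,\mathbf{x}_1) & k(\mathbf{x}_1,\mathbf{x}_2)\\ k(\mathbf{x}_2,\mathbf{x}_1) & k(\mathbf{x}_2,\mathbf{x}_2)\end{pmatrix}\right),\)

and the same construction extends to any finite number of inputs.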

One can model data with Gaussian Processes by defining a joint distribution over the observed data \(\mathbf{y}\), the latent GP variables \(\mathbf{f}\), and the hyper-parameters \(\boldsymbol{\theta}\).

The joint distribution takes the form

\(p(\mathbf{f},\mathbf{y},\theta)=p(\boldsymbol{\theta})p(\mathbf{f}|\boldsymbol{\theta})p(\mathbf{y}|\mathbf{f}),\)

where \(\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta)\) is the joint Gaussian distribution for the GP variables, with mean \(\mathbf{m}_\boldsymbol{\theta}\) and covariance \(\mathbf{C}_\theta\). The \((i,j)\)-th entry of \(\mathbf{C}_\boldsymbol{\theta}\) is given by the covariance or kernel between the \(i\)-th and \(j\)-th covariates, \(k(\mathbf{x}_i, \mathbf{x}_j)\). Examples of kernel and mean functions are given later in the notebook.

Mean and covariance both depend on hyper-parameters coming from a prior distribution \(\boldsymbol{\theta}\sim p(\boldsymbol{\theta})\). The data itself, \(\mathbf{y}\in \mathcal{Y}^n\) (no assumptions on \(\mathcal{Y}\) for now), is modelled by a likelihood function \(p(\mathbf{y}|\mathbf{f})\), which gives the probability of the data \(\mathbf{y}\) given a state of the latent Gaussian variables \(\mathbf{f}\), i.e. \(p(\mathbf{y}|\mathbf{f}):\mathcal{Y}^n\rightarrow [0,1]\).

In order to do inference for a new, unseen covariate \(\mathbf{x}^*\in\mathcal{X}\), i.e., predicting its label \(y^*\in\mathcal{Y}\), or in particular computing the predictive distribution for that label, we have to integrate over the posterior over the latent Gaussian variables (assume a fixed \(\boldsymbol{\theta}\) for now, which means you can just ignore the symbol in the following if you want),

\(p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}.\)

This posterior, \(p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})\), can be obtained using standard Bayes-Rule as

\(p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})=\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})}{p(\mathbf{y}|\boldsymbol{\theta})},\)

with the so called evidence or marginal likelihood \(p(\mathbf{y}|\boldsymbol{\theta})\) given as another integral over the prior over the latent Gaussian variables

\(p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}\).

In order to solve the above integrals, Shogun offers a variety of approximations. Don't worry, you will not have to deal with these nasty integrals on your own; everything is hidden within Shogun. Still, if you would like to play with these objects, you can also compute the individual parts of the model on their own.

Note that in the above description, we did not make any assumptions on the input space \(\mathcal{X}\). As long as you define mean and covariance functions and a likelihood, your data can have any form you like. In fact, Shogun is able to deal with standard dense numerical data, sparse data, strings of any type, and much more out of the box. We will provide some examples below.

To gain some intuition how these latent Gaussian variables behave, and how to model data with them, see the regression part of this notebook.

Non-Linear Bayesian Regression

Bayesian regression with Gaussian Processes is among the most fundamental applications of latent Gaussian models. As usual, the observed data come from a continuous space, i.e. \(\mathbf{y}\in\mathbb{R}^n\), which is represented in the Shogun class CRegressionLabels. We assume that these observations come from some distribution \(p(\mathbf{y}|\mathbf{f})\) that is based on a fixed state of latent Gaussian response variables \(\mathbf{f}\in\mathbb{R}^n\). In fact, we assume that the true model is the latent Gaussian response variable (which defines a distribution over functions) plus some Gaussian observation noise, which is modelled by the likelihood as

\(p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{f},\sigma^2\mathbf{I})\)

This simple likelihood is implemented in the Shogun class CGaussianLikelihood. It is the well known bell curve. Below, we plot the likelihood as a function of \(\mathbf{y}\), for \(n=1\).
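
For a single observation \(y_i\) with latent value \(f_i\) (the \(n=1\) case plotted below), this is just the familiar univariate Gaussian density

\(p(y_i|f_i)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-f_i)^2}{2\sigma^2}\right).\)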

In []:
# plot likelihood for three different noise levels $\sigma$ (the standard deviation, which is squared internally)
sigmas=array([0.5,1,2])

# likelihood instance
lik=GaussianLikelihood()

# A set of labels to consider
lab=RegressionLabels(linspace(-4.0,4.0, 200))

# a vector of latent function values, one per label, all set to zero
# this avoids doing a loop in python which would be slow
F=zeros(lab.get_num_labels())

# plot likelihood for all observations noise levels
figure(figsize=(12, 4))
for sigma in sigmas:
    # set observation noise, this is squared internally
    lik.set_sigma(sigma)
    
    # compute log-likelihood for all labels
    log_liks=lik.get_log_probability_f(lab, F)
    
    # plot likelihood functions, exponentiate since they were computed in log-domain
    plot(lab.get_labels(), exp(log_liks))
    
ylabel("$p(y_i|f_i)$")
xlabel("$y_i$")
title("Regression Likelihoods for different observation noise levels")
_=legend(["sigma=$%.1f$" % sigma for sigma in sigmas])

Apart from its appealing form, this curve has the nice property of giving rise to analytical solutions to the required integrals. Recall these are given by

\(p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},\)

and

\(p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}\).

Since all involved elements, the likelihood \(p(\mathbf{y}|\mathbf{f})\) and the GP prior \(p(\mathbf{f}|\boldsymbol{\theta})\), are Gaussian, the same holds for the GP posterior \(p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})\) and the marginal likelihood \(p(\mathbf{y}|\boldsymbol{\theta})\). Therefore, we just need to sit down with pen and paper to derive the resulting forms of the Gaussian distributions of these objects (see references). Luckily, everything is already implemented in Shogun.
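
For reference, with a zero mean function and training kernel matrix \(\mathbf{C}_\boldsymbol{\theta}\), the standard closed-form result (not derived here) for the predictive distribution at a test covariate \(\mathbf{x}^*\) is

\(p(y^*|\mathbf{y},\boldsymbol{\theta})=\mathcal{N}\left(\mathbf{k}_*^T(\mathbf{C}_\boldsymbol{\theta}+\sigma^2\mathbf{I})^{-1}\mathbf{y},\ k(\mathbf{x}^*,\mathbf{x}^*)-\mathbf{k}_*^T(\mathbf{C}_\boldsymbol{\theta}+\sigma^2\mathbf{I})^{-1}\mathbf{k}_*+\sigma^2\right),\)

where \(\mathbf{k}_*\) is the vector of kernel values between \(\mathbf{x}^*\) and the training covariates. This is the quantity the exact inference method used below computes for us.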

In order to get some intuition about Gaussian Processes in general, let us first have a look at these latent Gaussian variables, which define a probability distribution over real valued functions \(f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}\), where in the regression case, \(\mathcal{X}=\mathbb{R}\).

As mentioned above, the joint distribution of a finite number (say \(n\)) of variables \(\mathbf{f}\in\mathbb{R}^n\) from a Gaussian Process \(\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}'))\), takes the form

\(\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta),\)

where \(\mathbf{m}_\theta\) is the vector of mean function values at the input covariates and \(\mathbf{C}_\theta\) is the pairwise covariance or kernel matrix of the input covariates \(\mathbf{x}_i\). This means we can easily sample function realisations \(\mathbf{f}^{(j)}\) from the Gaussian Process and, more importantly, visualise them.
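
Sampling from such a multivariate Gaussian is typically done via a Cholesky factorisation of the covariance matrix. Below we use Shogun's built-in sampler, but for intuition, here is a minimal NumPy sketch (the helper name sample_gp_prior is hypothetical, not part of Shogun):

In []:
# minimal NumPy sketch of sampling f ~ N(m, C) via a Cholesky factorisation C = L L^T
# (for intuition only; the notebook uses Statistics.sample_from_gaussian below)
def sample_gp_prior(m, C, n_samples=3):
    # small diagonal jitter for numerical stability of the factorisation
    L=linalg.cholesky(C+1e-10*eye(len(C)))
    # each row is one sample: f = m + L z with z ~ N(0, I)
    return m+randn(n_samples, len(m)).dot(L.T)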

To this end, let us consider the well-known and often used Gaussian kernel or squared exponential covariance, which is implemented in the Shogun class CGaussianKernel in the parametric form (note that there are other parametrisations in the literature)

\(k(\mathbf{x}, \mathbf{x}')=\exp\left( -\frac{||\mathbf{x}-\mathbf{x}'||_2^2}{\tau}\right),\)

where \(\tau\) is a hyper-parameter of the kernel. We will also use the constant CZeroMean mean function, which is suitable if the data's mean is zero (which can be achieved by removing it from the data).
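
As a quick sanity check of the formula above, the kernel value between two points can also be evaluated directly in NumPy and compared against the kernel matrices computed below (the helper gaussian_kernel_value is hypothetical, just for illustration):

In []:
# evaluate the squared exponential kernel k(x,x')=exp(-||x-x'||^2/tau) directly
def gaussian_kernel_value(x1, x2, tau):
    return exp(-sum((x1-x2)**2)/tau)

# similarity decays with distance and grows with the bandwidth tau
print gaussian_kernel_value(array([0.0]), array([1.0]), 4.0)
print gaussian_kernel_value(array([0.0]), array([3.0]), 4.0)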

Let us consider some toy regression data in the form of a sine wave, which is observed at random points with some observation noise.

In []:
def generate_regression_toy_data(n=50, n_test=500, x_range=15, x_range_test=20, noise_var=0.4):
    # training and test sine wave; the test set covers a larger range with more points
    X_train = rand(n)*x_range
    X_test = linspace(0, x_range_test, n_test)
    
    # noise-free ground truth and noisy training observations
    y_test = sin(X_test)
    y_train = sin(X_train)+random.randn(n)*noise_var
    
    return X_train, y_train, X_test, y_test
In []:
X_train, y_train, X_test, y_test = generate_regression_toy_data()
figure(figsize=(16,4))
plot(X_train, y_train, 'ro')
plot(X_test, y_test)
legend(["Noisy observations", "True model"])
title("One-Dimensional Toy Regression Data")
xlabel("$\mathbf{x}$")
_=ylabel("$\mathbf{y}$")

First, we compute the kernel matrix \(\mathbf{C}_\boldsymbol{\theta}\) using the CGaussianKernel with hyper-parameter \(\boldsymbol{\theta}=\{\tau\}\), for a few different values of \(\tau\). Note that in Gaussian Processes, kernels usually have a scaling parameter. We skip this one for now and cover it later.

In []:
# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)

# compute covariances for different kernel parameters
taus=asarray([.1,4.,32.])
Cs=zeros((len(X_train), len(X_train), len(taus)))
for i in range(len(taus)):
    # compute unscaled kernel matrix (first parameter is the kernel cache size and not very important)
    kernel=GaussianKernel(10, taus[i])
    kernel.init(feats_train, feats_train)
    Cs[:,:,i]=kernel.get_kernel_matrix()


# plot
figure(figsize=(16,5))
for i in range(len(taus)):
    subplot(1,len(taus),i+1)
    imshow(Cs[:,:,i], interpolation="nearest")
    xlabel("Covariate index")
    ylabel("Covariate index")
    _=title("tau=%.1f" % taus[i])

This matrix, as any kernel or covariance matrix, is positive semi-definite and symmetric. It can be viewed as a similarity matrix. Here, elements on the diagonal (corresponding to \(\mathbf{x}=\mathbf{x}'\)) have largest similarity. For increasing kernel bandwidth \(\tau\), more and more elements become similar. Evaluated at the training covariates, this matrix fully specifies the distribution of the finite set of latent Gaussian variables \(\mathbf{f}\), which represent a distribution over functions \(f(\mathbf{x}):\mathcal{X}\rightarrow\mathbb{R}\); we can sample from it and plot the result. To this end, we use the Shogun class CStatistics, which offers a method to sample from multivariate Gaussians.
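
But first, a quick check of the symmetry and positive semi-definiteness claim (a minimal NumPy sketch, not needed for the rest of the notebook; the smallest eigenvalue may be very slightly negative due to floating point):

In []:
# kernel matrices should be symmetric with non-negative eigenvalues
for i in range(len(taus)):
    C=Cs[:,:,i]
    print "tau=%.1f: symmetric=%s, smallest eigenvalue=%.2e" % (taus[i], allclose(C, C.T), linalg.eigvalsh(C).min())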

In []:
figure(figsize=(16,5))
suptitle("Random Samples from GP prior")
for i in range(len(taus)):
    subplot(1,len(taus),i+1)
    
    # sample a bunch of latent functions from the Gaussian Process
    # note these vectors are stored row-wise
    F=Statistics.sample_from_gaussian(zeros(len(X_train)), Cs[:,:,i], 3)
    
    # sort points so lines connect neighbouring covariates
    sorted_idx=X_train.argsort()
    for j in range(len(F)):
        plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)
    
    xlabel("$\mathbf{x}_i$")
    ylabel("$f(\mathbf{x}_i)$")
    _=title("tau=%.1f" % taus[i])

Note how the functions are evaluated exactly at the training covariates \(\mathbf{x}_i\), which are randomly distributed on the x-axis. Even though these points do not visualise the full functions (we can only evaluate them at a finite number of points, but we connected the points with lines to make it clearer), this reveals that larger values of the kernel bandwidth \(\tau\) lead to smoother latent Gaussian functions.

In the above plots, all functions are equally possible. That is, the prior of the latent Gaussian variables \(\mathbf{f}|\boldsymbol{\theta}\) does not favour any particular function. Computing the posterior given our training data, the distribution over \(\mathbf{f}|\mathbf{y},\boldsymbol{\theta}\), then corresponds to restricting the above distribution over functions to those that explain the training data (up to observation noise). We will now use the Shogun class CExactInferenceMethod to do exactly this. The class is the general basis of exact GP regression in Shogun. We have to pass it all parts of the Gaussian Process: kernel, features, mean function, labels, and likelihood.

In []:
figure(figsize=(16,5))
suptitle("Random Samples from GP posterior")
for i in range(len(taus)):
    subplot(1,len(taus),i+1)

    # create exact inference method instance (Gaussian likelihood with default observation noise)
    inf=ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
    
    C_post=inf.get_posterior_covariance()
    m_post=inf.get_posterior_mean()

    # sample a bunch of latent functions from the Gaussian Process
    # note these vectors are stored row-wise
    F=Statistics.sample_from_gaussian(m_post, C_post, 5)
    
    # sort points so lines connect neighbouring covariates
    sorted_idx=X_train.argsort()
    for j in range(len(F)):
        plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)

    # plot training data on top of the posterior samples
    plot(X_train, y_train, 'r*')
    
    xlabel("$\mathbf{x}_i$")
    ylabel("$f(\mathbf{x}_i)$")
    _=title("tau=%.1f" % taus[i])

Note how the above function samples are constrained to go through our training data labels (up to observation noise), as much as their smoothness allows. In fact, these are already samples from the predictive distribution, which gives a probability for a label \(\mathbf{y}^*\) for any covariate \(\mathbf{x}^*\). These distributions are Gaussian (!), nice to look at, and extremely useful for understanding the GP's underlying model. Let's plot them. We finally use the Shogun class CGaussianProcessRegression, which wraps the whole GP in a single interface to perform inference with. In addition, we use the helper class CGaussianDistribution to evaluate the log-likelihood for every test point's \(\mathbf{x}^*_j\) value \(\mathbf{y}_j^*\).

In []:
# helper function that plots predictive distribution and data
def plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances):
    # evaluate predictive distribution in this range of y-values and preallocate predictive distribution
    y_values=linspace(-3,3)
    D=zeros((len(y_values), len(X_test)))
    
    # evaluate normal distribution at every prediction point (column)
    for j in range(shape(D)[1]):
        # create Gaussian distribution instance; expects mean vector and covariance matrix, hence the reshapes
        gauss=GaussianDistribution(array(means[j]).reshape(1,), array(variances[j]).reshape(1,1))
    
        # evaluate predictive distribution for test point, method expects matrix
        D[:,j]=exp(gauss.log_pdf_multiple(y_values.reshape(1,len(y_values))))
    
    pcolor(X_test,y_values,D)
    colorbar()
    contour(X_test,y_values,D)
    plot(X_test,y_test, 'b', linewidth=3)
    plot(X_test,means, 'm--', linewidth=3)
    plot(X_train, y_train, 'ro')
    legend(["Truth", "Prediction", "Data"])
In []:
figure(figsize=(18,10))
suptitle("GP inference for different kernel widths")
for i in range(len(taus)):
    subplot(len(taus),1,i+1)
    
    # create GP instance using inference method and train
    # use Shogun objects from above
    inf.set_kernel(GaussianKernel(10,taus[i]))
    gp=GaussianProcessRegression(inf)
    gp.train()
    
    # predict labels for all test data (this produces the same values as the mean vector below)
    y_predicted = gp.apply(feats_test).get_labels()
    
    # extract means and variance of predictive distribution for all test points
    means = gp.get_mean_vector(feats_test)
    variances = gp.get_variance_vector(feats_test)
    
    # note: y_predicted == means
    
    # plot predictive distribution and training data
    plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
    _=title("tau=%.1f" % taus[i])

The question now is: which set of hyper-parameters \(\boldsymbol{\theta}=\{\tau, \gamma, \sigma\}\) should we take, where \(\gamma\) is the kernel scaling (which we omitted so far), and \(\sigma\) is the observation noise (which we left at its default value of one)? The question of model selection will be handled in a bit more depth in the binary classification case. For now we just show code for how to do it as a black box. See below for explanations.
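
Concretely, the gradient-based search below maximises the marginal likelihood (evidence) from above with respect to the hyper-parameters, which is why this approach is often called maximum likelihood II or evidence maximisation,

\(\hat{\boldsymbol{\theta}}=\arg\max_{\boldsymbol{\theta}}\log p(\mathbf{y}|\boldsymbol{\theta}).\)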

In []:
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
gp = GaussianProcessRegression(inf)

# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)
grad.set_function(inf)

# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)

# search for best parameters and store them
best_combination = grad_search.select_model()

# apply best parameters to GP, train
best_combination.apply_to_machine(gp)

# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()
print "Selected tau (kernel bandwidth):", best_width
print "Selected gamma (kernel scaling):", best_scale
print "Selected sigma (observation noise):", best_sigma
Selected tau (kernel bandwidth): 1.94226404034
Selected gamma (kernel scaling): 0.779501690081
Selected sigma (observation noise): 0.330432818981

Now we can train the GP with the selected parameters and plot the resulting predictive distribution.

In []:
# train gp
gp.train()

# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)

# plot predictive distribution
figure(figsize=(18,5))
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=title("Maximum Likelihood II based inference")