In [1]:

```
%matplotlib inline
# import all shogun classes
from modshogun import *
```

This notebook is about Bayesian regression models with Gaussian Process priors. A Gaussian Process (GP) over real valued functions on some domain $\mathcal{X}$, $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, written as

$\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')),$

defines a distribution over real valued functions with mean value $m(\mathbf{x})=\mathbb{E}[f(\mathbf{x})]$ and inter-function covariance $k(\mathbf{x},\mathbf{x}')=\mathbb{E}[(f(\mathbf{x})-m(\mathbf{x}))^T(f(\mathbf{x})-m(\mathbf{x})]$. This intuitively means tha the function value at any point $\mathbf{x}$, i.e., $f(\mathbf{x})$ is a random variable with mean $m(\mathbf{x})$; if you take the average of infinitely many functions from the Gaussian Process, and evaluate them at $\mathbf{x}$, you will get this value. Similarily, the function values at two different points $\mathbf{x}, \mathbf{x}'$ have covariance $k(\mathbf{x}, \mathbf{x}')$. The formal definition is that Gaussian Process is a collection of random variables (may be infinite) of which any finite subset have a joint Gaussian distribution.

One can model data with Gaussian Processes via defining a joint distribution over

- $n$ data (labels in Shogun) $\mathbf{y}\in \mathcal{Y}^n$, from a $n$ dimensional continous (regression) or discrete (classification) space. These data correspond to $n$ covariates $\mathbf{x}_i\in\mathcal{X}$ (features in Shogun) from the input space $\mathcal{X}$.
- Hyper-parameters $\boldsymbol{\theta}$ which depend on the used model (details follow).
- Latent Gaussian variables $\mathbf{f}\in\mathbb{R}^n$, coming from a GP, i.e., they have a joint Gaussian distribution. Every entry $f_i$ corresponds to the GP function $f(\mathbf{x_i})$ evaluated at covariate $\mathbf{x}_i$ for $1\leq i \leq n$.

The joint distribution takes the form

$p(\mathbf{f},\mathbf{y},\theta)=p(\boldsymbol{\theta})p(\mathbf{f}|\boldsymbol{\theta})p(\mathbf{y}|\mathbf{f}),$

where $\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta)$ is the joint Gaussian distribution for the GP variables, with mean $\mathbf{m}_\boldsymbol{\theta}$ and covariance $\mathbf{C}_\theta$. The $(i,j)$-th entriy of $\mathbf{C}_\boldsymbol{\theta}$ is given by the covariance or kernel between the $(i,j)$-th covariates $k(\mathbf{x}_i, \mathbf{x}_j)$. Examples for kernel and mean functions are given later in the notebook.

Mean and covariance are both depending on hyper-parameters coming from a prior distribution $\boldsymbol{\theta}\sim p(\boldsymbol{\theta})$. The data itself $\mathbf{y}\in \mathcal{Y}^n$ (no assumptions on $\mathcal{Y}$ for now) is modelled by a likelihood function $p(\mathbf{y}|\mathbf{f})$, which gives the probability of the data $\mathbf{y}$ given a state of the latent Gaussian variables $\mathbf{f}$, i.e. $p(\mathbf{y}|\mathbf{f}):\mathcal{Y}^n\rightarrow [0,1]$.

In order to do inference for a new, unseen covariate $\mathbf{x}^*\in\mathcal{X}$, i.e., predicting its label $y^*\in\mathcal{Y}$ or in particular computing the predictive distribution for that label, we have integrate over the posterior over the latent Gaussian variables (assume fixed $\boldsymbol{\theta}$ for now, which means you can just ignore the symbol in the following if you want),

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}.$

This posterior, $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$, can be obtained using standard Bayes-Rule as

$p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})=\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})}{p(\mathbf{y}|\boldsymbol{\theta})},$

with the so called *evidence* or *marginal likelihood* $p(\mathbf{y}|\boldsymbol{\theta})$ given as another integral over the prior over the latent Gaussian variables

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

In order to solve the above integrals, Shogun offers a variety of approximations. Don't worry, you will not have to deal with these nasty integrals on your own, but everything is hidden within Shogun. Though, if you like to play with these objects, you will be able to compute only parts.

Note that in the above description, we did not make any assumptions on the input space $\mathcal{X}$. As long as you define mean and covariance functions, and a likelihood, your data can have *any* form you like. Shogun in fact is able to deal with standard dense numerical data, sparse data, and strings of *any* type, and many more *out of the box*. We will provide some examples below.

To gain some intuition how these latent Gaussian variables behave, and how to model data with them, see the regression part of this notebook.

Bayesian regression with Gaussian Processes is among the most fundamental applications of latent Gaussian models. As usual, the oberved data come from a contintous space, i.e. $\mathbf{y}\in\mathbb{R}^n$, which is represented in the Shogun class CRegressionLabels. We assume that these observations come from some distribution $p(\mathbf{y}|\mathbf{f)}$ that is based on a fixed state of latent Gaussian response variables $\mathbf{f}\in\mathbb{R}^n$. In fact, we assume that the true model *is* the latent Gaussian response variable (which defined a *distribution* over functions; plus some Gaussian observation noise which is modelled by the likelihood as

$p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{f},\sigma^2\mathbf{I})$

This simple likelihood is implemented in the Shogun class CGaussianLikelihood. It is the well known bell curve. Below, we plot the likelihood as a function of $\mathbf{y}$, for $n=1$.

In [2]:

```
# plot likelihood for three different noise lebels $\sigma$ (which is not yet squared)
sigmas=array([0.5,1,2])
# likelihood instance
lik=GaussianLikelihood()
# A set of labels to consider
lab=RegressionLabels(linspace(-4.0,4.0, 200))
# A single 1D Gaussian response function, repeated once for each label
# this avoids doing a loop in python which would be slow
F=zeros(lab.get_num_labels())
# plot likelihood for all observations noise levels
figure(figsize=(12, 4))
for sigma in sigmas:
# set observation noise, this is squared internally
lik.set_sigma(sigma)
# compute log-likelihood for all labels
log_liks=lik.get_log_probability_f(lab, F)
# plot likelihood functions, exponentiate since they were computed in log-domain
plot(lab.get_labels(), exp(log_liks))
ylabel("$p(y_i|f_i)$")
xlabel("$y_i$")
title("Regression Likelihoods for different observation noise levels")
_=legend(["sigma=$%.1f$" % sigma for sigma in sigmas])
```

Apart from its apealling form, this curve has the nice property of given rise to *analytical* solutions to the required integrals. Recall these are given by

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},$

and

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

Since all involved elements, the likelihood $p(\mathbf{y}|\mathbf{f})$, the GP prior $p(\mathbf{f}|\boldsymbol{\theta})$ are Gaussian, the same follows for the GP posterior $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$, and the marginal likelihood $p(\mathbf{y}|\boldsymbol{\theta})$. Therefore, we just need to sit down with pen and paper to derive the resulting forms of the Gaussian distributions of these objects (see references). Luckily, everything is already implemented in Shogun.

In order to get some intuition about Gaussian Processes in general, let us first have a look at these latent Gaussian variables, which define a probability distribution over real values functions $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, where in the regression case, $\mathcal{X}=\mathbb{R}$.

As mentioned above, the joint distribution of a finite number (say $n$) of variables $\mathbf{f}\in\mathbb{R}^n$ from a Gaussian Process $\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}'))$, takes the form

$\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta),$

where $\mathbf{m}_\theta$ is the mean function's mean and $\mathbf{C}_\theta$ is the pairwise covariance or kernel matrix of the input covariates $\mathbf{x}_i$. This means, we can easily sample function realisations $\mathbf{f}^{(j)}$ from the Gaussian Process, and more important, visualise them.

To this end, let us consider the well-known and often used *Gaussian Kernel* or *squared exponential covariance*, which is implemented in the Shogun class CGaussianKernel in the parametric from (note that there are other forms in the literature)

$ k(\mathbf{x}, \mathbf{x}')=\exp\left( -\frac{||\mathbf{x}-\mathbf{x}'||_2^2}{\tau}\right),$

where $\tau$ is a hyper-parameter of the kernel. We will also use the constant CZeroMean mean function, which is suitable if the data's mean is zero (which can be achieved via removing it).

Let us consider some toy regression data in the form of a sine wave, which is observed at random points with some observations noise.

In [3]:

```
def generate_regression_toy_data(n=50, n_test=100, x_range=15, x_range_test=20, noise_var=0.4):
# training and test sine wave, test one has more points
X_train = rand(n)*x_range
X_test = linspace(0,x_range_test, 500)
# add noise to training observations
y_test = sin(X_test)
y_train = sin(X_train)+random.randn(n)*noise_var
return X_train, y_train, X_test, y_test
```

In [4]:

```
X_train, y_train, X_test, y_test = generate_regression_toy_data()
figure(figsize=(16,4))
plot(X_train, y_train, 'ro')
plot(X_test, y_test)
legend(["Noisy observations", "True model"])
title("One-Dimensional Toy Regression Data")
xlabel("$\mathbf{x}$")
_=ylabel("$\mathbf{y}$")
```

In [5]:

```
# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)
# compute covariances for different kernel parameters
taus=asarray([.1,4.,32.])
Cs=zeros(((len(X_train), len(X_train), len(taus))))
for i in range(len(taus)):
# compute unscalled kernel matrix (first parameter is maximum size in memory and not very important)
kernel=GaussianKernel(10, taus[i])
kernel.init(feats_train, feats_train)
Cs[:,:,i]=kernel.get_kernel_matrix()
# plot
figure(figsize=(16,5))
for i in range(len(taus)):
subplot(1,len(taus),i+1)
imshow(Cs[:,:,i], interpolation="nearest")
xlabel("Covariate index")
ylabel("Covariate index")
_=title("tau=%.1f" % taus[i])
```

In [6]:

```
figure(figsize=(16,5))
suptitle("Random Samples from GP prior")
for i in range(len(taus)):
subplot(1,len(taus),i+1)
# sample a bunch of latent functions from the Gaussian Process
# note these vectors are stored row-wise
F=Statistics.sample_from_gaussian(zeros(len(X_train)), Cs[:,:,i], 3)
for j in range(len(F)):
# sort points to connect the dots with lines
sorted_idx=X_train.argsort()
plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)
xlabel("$\mathbf{x}_i$")
ylabel("$f(\mathbf{x}_i)$")
_=title("tau=%.1f" % taus[i])
```

Note how the functions are exactly evaluated at the training covariates $\mathbf{x}_i$ which are randomly distributed on the x-axis. Even though these points do not visualise the full functions (we can only evaluate them at a *finite* number of points, but we connected the points with lines to make it more clear), this reveils that larger values of the kernel bandwidth $\tau$ lead to smoother latent Gaussian functions.

In the above plots all functions are equally possible. That is, the prior of the latent Gaussian variables $\mathbf{f}|\boldsymbol{\theta}$ does not favour any particular function setups. Computing the posterior given our training data, the distribution ober $\mathbf{f}|\mathbf{y},\boldsymbol{\theta}$ then corresponds to restricting the above distribution over functions to those that explain the training data (up to observation noise). We will now use the Shogun class CExactInferenceMethod to do exactly this. The class is the general basis of exact GP regression in Shogun. We have to define all parts of the Gaussian Process for the inference method.

In [7]:

```
figure(figsize=(16,5))
suptitle("Random Samples from GP posterior")
for i in range(len(taus)):
subplot(1,len(taus),i+1)
# create inference method instance with very small observation noise to make
inf=ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
C_post=inf.get_posterior_covariance()
m_post=inf.get_posterior_mean()
# sample a bunch of latent functions from the Gaussian Process
# note these vectors are stored row-wise
F=Statistics.sample_from_gaussian(m_post, C_post, 5)
for j in range(len(F)):
# sort points to connect the dots with lines
sorted_idx=sorted(range(len(X_train)),key=lambda x:X_train[x])
plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)
plot(X_train, y_train, 'r*')
xlabel("$\mathbf{x}_i$")
ylabel("$f(\mathbf{x}_i)$")
_=title("tau=%.1f" % taus[i])
```

In [8]:

```
# helper function that plots predictive distribution and data
def plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances):
# evaluate predictive distribution in this range of y-values and preallocate predictive distribution
y_values=linspace(-3,3)
D=zeros((len(y_values), len(X_test)))
# evaluate normal distribution at every prediction point (column)
for j in range(shape(D)[1]):
# create gaussian distributio instance, expects mean vector and covariance matrix, reshape
gauss=GaussianDistribution(array(means[j]).reshape(1,), array(variances[j]).reshape(1,1))
# evaluate predictive distribution for test point, method expects matrix
D[:,j]=exp(gauss.log_pdf_multiple(y_values.reshape(1,len(y_values))))
pcolor(X_test,y_values,D)
colorbar()
contour(X_test,y_values,D)
plot(X_test,y_test, 'b', linewidth=3)
plot(X_test,means, 'm--', linewidth=3)
plot(X_train, y_train, 'ro')
legend(["Truth", "Prediction", "Data"])
```

In [9]:

```
figure(figsize=(18,10))
suptitle("GP inference for different kernel widths")
for i in range(len(taus)):
subplot(len(taus),1,i+1)
# create GP instance using inference method and train
# use Shogun objects from above
inf.set_kernel(GaussianKernel(10,taus[i]))
gp=GaussianProcessRegression(inf)
gp.train()
# predict labels for all test data (note that this produces the same as the below mean vector)
means = gp.apply(feats_test)
# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
# note: y_predicted == means
# plot predictive distribution and training data
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=title("tau=%.1f" % taus[i])
```

In [10]:

```
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
gp = GaussianProcessRegression(inf)
# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)
grad.set_function(inf)
# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)
# search for best parameters and store them
best_combination = grad_search.select_model()
# apply best parameters to GP, train
best_combination.apply_to_machine(gp)
# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()
print "Selected tau (kernel bandwidth):", best_width
print "Selected gamma (kernel scaling):", best_scale
print "Selected sigma (observation noise):", best_sigma
```

Now we can output the best parameters and plot the predictive distribution for those.

In [11]:

```
# train gp
gp.train()
# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
# plot predictive distribution
figure(figsize=(18,5))
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=title("Maximum Likelihood II based inference")
```

Now the predictive distribution is very close to the true data generating process.

In binary classification, the observed data comes from a space of discrete, binary labels, i.e. $\mathbf{y}\in\mathcal{Y}^n=\{-1,+1\}^n$, which are represented via the Shogun class CBinaryLabels. To model these observations with a GP, we need a likelihood function $p(\mathbf{y}|\mathbf{f})$ that maps a set of such discrete observations to a probability, given a fixed response $\mathbf{f}$ of the Gaussian Process.

In regression, this way straight-forward, as we could simply use the response variable $\mathbf{f}$ itself, plus some Gaussian noise, which gave rise to a probability distribution. However, now that the $\mathbf{y}$ are discrete, we cannot do the same thing. We rather need a function that squashes the Gaussian response variable itself to a probability, given some data. This is a common problem in Machine Learning and Statistics and is usually done with some sort of *Sigmoid* function of the form $\sigma:\mathbb{R}\rightarrow[0,1]$. One popular choicefor such a function is the *Logit* likelihood, given by

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_i|f_i)=\prod_{i=1}^n \frac{1}{1-\exp(-y_i f_i)}.$

This likelihood is implemented in Shogun under CLogitLikelihood and using it is sometimes refered to as *logistic regression*. Using it with GPs results in non-linear Bayesian logistic regression. We can easily use the class to illustrate the sigmoid function for a 1D example and a fixed data point with label $+1$

In [12]:

```
# two classification likelihoods in Shogun
logit=LogitLikelihood()
probit=ProbitLikelihood()
# A couple of Gaussian response functions, 1-dimensional here
F=linspace(-5.0,5.0)
# Single observation label with +1
lab=BinaryLabels(array([1.0]))
# compute log-likelihood for all values in F
log_liks_logit=zeros(len(F))
log_liks_probit=zeros(len(F))
for i in range(len(F)):
# Shogun expects a 1D array for f, not a single number
f=array(F[i]).reshape(1,)
log_liks_logit[i]=logit.get_log_probability_f(lab, f)
log_liks_probit[i]=probit.get_log_probability_f(lab, f)
# in fact, loops are slow and Shogun offers a method to compute the likelihood for many f. Much faster!
log_liks_logit=logit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))
log_liks_probit=probit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))
# plot the sigmoid functions, note that Shogun computes it in log-domain, so we have to exponentiate
figure(figsize=(12, 4))
plot(F, exp(log_liks_logit))
plot(F, exp(log_liks_probit))
ylabel("$p(y_i|f_i)$")
xlabel("$f_i$")
title("Classification Likelihoods")
_=legend(["Logit", "Probit"])
```

Note how the logit function maps any input value to $[0,1]$ in a continuous way. The other plot above is for another classification likelihood is implemented in Shogun is the Gaussian CDF function

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_i|f_i)=\prod_{i=1}^n \Phi(y_i f_i),$

where $\Phi:\mathbb{R}\rightarrow [0,1]$ is the cumulative distribution function (CDF) of the standard Gaussian distribution $\mathcal{N}(0,1)$. It is implemented in the Shogun class CProbitLikelihood and using it is refered to as *probit regression*. While the Gaussian CDF has some convinient properties for integrating over it (and thus allowing some different modelling decisions), it doesn not really matter what you use in Shogun in most cases. However, for the sake of completeness, it is also potted above, being very similar to the logit likelihood.

TODO: Show a function squashed through the logit likelihood

Recall that in order to do inference, we need to solve two integrals (in addition to the Bayes rule, see above)

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},$

and

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

In classification, the second integral is not available in closed form since it is the convolution of a Gaussian, $p(\mathbf{f}|\boldsymbol{\theta})$, and a non-Gaussian, $p(\mathbf{y}|\mathbf{f})$, distribution. Therefore, we have to rely on approximations in order to compute and integrate over the posterior $p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})$. Shogun offers various standard methods from the literature to deal with this problem, including the Laplace approximation (CLaplacianInferenceMethod), Expectation Propagation (CEPInferenceMethod) for inference and evaluatiing the marginal likelihood. These two approximations give rise to a Gaussian posterior $p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})$, which can then be easily computed and integrated over (all this is done by Shogun for you).

While the Laplace approximation is quite fast, EP usually has a better accuracy, in particular if one is not just interetsed in binary decisions but also in certainty values for these predictions. Go for Laplace if interested in binary decisions, and for EP otherwise.

TODO, add references to inference methods.

In [13]:

```
def generate_classification_toy_data(n_train=100, mean_a=asarray([0, 0]), std_dev_a=1.0, mean_b=3, std_dev_b=0.5):
# positive examples are distributed normally
X1 = (random.randn(n_train, 2)*std_dev_a+mean_a).T
# negative examples have a "ring"-like form
r = random.randn(n_train)*std_dev_b+mean_b
angle = random.randn(n_train)*2*pi
X2 = array([r*cos(angle)+mean_a[0], r*sin(angle)+mean_a[1]])
# stack positive and negative examples in a single array
X_train = hstack((X1,X2))
# label positive examples with +1, negative with -1
y_train = zeros(n_train*2)
y_train[:n_train] = 1
y_train[n_train:] = -1
return X_train, y_train
def plot_binary_data(X_train, y_train):
plot(X_train[0, argwhere(y_train == 1)], X_train[1, argwhere(y_train == 1)], 'ro')
plot(X_train[0, argwhere(y_train == -1)], X_train[1, argwhere(y_train == -1)], 'bo')
```

In [14]:

```
X_train, y_train=generate_classification_toy_data()
plot_binary_data(X_train, y_train)
_=title("2D Toy classification problem")
```

In [15]:

```
# for building combinations of arrays
from itertools import product
# convert training data into Shogun representation
train_features = RealFeatures(X_train)
train_labels = BinaryLabels(y_train)
# generate all pairs in 2d range of testing data (full space), discretisation resultion is n_test
n_test=50
x1 = linspace(X_train[0,:].min()-1, X_train[0,:].max()+1, n_test)
x2 = linspace(X_train[1,:].min()-1, X_train[1,:].max()+1, n_test)
X_test = asarray(list(product(x1, x2))).T
# convert testing features into Shogun representation
test_features = RealFeatures(X_test)
# create Gaussian kernel with width = 2.0
kernel = GaussianKernel(10, 2)
# create zero mean function
zero_mean = ZeroMean()
# you can easily switch between probit and logit likelihood models
# by uncommenting/commenting the following lines:
# create probit likelihood model
# lik = ProbitLikelihood()
# create logit likelihood model
lik = LogitLikelihood()
# you can easily switch between Laplace and EP approximation by
# uncommenting/commenting the following lines:
# specify Laplace approximation inference method
#inf = LaplacianInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)
# specify EP approximation inference method
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)
# create and train GP classifier, which uses Laplace approximation
gp = GaussianProcessClassification(inf)
gp.train()
test_labels=gp.apply(test_features)
# plot data and decision boundary
plot_binary_data(X_train, y_train)
pcolor(x1, x2, test_labels.get_labels().reshape(n_test, n_test))
_=title('Decision boundary')
```

In [16]:

```
# obtain probabilities for
p_test = gp.get_probabilities(test_features)
# create figure
title('Training data, predictive probability and decision boundary')
# plot training data
plot_binary_data(X_train, y_train)
# plot decision boundary
contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))
# plot probabilities
pcolor(x1, x2, p_test.reshape(n_test, n_test))
_=colorbar()
```

In [17]:

```
# generate some non-negative kernel widths
widths=2**linspace(-5,6,20)
# compute marginal likelihood under Laplace apprixmation for every width
# use Shogun objects from above
marginal_likelihoods=zeros(len(widths))
for i in range(len(widths)):
# note that GP training is automatically done/updated if a parameter is changed. No need to call train again
kernel.set_width(widths[i])
marginal_likelihoods[i]=-inf.get_negative_log_marginal_likelihood()
# plot marginal likelihoods as a function of kernel width
plot(log2(widths), marginal_likelihoods)
title("Log Marginal likelihood for different kernels")
xlabel("Kernel Width in log-scale")
_=ylabel("Log-Marginal Likelihood")
print "Width with largest marginal likelihood:", widths[marginal_likelihoods.argmax()]
```

*maximum likelihood II*. Let's have a closer look.

First, let us have a look at the predictive distributions of some of the above kernel widths

In [18]:

```
# again, use Shogun objects from above, but a few extremal widths
widths_subset=array([widths[0], widths[marginal_likelihoods.argmax()], widths[len(widths)-1]])
figure(figsize=(18, 5))
for i in range(len(widths_subset)):
subplot(1,len(widths_subset),i+1)
kernel.set_width(widths_subset[i])
# obtain and plot predictive distribution
p_test = gp.get_probabilities(test_features)
title_str="Width=%.2f, " % widths_subset[i]
if i is 0:
title_str+="too complex, overfitting"
elif i is 1:
title_str+="just right"
else:
title_str+="too smooth, underfitting"
title(title_str)
plot_binary_data(X_train, y_train)
contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))
pcolor(x1, x2, p_test.reshape(n_test, n_test))
_=colorbar()
```

In the above plots, it is quite clear that the maximum of the marginal likelihood corresponds to the best single setting of the parameters. To give some more intuition: The interpretation of the marginal likelihood

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$

is the probability of the data given the model parameters $\boldsymbol{\theta}$. Note that this is averaged over *all* possible configurations of the latent Gaussian variables $\mathbf{f}|\boldsymbol{\theta}$ given a fixed configuration of parameters. However, since this is probability distribution, it has to integrate to $1$. This means that models that are too complex (and thus being able to explain too many different data configutations) and models that are too simple (and thus not able to explain the current data) give rise to a small marginal likelihood. Only when the model is just complex enough to explain the data well (but not more complex), the marginal likelihood is maximised. This is an implementation of a concept called Occam's razor, and is a nice motivation why you should be Bayesian if you can -- overfitting doesn't happen that quickly.

As mentioned before, Shogun is able to automagically learn all of the hyper-parameters $\boldsymbol{\theta}$ using gradient based optimisation on the marginal likelihood (whose derivatives are computed internally). To this is, we use the class CGradientModelSelection. Note that we could also use CGridSearchModelSelection to do a standard grid-search, such as is done for Support Vector Machines. However, this is highly ineffective, in particular when the number of parameters grows. In addition, order to evaluate parameter states, we have to use the classes CGradientEvaluation, and GradientCriterion, which is also much cheaper than the usual CCrossValidation, since it just evaluates the gradient of the marginal likelihood rather than performing many training and testing runs. This is another very nice motivation for using Gaussian Processes: optimising parameters is much easier. In the following, we demonstrate how to select *all* parameters of the used model. In Shogun, parameter configurations (corresponding to $\boldsymbol{\theta}$ are stored in instances of CParameterCombination, which can be applied to machines.

This approach is known as *maximum likelihood II* (the 2 is for the second level, averaging over all possible $\mathbf{f}|\boldsymbol{\theta}$), or *evidence maximisation*.

In [19]:

```
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)
gp = GaussianProcessClassification(inf)
# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, train_features, train_labels, GradientCriterion(), False)
grad.set_function(inf)
# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)
# search for best parameters and store them
best_combination = grad_search.select_model()
# apply best parameters to GP
best_combination.apply_to_machine(gp)
# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
print "Selected kernel bandwidth:", best_width
print "Selected kernel scale:", best_scale
```

In [20]:

```
# train gp
gp.train()
# visualise predictive distribution
p_test = gp.get_probabilities(test_features)
plot_binary_data(X_train, y_train)
contour(x1, x2, reshape(p_test, (n_test, n_test)), levels=[0.5], colors=('black'))
pcolor(x1, x2, p_test.reshape(n_test, n_test))
_=colorbar()
```

In [21]:

```
# parameter space, increase resolution if you want finer plots, takes long though
resolution=5
widths=2**linspace(-4,10,resolution)
scales=2**linspace(-5,10,resolution)
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = EPInferenceMethod(kernel, train_features, zero_mean, train_labels, lik)
gp = GaussianProcessClassification(inf)
# compute marginal likelihood for every parameter combination
# use Shogun objects from above
marginal_likelihoods=zeros((len(widths), len(scales)))
for i in range(len(widths)):
for j in range(len(scales)):
kernel.set_width(widths[i])
inf.set_scale(scales[j])
marginal_likelihoods[i,j]=-inf.get_negative_log_marginal_likelihood()
# contour plot of marginal likelihood as a function of kernel width and scale
contour(log2(widths), log2(scales), marginal_likelihoods)
colorbar()
xlabel("Kernel width (log-scale)")
ylabel("Kernel scale (log-scale)")
_=title("Log Marginal Likelihood")
# plot our found best parameters
_=plot([log2(best_width)], [log2(best_scale)], 'r*', markersize=20)
```

One "problem" with the classical method of Gaussian Process based inference is the computational complexity of $\mathcal{O}(n^3)$, where $n$ is the number of training examples. This is caused by matrix inversion, Cholesky factorization, etc. Up to a few thousand points, this is feasible. You will quickly run into memory and runtime problems for very large problems.

One way of approaching very large problems is called *Fully Independent Training Components*, which is a low-rank plus diagonal approximation to the exact covariance. The rough idea is to specify a set of $m\ll n$ *inducing points* and to base all computations on the covariance between training/test and inducing points only, which intuitively corresponds to combining various training points around an inducing point. This reduces the computational complexity to $\mathcal{O}(nm^2)$, where again $n$ is the number of training points, and $m$ is the number of inducing points. This is quite a significant decrease, in particular if the number of inducing points is much smaller than the number of examples.

The optimal way to specify inducing points is to densely and uniformly place them in the input space. However, this might be quickly become infeasible in high dimensions. In this case, a random subset of the training data might be a good idea.

In Shogun, the class CFITCInferenceMethod handles inference for regression with the CGaussianLikelihood. Below, we demonstrate its usage on a toy example and compare to exact regression. Note that CGradientModelSelection still works as before. We compare the runtime for inference with both GP.

First, note that changing the inference method only requires the change of a single line of code

In [22]:

```
# for measuring runtime
import time
# simple regression data
X_train, y_train, X_test, y_test = generate_regression_toy_data(n=1000)
# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)
# inducing features (here: a random grid over the input space, try out others)
n_inducing=10
#X_inducing=linspace(X_train.min(), X_train.max(), n_inducing)
X_inducing=rand(X_train.min()+n_inducing)*X_train.max()
feats_inducing=RealFeatures(X_inducing.reshape(1,len(X_inducing)))
# create FITC inference method and GP instance
inf = FITCInferenceMethod(GaussianKernel(10, 1), feats_train, ZeroMean(), labels_train, \
GaussianLikelihood(), feats_inducing)
gp = GaussianProcessRegression(inf)
# and of course, this can be combined with ML2 model-selection
grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)
grad.set_function(inf)
grad_search = GradientModelSelection(grad)
# search for best parameters and store them
best_combination = grad_search.select_model(True)
# apply best parameters to GP, train
best_combination.apply_to_machine(gp)
# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()
# for comparison, create exact GP with same parameters
# also re-create old GP for fair time measurements
# FITC GP
start=time.time()
inf = FITCInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \
GaussianLikelihood(best_sigma), feats_inducing)
inf.set_scale(best_scale)
gp = GaussianProcessRegression(inf)
gp.train()
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
print "FITC inference took %.2f seconds" % (time.time()-start)
# exact GP
start=time.time()
inf_exact = ExactInferenceMethod(GaussianKernel(10, best_width), feats_train, ZeroMean(), labels_train, \
GaussianLikelihood(best_sigma))
inf_exact.set_scale(best_scale)
gp_exact = GaussianProcessRegression(inf_exact)
gp_exact.train()
means_exact = gp_exact.get_mean_vector(feats_test)
variances_exact = gp_exact.get_variance_vector(feats_test)
print "Exact inference took %.2f seconds" % (time.time()-start)
# comparison plot FITC and exact inference, plot 95% confidence of both predictive distributions
figure(figsize=(18,5))
plot(X_test, y_test, color="black", linewidth=3)
plot(X_test, means, 'r--', linewidth=3)
plot(X_test, means_exact, 'b--', linewidth=3)
plot(X_train, y_train, 'ro')
plot(X_inducing, zeros(len(X_inducing)), 'g*', markersize=15)
# tube plot of 95% confidence
error=1.96*sqrt(variances)
plot(X_test,means-error, color='red', alpha=0.3, linewidth=3)
fill_between(X_test,means-error,means+error,color='red', alpha=0.3)
error_exact=1.96*sqrt(variances_exact)
plot(X_test,means_exact-error_exact, color='blue', alpha=0.3, linewidth=3)
fill_between(X_test,means_exact-error_exact,means_exact+error_exact,color='blue', alpha=0.3)
# plot upper confidence lines later due to legend
plot(X_test,means+error, color='red', alpha=0.3, linewidth=3)
plot(X_test,means_exact+error_exact, color='blue', alpha=0.3, linewidth=3)
legend(["True", "FITC prediction", "Exact prediction", "Data", "Inducing points", "95% FITC", "95% Exact"])
_=title("Comparison FITC and Exact Regression")
```

- Overview of all base classes for GPs
- Excurs: Automatic relevance determination
- Real life examples for regression and classification, including classification of sequences
- Fully Bayesian treatment of \theta

In [23]:

```
```