In [1]:

```
%matplotlib inline
# import all shogun classes
from modshogun import *
import random
import numpy as np
import matplotlib.pyplot as plt
from math import exp
```

This notebook is about Bayesian regression models with Gaussian Process priors. A Gaussian Process (GP) over real valued functions on some domain $\mathcal{X}$, $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, written as

$\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}')),$

defines a distribution over real valued functions with mean value $m(\mathbf{x})=\mathbb{E}[f(\mathbf{x})]$ and inter-function covariance $k(\mathbf{x},\mathbf{x}')=\mathbb{E}[(f(\mathbf{x})-m(\mathbf{x}))^T(f(\mathbf{x})-m(\mathbf{x})]$. This intuitively means tha the function value at any point $\mathbf{x}$, i.e., $f(\mathbf{x})$ is a random variable with mean $m(\mathbf{x})$; if you take the average of infinitely many functions from the Gaussian Process, and evaluate them at $\mathbf{x}$, you will get this value. Similarily, the function values at two different points $\mathbf{x}, \mathbf{x}'$ have covariance $k(\mathbf{x}, \mathbf{x}')$. The formal definition is that Gaussian Process is a collection of random variables (may be infinite) of which any finite subset have a joint Gaussian distribution.

One can model data with Gaussian Processes via defining a joint distribution over

- $n$ data (labels in Shogun) $\mathbf{y}\in \mathcal{Y}^n$, from a $n$ dimensional continous (regression) or discrete (classification) space. These data correspond to $n$ covariates $\mathbf{x}_i\in\mathcal{X}$ (features in Shogun) from the input space $\mathcal{X}$.
- Hyper-parameters $\boldsymbol{\theta}$ which depend on the used model (details follow).
- Latent Gaussian variables $\mathbf{f}\in\mathbb{R}^n$, coming from a GP, i.e., they have a joint Gaussian distribution. Every entry $f_i$ corresponds to the GP function $f(\mathbf{x_i})$ evaluated at covariate $\mathbf{x}_i$ for $1\leq i \leq n$.

The joint distribution takes the form

$p(\mathbf{f},\mathbf{y},\theta)=p(\boldsymbol{\theta})p(\mathbf{f}|\boldsymbol{\theta})p(\mathbf{y}|\mathbf{f}),$

where $\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta)$ is the joint Gaussian distribution for the GP variables, with mean $\mathbf{m}_\boldsymbol{\theta}$ and covariance $\mathbf{C}_\theta$. The $(i,j)$-th entriy of $\mathbf{C}_\boldsymbol{\theta}$ is given by the covariance or kernel between the $(i,j)$-th covariates $k(\mathbf{x}_i, \mathbf{x}_j)$. Examples for kernel and mean functions are given later in the notebook.

Mean and covariance are both depending on hyper-parameters coming from a prior distribution $\boldsymbol{\theta}\sim p(\boldsymbol{\theta})$. The data itself $\mathbf{y}\in \mathcal{Y}^n$ (no assumptions on $\mathcal{Y}$ for now) is modelled by a likelihood function $p(\mathbf{y}|\mathbf{f})$, which gives the probability of the data $\mathbf{y}$ given a state of the latent Gaussian variables $\mathbf{f}$, i.e. $p(\mathbf{y}|\mathbf{f}):\mathcal{Y}^n\rightarrow [0,1]$.

In order to do inference for a new, unseen covariate $\mathbf{x}^*\in\mathcal{X}$, i.e., predicting its label $y^*\in\mathcal{Y}$ or in particular computing the predictive distribution for that label, we have integrate over the posterior over the latent Gaussian variables (assume fixed $\boldsymbol{\theta}$ for now, which means you can just ignore the symbol in the following if you want),

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}.$

This posterior, $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$, can be obtained using standard Bayes-Rule as

$p(\mathbf{f}|\mathbf{y},\boldsymbol{\theta})=\frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})}{p(\mathbf{y}|\boldsymbol{\theta})},$

with the so called *evidence* or *marginal likelihood* $p(\mathbf{y}|\boldsymbol{\theta})$ given as another integral over the prior over the latent Gaussian variables

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

In order to solve the above integrals, Shogun offers a variety of approximations. Don't worry, you will not have to deal with these nasty integrals on your own, but everything is hidden within Shogun. Though, if you like to play with these objects, you will be able to compute only parts.

Note that in the above description, we did not make any assumptions on the input space $\mathcal{X}$. As long as you define mean and covariance functions, and a likelihood, your data can have *any* form you like. Shogun in fact is able to deal with standard dense numerical data, sparse data, and strings of *any* type, and many more *out of the box*. We will provide some examples below.

To gain some intuition how these latent Gaussian variables behave, and how to model data with them, see the regression part of this notebook.

Bayesian regression with Gaussian Processes is among the most fundamental applications of latent Gaussian models. As usual, the oberved data come from a contintous space, i.e. $\mathbf{y}\in\mathbb{R}^n$, which is represented in the Shogun class CRegressionLabels. We assume that these observations come from some distribution $p(\mathbf{y}|\mathbf{f)}$ that is based on a fixed state of latent Gaussian response variables $\mathbf{f}\in\mathbb{R}^n$. In fact, we assume that the true model *is* the latent Gaussian response variable (which defined a *distribution* over functions; plus some Gaussian observation noise which is modelled by the likelihood as

$p(\mathbf{y}|\mathbf{f})=\mathcal{N}(\mathbf{f},\sigma^2\mathbf{I})$

This simple likelihood is implemented in the Shogun class CGaussianLikelihood. It is the well known bell curve. Below, we plot the likelihood as a function of $\mathbf{y}$, for $n=1$.

In [2]:

```
# plot likelihood for three different noise lebels $\sigma$ (which is not yet squared)
sigmas=np.array([0.5,1,2])
# likelihood instance
lik=GaussianLikelihood()
# A set of labels to consider
lab=RegressionLabels(np.linspace(-4.0,4.0, 200))
# A single 1D Gaussian response function, repeated once for each label
# this avoids doing a loop in python which would be slow
F=np.zeros(lab.get_num_labels())
# plot likelihood for all observations noise levels
plt.figure(figsize=(12, 4))
for sigma in sigmas:
# set observation noise, this is squared internally
lik.set_sigma(sigma)
# compute log-likelihood for all labels
log_liks=lik.get_log_probability_f(lab, F)
# plot likelihood functions, exponentiate since they were computed in log-domain
plt.plot(lab.get_labels(), map(exp,log_liks))
plt.ylabel("$p(y_i|f_i)$")
plt.xlabel("$y_i$")
plt.title("Regression Likelihoods for different observation noise levels")
_=plt.legend(["sigma=$%.1f$" % sigma for sigma in sigmas])
```

Apart from its apealling form, this curve has the nice property of given rise to *analytical* solutions to the required integrals. Recall these are given by

$p(y^*|\mathbf{y}, \boldsymbol{\theta})=\int p(\mathbf{y}^*|\mathbf{f})p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta},$

and

$p(\mathbf{y}|\boldsymbol{\theta})=\int p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\boldsymbol{\theta})d\mathbf{f}|\boldsymbol{\theta}$.

Since all involved elements, the likelihood $p(\mathbf{y}|\mathbf{f})$, the GP prior $p(\mathbf{f}|\boldsymbol{\theta})$ are Gaussian, the same follows for the GP posterior $p(\mathbf{f}|\mathbf{y}, \boldsymbol{\theta})$, and the marginal likelihood $p(\mathbf{y}|\boldsymbol{\theta})$. Therefore, we just need to sit down with pen and paper to derive the resulting forms of the Gaussian distributions of these objects (see references). Luckily, everything is already implemented in Shogun.

In order to get some intuition about Gaussian Processes in general, let us first have a look at these latent Gaussian variables, which define a probability distribution over real values functions $f(\mathbf{x}):\mathcal{X} \rightarrow \mathbb{R}$, where in the regression case, $\mathcal{X}=\mathbb{R}$.

As mentioned above, the joint distribution of a finite number (say $n$) of variables $\mathbf{f}\in\mathbb{R}^n$ from a Gaussian Process $\mathcal{GP}(m(\mathbf{x}), k(\mathbf{x},\mathbf{x}'))$, takes the form

$\mathbf{f}|\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{m}_\theta, \mathbf{C}_\theta),$

where $\mathbf{m}_\theta$ is the mean function's mean and $\mathbf{C}_\theta$ is the pairwise covariance or kernel matrix of the input covariates $\mathbf{x}_i$. This means, we can easily sample function realisations $\mathbf{f}^{(j)}$ from the Gaussian Process, and more important, visualise them.

To this end, let us consider the well-known and often used *Gaussian Kernel* or *squared exponential covariance*, which is implemented in the Shogun class CGaussianKernel in the parametric from (note that there are other forms in the literature)

$ k(\mathbf{x}, \mathbf{x}')=\exp\left( -\frac{||\mathbf{x}-\mathbf{x}'||_2^2}{\tau}\right),$

where $\tau$ is a hyper-parameter of the kernel. We will also use the constant CZeroMean mean function, which is suitable if the data's mean is zero (which can be achieved via removing it).

Let us consider some toy regression data in the form of a sine wave, which is observed at random points with some observations noise.

In [3]:

```
def generate_regression_toy_data(n=50, n_test=100, x_range=15, x_range_test=20, noise_var=0.4):
# training and test sine wave, test one has more points
X_train = np.random.rand(n)*x_range
X_test = np.linspace(0,x_range_test, 500)
# add noise to training observations
y_test = np.sin(X_test)
y_train = np.sin(X_train)+np.random.randn(n)*noise_var
return X_train, y_train, X_test, y_test
```

In [4]:

```
X_train, y_train, X_test, y_test = generate_regression_toy_data()
plt.figure(figsize=(16,4))
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test)
plt.legend(["Noisy observations", "True model"])
plt.title("One-Dimensional Toy Regression Data")
plt.xlabel("$\mathbf{x}$")
_=plt.ylabel("$\mathbf{y}$")
```

In [5]:

```
# bring data into shogun representation (features are 2d-arrays, organised as column vectors)
feats_train=RealFeatures(X_train.reshape(1,len(X_train)))
feats_test=RealFeatures(X_test.reshape(1,len(X_test)))
labels_train=RegressionLabels(y_train)
# compute covariances for different kernel parameters
taus=np.asarray([.1,4.,32.])
Cs=np.zeros(((len(X_train), len(X_train), len(taus))))
for i in range(len(taus)):
# compute unscalled kernel matrix (first parameter is maximum size in memory and not very important)
kernel=GaussianKernel(10, taus[i])
kernel.init(feats_train, feats_train)
Cs[:,:,i]=kernel.get_kernel_matrix()
# plot
plt.figure(figsize=(16,5))
for i in range(len(taus)):
plt.subplot(1,len(taus),i+1)
plt.imshow(Cs[:,:,i], interpolation="nearest")
plt.xlabel("Covariate index")
plt.ylabel("Covariate index")
_=plt.title("tau=%.1f" % taus[i])
```

In [6]:

```
plt.figure(figsize=(16,5))
plt.suptitle("Random Samples from GP prior")
for i in range(len(taus)):
plt.subplot(1,len(taus),i+1)
# sample a bunch of latent functions from the Gaussian Process
# note these vectors are stored row-wise
F=Statistics.sample_from_gaussian(np.zeros(len(X_train)), Cs[:,:,i], 3)
for j in range(len(F)):
# sort points to connect the dots with lines
sorted_idx=X_train.argsort()
plt.plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)
plt.xlabel("$\mathbf{x}_i$")
plt.ylabel("$f(\mathbf{x}_i)$")
_=plt.title("tau=%.1f" % taus[i])
```

Note how the functions are exactly evaluated at the training covariates $\mathbf{x}_i$ which are randomly distributed on the x-axis. Even though these points do not visualise the full functions (we can only evaluate them at a *finite* number of points, but we connected the points with lines to make it more clear), this reveils that larger values of the kernel bandwidth $\tau$ lead to smoother latent Gaussian functions.

In the above plots all functions are equally possible. That is, the prior of the latent Gaussian variables $\mathbf{f}|\boldsymbol{\theta}$ does not favour any particular function setups. Computing the posterior given our training data, the distribution ober $\mathbf{f}|\mathbf{y},\boldsymbol{\theta}$ then corresponds to restricting the above distribution over functions to those that explain the training data (up to observation noise). We will now use the Shogun class CExactInferenceMethod to do exactly this. The class is the general basis of exact GP regression in Shogun. We have to define all parts of the Gaussian Process for the inference method.

In [7]:

```
plt.figure(figsize=(16,5))
plt.suptitle("Random Samples from GP posterior")
for i in range(len(taus)):
plt.subplot(1,len(taus),i+1)
# create inference method instance with very small observation noise to make
inf=ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
C_post=inf.get_posterior_covariance()
m_post=inf.get_posterior_mean()
# sample a bunch of latent functions from the Gaussian Process
# note these vectors are stored row-wise
F=Statistics.sample_from_gaussian(m_post, C_post, 5)
for j in range(len(F)):
# sort points to connect the dots with lines
sorted_idx=sorted(range(len(X_train)),key=lambda x:X_train[x])
plt.plot(X_train[sorted_idx], F[j,sorted_idx], '-', markersize=6)
plt.plot(X_train, y_train, 'r*')
plt.xlabel("$\mathbf{x}_i$")
plt.ylabel("$f(\mathbf{x}_i)$")
_=plt.title("tau=%.1f" % taus[i])
```

In [8]:

```
# helper function that plots predictive distribution and data
def plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances):
# evaluate predictive distribution in this range of y-values and preallocate predictive distribution
y_values=np.linspace(-3,3)
D=np.zeros((len(y_values), len(X_test)))
# evaluate normal distribution at every prediction point (column)
for j in range(np.shape(D)[1]):
# create gaussian distributio instance, expects mean vector and covariance matrix, reshape
gauss=GaussianDistribution(np.array(means[j]).reshape(1,), np.array(variances[j]).reshape(1,1))
# evaluate predictive distribution for test point, method expects matrix
D[:,j]=np.exp(gauss.log_pdf_multiple(y_values.reshape(1,len(y_values))))
plt.pcolor(X_test,y_values,D)
plt.colorbar()
plt.contour(X_test,y_values,D)
plt.plot(X_test,y_test, 'b', linewidth=3)
plt.plot(X_test,means, 'm--', linewidth=3)
plt.plot(X_train, y_train, 'ro')
plt.legend(["Truth", "Prediction", "Data"])
```

In [9]:

```
plt.figure(figsize=(18,10))
plt.suptitle("GP inference for different kernel widths")
for i in range(len(taus)):
plt.subplot(len(taus),1,i+1)
# create GP instance using inference method and train
# use Shogun objects from above
inf.set_kernel(GaussianKernel(10,taus[i]))
gp=GaussianProcessRegression(inf)
gp.train()
# predict labels for all test data (note that this produces the same as the below mean vector)
means = gp.apply(feats_test)
# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
# note: y_predicted == means
# plot predictive distribution and training data
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=plt.title("tau=%.1f" % taus[i])
```

In [10]:

```
# re-create inference method and GP instance to start from scratch, use other Shogun structures from above
inf = ExactInferenceMethod(GaussianKernel(10, taus[i]), feats_train, ZeroMean(), labels_train, GaussianLikelihood())
gp = GaussianProcessRegression(inf)
# evaluate our inference method for its derivatives
grad = GradientEvaluation(gp, feats_train, labels_train, GradientCriterion(), False)
grad.set_function(inf)
# handles all of the above structures in memory
grad_search = GradientModelSelection(grad)
# search for best parameters and store them
best_combination = grad_search.select_model()
# apply best parameters to GP, train
best_combination.apply_to_machine(gp)
# we have to "cast" objects to the specific kernel interface we used (soon to be easier)
best_width=GaussianKernel.obtain_from_generic(inf.get_kernel()).get_width()
best_scale=inf.get_scale()
best_sigma=GaussianLikelihood.obtain_from_generic(inf.get_model()).get_sigma()
print "Selected tau (kernel bandwidth):", best_width
print "Selected gamma (kernel scaling):", best_scale
print "Selected sigma (observation noise):", best_sigma
```

Now we can output the best parameters and plot the predictive distribution for those.

In [11]:

```
# train gp
gp.train()
# extract means and variance of predictive distribution for all test points
means = gp.get_mean_vector(feats_test)
variances = gp.get_variance_vector(feats_test)
# plot predictive distribution
plt.figure(figsize=(18,5))
plot_predictive_regression(X_train, y_train, X_test, y_test, means, variances)
_=plt.title("Maximum Likelihood II based inference")
```

Now the predictive distribution is very close to the true data generating process.

In binary classification, the observed data comes from a space of discrete, binary labels, i.e. $\mathbf{y}\in\mathcal{Y}^n=\{-1,+1\}^n$, which are represented via the Shogun class CBinaryLabels. To model these observations with a GP, we need a likelihood function $p(\mathbf{y}|\mathbf{f})$ that maps a set of such discrete observations to a probability, given a fixed response $\mathbf{f}$ of the Gaussian Process.

In regression, this way straight-forward, as we could simply use the response variable $\mathbf{f}$ itself, plus some Gaussian noise, which gave rise to a probability distribution. However, now that the $\mathbf{y}$ are discrete, we cannot do the same thing. We rather need a function that squashes the Gaussian response variable itself to a probability, given some data. This is a common problem in Machine Learning and Statistics and is usually done with some sort of *Sigmoid* function of the form $\sigma:\mathbb{R}\rightarrow[0,1]$. One popular choicefor such a function is the *Logit* likelihood, given by

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_i|f_i)=\prod_{i=1}^n \frac{1}{1-\exp(-y_i f_i)}.$

This likelihood is implemented in Shogun under CLogitLikelihood and using it is sometimes refered to as *logistic regression*. Using it with GPs results in non-linear Bayesian logistic regression. We can easily use the class to illustrate the sigmoid function for a 1D example and a fixed data point with label $+1$

In [12]:

```
# two classification likelihoods in Shogun
logit=LogitLikelihood()
probit=ProbitLikelihood()
# A couple of Gaussian response functions, 1-dimensional here
F=np.linspace(-5.0,5.0)
# Single observation label with +1
lab=BinaryLabels(np.array([1.0]))
# compute log-likelihood for all values in F
log_liks_logit=np.zeros(len(F))
log_liks_probit=np.zeros(len(F))
for i in range(len(F)):
# Shogun expects a 1D array for f, not a single number
f=np.array(F[i]).reshape(1,)
log_liks_logit[i]=logit.get_log_probability_f(lab, f)
log_liks_probit[i]=probit.get_log_probability_f(lab, f)
# in fact, loops are slow and Shogun offers a method to compute the likelihood for many f. Much faster!
log_liks_logit=logit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))
log_liks_probit=probit.get_log_probability_fmatrix(lab, F.reshape(1,len(F)))
# plot the sigmoid functions, note that Shogun computes it in log-domain, so we have to exponentiate
plt.figure(figsize=(12, 4))
plt.plot(F, np.exp(log_liks_logit))
plt.plot(F, np.exp(log_liks_probit))
plt.ylabel("$p(y_i|f_i)$")
plt.xlabel("$f_i$")
plt.
```