Based on the model selection framework of his Google summer of code 2011 project | Saurabh Mahindre - github.com/Saurabh7 as a part of Google Summer of Code 2014 project mentored by - Heiko Strathmann

In [1]:

```
%pylab inline
%matplotlib inline
# include all Shogun classes
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from modshogun import *
# generate some ultra easy training data
gray()
n=20
title('Toy data for binary classification')
X=hstack((randn(2,n), randn(2,n)+1))
Y=hstack((-ones(n), ones(n)))
_=scatter(X[0], X[1], c=Y , s=100)
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Class 1", "Class 2"], loc=2)
# training data in Shogun representation
features=RealFeatures(X)
labels=BinaryLabels(Y)
```

In [2]:

```
k=5
normal_split=CrossValidationSplitting(labels, k)
```

In [3]:

```
stratified_split=StratifiedCrossValidationSplitting(labels, k)
```

Let us visualize the generated folds on the toy data.

In [4]:

```
split_strategies=[stratified_split, normal_split]
```

In [5]:

```
#code to visualize splitting
def get_folds(split, num):
split.build_subsets()
x=[]
y=[]
lab=[]
for j in range(num):
indices=split.generate_subset_indices(j)
x_=[]
y_=[]
lab_=[]
for i in range(len(indices)):
x_.append(X[0][indices[i]])
y_.append(X[1][indices[i]])
lab_.append(Y[indices[i]])
x.append(x_)
y.append(y_)
lab.append(lab_)
return x, y, lab
def plot_folds(split_strategies, num):
for i in range(len(split_strategies)):
x, y, lab=get_folds(split_strategies[i], num)
figure(figsize=(18,4))
gray()
suptitle(split_strategies[i].get_name(), fontsize=12)
for j in range(0, num):
subplot(1, num, (j+1), title='Fold %s' %(j+1))
scatter(x[j], y[j], c=lab[j], s=100)
_=plot_folds(split_strategies, 4)
```

Following the example from above, we will tune the performance of a SVM on the binary classification problem. We will

- demonstrate how to evaluate a loss function or metric on a given algorithm
- then learn how to estimate this metric for the algorithm performing on unseen data
- and finally use those techniques to tune the parameters to obtain the best possible results.

The involved methods are

- LibSVM as the binary classification algorithms
- the area under the ROC curve (AUC) as performance metric
- three different kernels to compare

In [6]:

```
# define SVM with a small rbf kernel (always normalise the kernel!)
C=1
kernel=GaussianKernel(2, 0.001)
kernel.init(features, features)
kernel.set_normalizer(SqrtDiagKernelNormalizer())
classifier=LibSVM(C, kernel, labels)
# train
_=classifier.train()
```

In [7]:

```
# instanciate a number of Shogun performance measures
metrics=[ROCEvaluation(), AccuracyMeasure(), ErrorRateMeasure(), F1Measure(), PrecisionMeasure(), RecallMeasure(), SpecificityMeasure()]
for metric in metrics:
print metric.get_name(), metric.evaluate(classifier.apply(features), labels)
```

Now, the training error is zero. This seems good at first. But is this setting of the parameters a good idea? No! A good performance on the training data alone does not mean anything. A simple look up table is able to produce zero error on training data. What we want is that our methods generalises the input data somehow to perform well on unseen data. We will now use cross-validation to estimate the performance on such.

We will use CStratifiedCrossValidationSplitting, which accepts a reference to the labels and the number of partitions as parameters. This instance is then passed to the class CCrossValidation, which does the estimation using the desired splitting strategy. The latter class can take all algorithms that are implemented against the CMachine interface.

In [8]:

```
metric=AccuracyMeasure()
cross=CrossValidation(classifier, features, labels, stratified_split, metric)
# perform the cross-validation, note that this call involved a lot of computation
result=cross.evaluate()
# the result needs to be casted to CrossValidationResult
result=CrossValidationResult.obtain_from_generic(result)
# this class contains a field "mean" which contain the mean performance metric
print "Testing", metric.get_name(), result.mean
```

Now this is incredibly bad compared to the training error. In fact, it is very close to random performance (0.5). The lesson: Never judge your algorithms based on the performance on training data!

Note that for small data sizes, the cross-validation estimates are quite noisy. If we run it multiple times, we get different results.

In [9]:

```
print "Testing", metric.get_name(), [CrossValidationResult.obtain_from_generic(cross.evaluate()).mean for _ in range(10)]
```

In [10]:

```
# 25 runs and 95% confidence intervals
cross.set_num_runs(25)
# perform x-validation (now even more expensive)
cross.evaluate()
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
print "Testing cross-validation mean %.2f " \
% (result.mean)
```

In [11]:

```
widths=2**linspace(-5,25,10)
results=zeros(len(widths))
for i in range(len(results)):
kernel.set_width(widths[i])
result=CrossValidationResult.obtain_from_generic(cross.evaluate())
results[i]=result.mean
plot(log2(widths), results, 'blue')
xlabel("log2 Kernel width")
ylabel(metric.get_name())
_=title("Accuracy for different kernel widths")
print "Best Gaussian kernel width %.2f" % widths[results.argmax()], "gives", results.max()
# compare this with a linear kernel
classifier.set_kernel(LinearKernel())
lin_k=CrossValidationResult.obtain_from_generic(cross.evaluate())
plot([log2(widths[0]), log2(widths[len(widths)-1])], [lin_k.mean,lin_k.mean], 'r')
# please excuse this horrible code :)
print "Linear kernel gives", lin_k.mean
_=legend(["Gaussian", "Linear"], loc="lower center")
```

In [12]:

```
feats=RealFeatures(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/fm_housing.dat')))
labels=RegressionLabels(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/housing_label.dat')))
preproc=RescaleFeatures()
preproc.init(feats)
feats.add_preprocessor(preproc)
feats.apply_preprocessor(True)
#Regression models
ls=LeastSquaresRegression(feats, labels)
tau=1
rr=LinearRidgeRegression(tau, feats, labels)
width=1
tau=1
kernel=GaussianKernel(feats, feats, width)
kernel.set_normalizer(SqrtDiagKernelNormalizer())
krr=KernelRidgeRegression(tau, kernel, labels)
regression_models=[ls, rr, krr]
```

In [13]:

```
n=30
taus = logspace(-4, 1, n)
#5-fold cross-validation
k=5
split=CrossValidationSplitting(labels, k)
metric=MeanSquaredError()
cross=CrossValidation(rr, feats, labels, split, metric)
cross.set_num_runs(50)
errors=[]
for tau in taus:
#set necessary parameter
rr.set_tau(tau)
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
#Enlist mean error for all runs
errors.append(result.mean)
figure(figsize=(20,6))
suptitle("Finding best (tau) parameter using cross-validation", fontsize=12)
p=subplot(121)
title("Ridge Regression")
plot(taus, errors, linewidth=3)
p.set_xscale('log')
p.set_ylim([0, 80])
xlabel("Taus")
ylabel("Mean Squared Error")
cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(50)
errors=[]
for tau in taus:
krr.set_tau(tau)
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
#print tau, "error", result.mean
errors.append(result.mean)
p2=subplot(122)
title("Kernel Ridge regression")
plot(taus, errors, linewidth=3)
p2.set_xscale('log')
xlabel("Taus")
_=ylabel("Mean Squared Error")
```

In [14]:

```
n=50
widths=logspace(-2, 3, n)
krr.set_tau(0.1)
metric=MeanSquaredError()
k=5
split=CrossValidationSplitting(labels, k)
cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(10)
errors=[]
for width in widths:
kernel.set_width(width)
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
#print width, "error", result.mean
errors.append(result.mean)
figure(figsize=(15,5))
p=subplot(121)
title("Finding best width using cross-validation")
plot(widths, errors, linewidth=3)
p.set_xscale('log')
xlabel("Widths")
_=ylabel("Mean Squared Error")
```

In [15]:

```
n=40
taus = logspace(-3, 0, n)
widths=logspace(-1, 4, n)
cross=CrossValidation(krr, feats, labels, split, metric)
cross.set_num_runs(1)
x, y=meshgrid(taus, widths)
grid=array((ravel(x), ravel(y)))
print grid.shape
errors=[]
for i in range(0, n*n):
krr.set_tau(grid[:,i][0])
kernel.set_width(grid[:,i][1])
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
errors.append(result.mean)
errors=array(errors).reshape((n, n))
```

In [16]:

```
from mpl_toolkits.mplot3d import Axes3D
#taus = logspace(0.5, 1, n)
jet()
fig=figure(figsize(15,7))
ax=subplot(121)
c=pcolor(x, y, errors)
_=contour(x, y, errors, linewidths=1, colors='black')
_=colorbar(c)
xlabel('Taus')
ylabel('Widths')
ax.set_xscale('log')
ax.set_yscale('log')
ax1=fig.add_subplot(122, projection='3d')
ax1.plot_wireframe(log10(y),log10(x), errors, linewidths=2, alpha=0.6)
ax1.view_init(30,-40)
xlabel('Taus')
ylabel('Widths')
_=ax1.set_zlabel('Error')
```

In [17]:

```
#use the best parameters
rr.set_tau(1)
krr.set_tau(0.05)
kernel.set_width(2)
title_='Performance on Boston Housing dataset'
print "%50s" %title_
for machine in regression_models:
metric=MeanSquaredError()
cross=CrossValidation(machine, feats, labels, split, metric)
cross.set_num_runs(25)
result=cross.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
print "-"*80
print "|", "%30s" % machine.get_name(),"|", "%20s" %metric.get_name(),"|","%20s" %result.mean ,"|"
print "-"*80
```

`build_values()`

method.

In [18]:

```
#Root
param_tree_root=ModelSelectionParameters()
#Parameter tau
tau=ModelSelectionParameters("tau")
param_tree_root.append_child(tau)
# also R_LINEAR/R_LOG is available as type
min_value=0.01
max_value=1
type_=R_LINEAR
step=0.05
base=2
tau.build_values(min_value, max_value, type_, step, base)
```

In [19]:

```
#kernel object
param_gaussian_kernel=ModelSelectionParameters("kernel", kernel)
gaussian_kernel_width=ModelSelectionParameters("log_width")
gaussian_kernel_width.build_values(0.1, 6.0, R_LINEAR, 0.5, 2.0)
#kernel parameter
param_gaussian_kernel.append_child(gaussian_kernel_width)
param_tree_root.append_child(param_gaussian_kernel)
```

In [20]:

```
# cross validation instance used
cross_validation=CrossValidation(krr, feats, labels, split, metric)
cross_validation.set_num_runs(1)
```

In [21]:

```
# model selection instance
model_selection=GridSearchModelSelection(cross_validation, param_tree_root)
```

In [22]:

```
print_state=False
# TODO: enable it once crossval has been fixed
#best_parameters=model_selection.select_model(print_state)
#best_parameters.apply_to_machine(krr)
#best_parameters.print_tree()
result=cross_validation.evaluate()
result=CrossValidationResult.obtain_from_generic(result)
print 'Error with Best parameters:', result.mean
```