The training of a KNN model basically does nothing but memorizing all the training points and the associated labels, which is very cheap in computation but costly in storage. The prediction is implemented by finding the K nearest neighbors of the query point, and voting. Here K is a hyper-parameter for the algorithm. Smaller values for K give the model low bias but high variance; while larger values for K give low variance but high bias.

In `SHOGUN`

, you can use CKNN to perform KNN learning. To construct a KNN machine, you must choose the hyper-parameter K and a distance function. Usually, we simply use the standard CEuclideanDistance, but in general, any subclass of CDistance could be used. For demonstration, in this tutorial we select a random subset of 1000 samples from the USPS digit recognition dataset, and run 2-fold cross validation of KNN with varying K.

First we load and init data split:

In [1]:

```
import numpy as np
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from scipy.io import loadmat, savemat
from numpy import random
from os import path
mat = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))
Xall = mat['data']
Yall = np.array(mat['label'].squeeze(), dtype=np.double)
# map from 1..10 to 0..9, since shogun
# requires multiclass labels to be
# 0, 1, ..., K-1
Yall = Yall - 1
random.seed(0)
subset = random.permutation(len(Yall))
Xtrain = Xall[:, subset[:5000]]
Ytrain = Yall[subset[:5000]]
Xtest = Xall[:, subset[5000:6000]]
Ytest = Yall[subset[5000:6000]]
Nsplit = 2
all_ks = range(1, 21)
print Xall.shape
print Xtrain.shape
print Xtest.shape
```

Let us plot the first five examples of the train data (first row) and test data (second row).

In [2]:

```
%matplotlib inline
import pylab as P
def plot_example(dat, lab):
for i in xrange(5):
ax=P.subplot(1,5,i+1)
P.title(int(lab[i]))
ax.imshow(dat[:,i].reshape((16,16)), interpolation='nearest')
ax.set_xticks([])
ax.set_yticks([])
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xtrain, Ytrain)
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xtest, Ytest)
```

Then we import shogun components and convert the data to shogun objects:

In [3]:

```
from modshogun import MulticlassLabels, RealFeatures
from modshogun import KNN, EuclideanDistance
labels = MulticlassLabels(Ytrain)
feats = RealFeatures(Xtrain)
k=3
dist = EuclideanDistance()
knn = KNN(k, dist, labels)
labels_test = MulticlassLabels(Ytest)
feats_test = RealFeatures(Xtest)
knn.train(feats)
pred = knn.apply_multiclass(feats_test)
print "Predictions", pred[:5]
print "Ground Truth", Ytest[:5]
from modshogun import MulticlassAccuracy
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(pred, labels_test)
print "Accuracy = %2.2f%%" % (100*accuracy)
```

In [4]:

```
idx=np.where(pred != Ytest)[0]
Xbad=Xtest[:,idx]
Ybad=Ytest[idx]
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xbad, Ybad)
```

In [5]:

```
knn.set_k(13)
multiple_k=knn.classify_for_multiple_k()
print multiple_k.shape
```

We have the prediction for each of the 13 k's now and can quickly compute the accuracies:

In [6]:

```
for k in xrange(13):
print "Accuracy for k=%d is %2.2f%%" % (k+1, 100*np.mean(multiple_k[:,k]==Ytest))
```

So k=3 seems to have been the optimal choice.

`SHOGUN`

will use all available CPU cores to parallelize this computation it might still be slow when you have big data sets. In `SHOGUN`

, you can use *Cover Trees* to speed up the nearest neighbor searching process in KNN. Just call `set_use_covertree`

on the KNN machine to enable or disable this feature. We also show the prediction time comparison with and without Cover Tree in this tutorial. So let's just have a comparison utilizing the data above:

In [7]:

```
from modshogun import Time, KNN_COVER_TREE, KNN_BRUTE
start = Time.get_curtime()
knn.set_k(3)
knn.set_knn_solver_type(KNN_BRUTE)
pred = knn.apply_multiclass(feats_test)
print "Standard KNN took %2.1fs" % (Time.get_curtime() - start)
start = Time.get_curtime()
knn.set_k(3)
knn.set_knn_solver_type(KNN_COVER_TREE)
pred = knn.apply_multiclass(feats_test)
print "Covertree KNN took %2.1fs" % (Time.get_curtime() - start)
```

In [8]:

```
def evaluate(labels, feats, use_cover_tree=False):
from modshogun import MulticlassAccuracy, CrossValidationSplitting
import time
split = CrossValidationSplitting(labels, Nsplit)
split.build_subsets()
accuracy = np.zeros((Nsplit, len(all_ks)))
acc_train = np.zeros(accuracy.shape)
time_test = np.zeros(accuracy.shape)
for i in range(Nsplit):
idx_train = split.generate_subset_inverse(i)
idx_test = split.generate_subset_indices(i)
for j, k in enumerate(all_ks):
#print "Round %d for k=%d..." % (i, k)
feats.add_subset(idx_train)
labels.add_subset(idx_train)
dist = EuclideanDistance(feats, feats)
knn = KNN(k, dist, labels)
knn.set_store_model_features(True)
if use_cover_tree:
knn.set_knn_solver_type(KNN_COVER_TREE)
else:
knn.set_knn_solver_type(KNN_BRUTE)
knn.train()
evaluator = MulticlassAccuracy()
pred = knn.apply_multiclass()
acc_train[i, j] = evaluator.evaluate(pred, labels)
feats.remove_subset()
labels.remove_subset()
feats.add_subset(idx_test)
labels.add_subset(idx_test)
t_start = time.clock()
pred = knn.apply_multiclass(feats)
time_test[i, j] = (time.clock() - t_start) / labels.get_num_labels()
accuracy[i, j] = evaluator.evaluate(pred, labels)
feats.remove_subset()
labels.remove_subset()
return {'eout': accuracy, 'ein': acc_train, 'time': time_test}
```

Evaluate KNN with and without Cover Tree. This takes a few seconds:

In [9]:

```
labels = MulticlassLabels(Ytest)
feats = RealFeatures(Xtest)
print("Evaluating KNN...")
wo_ct = evaluate(labels, feats, use_cover_tree=False)
wi_ct = evaluate(labels, feats, use_cover_tree=True)
print("Done!")
```

Generate plots with the data collected in the evaluation:

In [10]:

```
import matplotlib
fig = P.figure(figsize=(8,5))
P.plot(all_ks, wo_ct['eout'].mean(axis=0), 'r-*')
P.plot(all_ks, wo_ct['ein'].mean(axis=0), 'r--*')
P.legend(["Test Accuracy", "Training Accuracy"])
P.xlabel('K')
P.ylabel('Accuracy')
P.title('KNN Accuracy')
P.tight_layout()
fig = P.figure(figsize=(8,5))
P.plot(all_ks, wo_ct['time'].mean(axis=0), 'r-*')
P.plot(all_ks, wi_ct['time'].mean(axis=0), 'b-d')
P.xlabel("K")
P.ylabel("time")
P.title('KNN time')
P.legend(["Plain KNN", "CoverTree KNN"], loc='center right')
P.tight_layout()
```

*learning* becomes prohibitive when the dataset is huge. Even when the memory is big enough to hold all the data, the prediction will be slow, since the distances between the query point and all the training points need to be computed and ranked. The situation becomes worse if in addition the data samples are all very high-dimensional. Leaving aside computation time issues, k-NN is a very versatile and competitive algorithm. It can be applied to any kind of objects (not just numerical data) - as long as one can design a suitable distance function. In pratice k-NN used with bagging can create improved and more robust results.

In [11]:

```
from modshogun import GaussianKernel, GMNPSVM
width=80
C=1
gk=GaussianKernel()
gk.set_width(width)
svm=GMNPSVM(C, gk, labels)
_=svm.train(feats)
```

Let's apply the SVM to the same test data set to compare results:

In [12]:

```
out=svm.apply(feats_test)
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(out, labels_test)
print "Accuracy = %2.2f%%" % (100*accuracy)
```

In [13]:

```
Xrem=Xall[:,subset[6000:]]
Yrem=Yall[subset[6000:]]
feats_rem=RealFeatures(Xrem)
labels_rem=MulticlassLabels(Yrem)
out=svm.apply(feats_rem)
evaluator = MulticlassAccuracy()
accuracy = evaluator.evaluate(out, labels_rem)
print "Accuracy = %2.2f%%" % (100*accuracy)
idx=np.where(out.get_labels() != Yrem)[0]
Xbad=Xrem[:,idx]
Ybad=Yrem[idx]
_=P.figure(figsize=(17,6))
P.gray()
plot_example(Xbad, Ybad)
```

The misclassified examples are indeed much harder to label even for human beings.