# Classification with Support Vector Machines

#### by Soeren Sonnenburg | Saurabh Mahindre - <a href="https://github.com/Saurabh7">github.com/Saurabh7</a> as a part of <a href="http://www.google-melange.com/gsoc/project/details/google/gsoc2014/saurabh7/5750085036015616">Google Summer of Code 2014 project</a> mentored by - Heiko Strathmann - <a href="https://github.com/karlnapf">github.com/karlnapf</a> - <a href="http://herrstrathmann.de/">herrstrathmann.de</a>

This notebook illustrates how to train a Support Vector Machine (SVM) classifier using Shogun. The CLibSVM class of Shogun is used to do binary classification. Multiclass classification is also demonstrated using CGMNPSVM.

### Introduction

Support Vector Machines (SVMs) are a learning method used for binary classification. The basic idea is to find a hyperplane which separates the data into its two classes. However, since example data is often not linearly separable, SVMs operate in a kernel-induced feature space, i.e., data is embedded into a higher-dimensional space where it is linearly separable.

### Linear Support Vector Machines

In a supervised learning problem, we are given a labeled set of input-output pairs $\mathcal{D}=(x_i,y_i)^N_{i=1}\subseteq \mathcal{X} \times \mathcal{Y}$ where $x_i\in\mathcal{X}$ and $y_i\in\{-1,+1\}$. An SVM is a binary classifier that tries to separate objects of different classes by finding a (hyper-)plane such that the margin between the two classes is maximized. A hyperplane in $\mathbb{R}^D$ can be parameterized by a vector $\bf{w}$ and a constant $\text b$ expressed in the equation:$${\bf w}\cdot{\bf x} + \text{b} = 0$$ Given such a hyperplane ($\bf w$,b) that separates the data, the discriminating function is: $$f(x) = \text {sign} ({\bf w}\cdot{\bf x} + {\text b})$$
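
As a minimal sketch of the discriminating function, the snippet below evaluates $\text{sign}({\bf w}\cdot{\bf x}+b)$ in plain Python. The hyperplane values `w` and `b` and the helper names `dot` and `decision` are made up for illustration; they are not part of Shogun.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision(w, b, x):
    """Return sign(w.x + b) as +1 or -1 (a score of exactly 0 is mapped to +1 here)."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 10 = 0, chosen by hand for illustration
w = [1.0, 1.0]
b = -10.0

print(decision(w, b, [3.0, 3.0]))  # point on the negative side of the line
print(decision(w, b, [7.0, 7.0]))  # point on the positive side of the line
```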

If the training data are linearly separable, we can select two parallel hyperplanes that separate the data with no points between them, and then try to maximize their distance. The region bounded by them is called "the margin". These hyperplanes can be described by the equations $$({\bf w}\cdot{\bf x} + {\text b}) = 1$$ $$({\bf w}\cdot{\bf x} + {\text b}) = -1$$ The distance between these two hyperplanes is $\frac{2}{\|\mathbf{w}\|}$, so maximizing it means minimizing $\|\mathbf{w}\|$: $$\arg\min_{(\mathbf{w},b)}\frac{1}{2}\|\mathbf{w}\|^2 \qquad\qquad(1)$$ This gives us a hyperplane that maximizes the geometric distance to the closest data points. As we also have to prevent data points from falling into the margin, we add the following constraint: for each ${i}$ either $$({\bf w}\cdot{x}_i + {\text b}) \geq 1 \quad \text{if } y_i=+1$$ or $$({\bf w}\cdot{x}_i + {\text b}) \leq -1 \quad \text{if } y_i=-1$$ which can be written compactly as $${y_i}({\bf w}\cdot{x}_i + {\text b}) \geq 1 \quad \forall i$$
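
To make the constraint and the margin width concrete, here is a small sketch that checks $y_i({\bf w}\cdot x_i + b) \geq 1$ for every point and computes $\frac{2}{\|\mathbf{w}\|}$. The toy points and the values of `w` and `b` are hand-picked assumptions, not the output of any solver.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hand-picked toy data: two positive and two negative points
points = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]]
ys = [1, 1, -1, -1]

# Hypothetical separating hyperplane
w = [0.5, 0.5]
b = 0.0

# y_i * (w . x_i + b) must be at least 1 for every training point
functional_margins = [y * (dot(w, x) + b) for x, y in zip(points, ys)]
feasible = all(m >= 1.0 for m in functional_margins)

# Distance between the hyperplanes w.x + b = 1 and w.x + b = -1
margin_width = 2.0 / math.sqrt(dot(w, w))

print(functional_margins, feasible, margin_width)
```

Points whose value is exactly 1 lie on one of the two bounding hyperplanes; larger values lie further away from the margin.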

Lagrange multipliers are used to modify equation $(1)$ and the corresponding dual of the problem can be shown to be:

\begin{eqnarray} \max_{\bf \alpha} && \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i y_i \alpha_j y_j {\bf x_i} \cdot {\bf x_j}\\ \mbox{s.t.} && \alpha_i\geq 0\\ && \sum_{i=1}^{N} \alpha_i y_i=0\\ \end{eqnarray}

From the derivation of these equations, it can be seen that the optimal hyperplane can be written as: $$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i.$$ Here most $\alpha_i$ turn out to be zero, which means that the solution is a sparse linear combination of the training data.
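
The sparse expansion of $\mathbf{w}$ can be illustrated with a tiny sketch. The three training points and the `alpha` values below are assumptions chosen by hand so that $\sum_i \alpha_i y_i = 0$ holds; they do not come from a Shogun run.

```python
# Toy training set: the third point lies far from the boundary
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
y = [1, -1, 1]

# Assumed dual variables; the far-away point gets alpha = 0
alpha = [0.25, 0.25, 0.0]

# w = sum_i alpha_i * y_i * x_i, computed per coordinate
dim = len(X[0])
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(dim)]

# Only points with non-zero alpha contribute to w
contributing = [i for i, a in enumerate(alpha) if a > 0]

# Equality constraint of the dual: sum_i alpha_i * y_i = 0
balance = sum(a * yi for a, yi in zip(alpha, y))

print(w, contributing, balance)
```

Dropping the third point would leave `w` unchanged, which is what "sparse" means here.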

### Prediction using Linear SVM

Now let us see how one can train a linear Support Vector Machine with Shogun. Two-dimensional data (with two attributes, say attribute1 and attribute2) is sampled to demonstrate the classification.

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import matplotlib.patches as patches
#To import all shogun classes
import shogun as sg
import numpy as np

#Generate some random data
X = 2 * np.random.randn(10,2)
traindata=np.r_[X + 3, X + 7].T

feats_train=sg.RealFeatures(traindata)

trainlab=np.concatenate((np.ones(10),-np.ones(10)))
labels=sg.BinaryLabels(trainlab)

# Plot the training data
plt.figure(figsize=(6,6))
plt.gray()
_=plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.title("Training Data")
plt.xlabel('attribute1')
plt.ylabel('attribute2')
p1 = patches.Rectangle((0, 0), 1, 1, fc="k")
p2 = patches.Rectangle((0, 0), 1, 1, fc="w")
plt.legend((p1, p2), ["Class 1", "Class 2"], loc=2)
plt.gray()


Liblinear, a library for large-scale linear learning focusing on SVMs, is used to do the classification. It supports different solver types.

In [2]:
#parameters to svm
#parameter C is described in a later section.
C=1
epsilon=1e-3

svm=sg.LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(sg.L2R_L2LOSS_SVC)
svm.set_epsilon(epsilon)

#train
svm.train()
w=svm.get_w()
b=svm.get_bias()


We solve ${\bf w}\cdot{\bf x} + \text{b} = 0$ to visualise the separating hyperplane. The methods get_w() and get_bias() are used to get the necessary values.

In [3]:
#solve for w.x+b=0
x1=np.linspace(-1.0, 11.0, 100)
def solve(x1):
    return -(w[0]*x1 + b)/w[1]

x2=list(map(solve, x1))

#plot
plt.figure(figsize=(6,6))
plt.gray()
plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.plot(x1,x2, linewidth=2)
plt.title("Separating hyperplane")
plt.xlabel('attribute1')
plt.ylabel('attribute2')
plt.gray()


The classifier is now applied on an X-Y grid of points to get predictions.

In [4]:
size=100
x1_=np.linspace(-5, 15, size)
x2_=np.linspace(-5, 15, size)
x, y=np.meshgrid(x1_, x2_)
#Generate X-Y grid test data
grid=sg.RealFeatures(np.array((np.ravel(x), np.ravel(y))))

#apply on test grid
predictions = svm.apply(grid)

#Distance from hyperplane
z=predictions.get_values().reshape((size, size))

#plot
plt.jet()
plt.figure(figsize=(16,6))
plt.subplot(121)
plt.title("Classification")
c=plt.pcolor(x, y, z)
plt.contour(x, y, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)
plt.gray()
plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.xlabel('attribute1')
plt.ylabel('attribute2')
plt.jet()

#Class predictions
z=predictions.get_labels().reshape((size, size))

#plot
plt.subplot(122)
plt.title("Separating hyperplane")
c=plt.pcolor(x, y, z)
plt.contour(x, y, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)
plt.gray()
plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.xlabel('attribute1')
plt.ylabel('attribute2')
plt.gray()


### SVMs using kernels

If the data set is not linearly separable, a non-linear mapping $\Phi:{\bf x} \rightarrow \Phi({\bf x}) \in \mathcal{F}$ is used. This maps the data into a higher dimensional space where it is linearly separable. The dual problem depends on the data only through the inner products ${\bf x_i}\cdot{\bf x_j}$, so it can be expressed in terms of the inner products $\Phi({\bf x_i}) \cdot \Phi({\bf x_j})$ instead. Since $\Phi({\bf x_i})$ occurs only in dot products with $\Phi({\bf x_j})$, it is sufficient to know the formula (kernel function): $$k({\bf x_i, x_j} ) = \Phi({\bf x_i}) \cdot \Phi({\bf x_j})$$ without dealing with the mapping directly. The transformed optimisation problem is:

\begin{eqnarray*} \max_{\bf \alpha} && \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i y_i \alpha_j y_j k({\bf x_i}, {\bf x_j})\\ \mbox{s.t.} && \alpha_i\geq 0\\ && \sum_{i=1}^{N} \alpha_i y_i=0 \qquad\qquad(2)\\ \end{eqnarray*}
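
A quick way to convince oneself of the kernel idea is to compare an explicit feature map with its kernel. The sketch below uses the homogeneous degree-2 polynomial kernel $k({\bf x},{\bf z})=({\bf x}\cdot{\bf z})^2$ in two dimensions, whose feature map is $\Phi({\bf x})=(x_1^2, \sqrt{2}x_1x_2, x_2^2)$. The helper names `phi` and `poly2_kernel` and the sample points are illustrative assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly2_kernel(x, z):
    """Kernel value computed directly in input space."""
    return dot(x, z) ** 2


x = [1.0, 2.0]
z = [3.0, -1.0]

explicit = dot(phi(x), phi(z))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)    # same quantity without ever building phi

print(explicit, implicit)
```

Both numbers agree up to floating-point rounding, yet the kernel version never constructs the higher-dimensional vectors.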

### Kernels in Shogun

Shogun provides many options for the above mentioned kernel functions. CKernel is the base class for kernels. Some commonly used kernels are:

• Gaussian kernel : Popular Gaussian kernel computed as $k({\bf x},{\bf x'})= \exp\left(-\frac{\|{\bf x}-{\bf x'}\|^2}{\tau}\right)$

• Linear kernel : Computes $k({\bf x},{\bf x'})= {\bf x}\cdot {\bf x'}$

• Polynomial kernel : Polynomial kernel computed as $k({\bf x},{\bf x'})= ({\bf x}\cdot {\bf x'}+c)^d$

• Sigmoid kernel : Computes $k({\bf x},{\bf x'})=\mbox{tanh}(\gamma {\bf x}\cdot{\bf x'}+c)$
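
The four formulas above can be written down directly in plain Python. This is only a sketch of the mathematics, not of Shogun's implementation; the function names (`k_gaussian`, `k_linear`, `k_polynomial`, `k_sigmoid`) and the parameter values are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def k_gaussian(x, z, tau):
    """exp(-||x - z||^2 / tau), with tau as the kernel width."""
    diff = [xi - zi for xi, zi in zip(x, z)]
    return math.exp(-dot(diff, diff) / tau)


def k_linear(x, z):
    """Plain inner product x . z."""
    return dot(x, z)


def k_polynomial(x, z, c, d):
    """(x . z + c)^d."""
    return (dot(x, z) + c) ** d


def k_sigmoid(x, z, gamma, c):
    """tanh(gamma * x . z + c)."""
    return math.tanh(gamma * dot(x, z) + c)


x = [1.0, 0.0]
z = [0.0, 1.0]
print(k_gaussian(x, z, 2.0), k_linear(x, z),
      k_polynomial(x, z, 1.0, 2), k_sigmoid(x, z, 1.0, 0.0))
```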

Some of these kernels are initialised below.

In [5]:
gaussian_kernel=sg.GaussianKernel(feats_train, feats_train, 100)
#Polynomial kernel of degree 2
poly_kernel=sg.PolyKernel(feats_train, feats_train, 2, True)
linear_kernel=sg.LinearKernel(feats_train, feats_train)

kernels=[linear_kernel, poly_kernel, gaussian_kernel]


Just for fun we compute the kernel matrices and display them. There are clusters visible that are smooth for the Gaussian and polynomial kernels and block-wise for the linear one. The Gaussian one also smoothly decays from some cluster centre while the polynomial one oscillates within the clusters.

In [6]:
plt.jet()
def display_km(kernels, svm):
    plt.figure(figsize=(20,6))
    plt.suptitle('Kernel matrices for different kernels', fontsize=12)
    for i, kernel in enumerate(kernels):
        plt.subplot(1, len(kernels), i+1)
        plt.title(kernel.get_name())
        km=kernel.get_kernel_matrix()
        plt.imshow(km, interpolation="nearest")
        plt.colorbar()

display_km(kernels, svm)


### Prediction using kernel based SVM

Now we train an SVM with a Gaussian kernel. We use LibSVM, but we could use any of the other SVMs in Shogun. They all use the same kernel framework and so are drop-in replacements.

In [7]:
C=1
epsilon=1e-3
svm=sg.LibSVM(C, gaussian_kernel, labels)
_=svm.train()


We can now inspect a number of properties, such as the value of the objective function returned by the particular SVM learning algorithm, or the explicitly computed primal and dual objective functions:

In [8]:
libsvm_obj=svm.get_objective()
primal_obj, dual_obj=svm.compute_svm_primal_objective(), svm.compute_svm_dual_objective()

print(libsvm_obj, primal_obj, dual_obj)

(-9.512855165661737, -9.512859583253785, -9.512855297908345)


Based on these objectives we can compute the duality gap (have a look at reference [2]), a measure of the convergence quality of the SVM training algorithm. In theory it is 0 at the optimum; in practice it is at least close to 0.

In [9]:
print("duality_gap", dual_obj-primal_obj)

('duality_gap', 4.285345440280253e-06)


Let's now apply on the X-Y grid data and plot the results.

In [10]:
out=svm.apply(sg.RealFeatures(grid))
z=out.get_values().reshape((size, size))

#plot
plt.jet()
plt.figure(figsize=(16,6))
plt.subplot(121)
plt.title("Classification")
c=plt.pcolor(x1_, x2_, z)
plt.contour(x1_ , x2_, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)

plt.gray()
plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.xlabel('attribute1')
plt.ylabel('attribute2')
plt.jet()

z=out.get_labels().reshape((size, size))
plt.subplot(122)
plt.title("Decision boundary")
c=plt.pcolor(x1_, x2_, z)
plt.contour(x1_ , x2_, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)

plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=50)
plt.xlabel('attribute1')
plt.ylabel('attribute2')
plt.gray()


### Probabilistic Outputs

Calibrated probabilities can be generated in addition to class predictions using the scores_to_probabilities() method of BinaryLabels, which implements the method described in [3]. This should only be used in conjunction with SVM. A parametric form of a sigmoid function $$\frac{1}{1+\exp(af(x) + b)}$$ is used to fit the outputs. Here $f(x)$ is the signed distance of a sample from the hyperplane, and $a$ and $b$ are the parameters of the sigmoid. This gives us the posterior probabilities $p(y=1|f(x))$.

Let's try this out on the above example. The familiar "S" shape of the sigmoid should be visible.

In [11]:
n=10
x1t_=np.linspace(-5, 15, n)
x2t_=np.linspace(-5, 15, n)
xt, yt=np.meshgrid(x1t_, x2t_)
#Generate X-Y grid test data
test_grid=sg.RealFeatures(np.array((np.ravel(xt), np.ravel(yt))))

labels_out=svm.apply(sg.RealFeatures(test_grid))

#Get values (Distance from hyperplane)
values=labels_out.get_values()

#Get probabilities
labels_out.scores_to_probabilities()
prob=labels_out.get_values()

#plot
plt.gray()
plt.figure(figsize=(10,6))
p1=plt.scatter(values, prob)
plt.title('Probabilistic outputs')
plt.xlabel('Distance from hyperplane')
plt.ylabel('Probability')
plt.legend([p1], ["Test samples"], loc=2)

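
The sigmoid mapping behind these probabilistic outputs can also be sketched in plain Python. In Shogun the parameters $a$ and $b$ are fitted by scores_to_probabilities(); in the sketch below they are simply fixed by hand (`a = -2.0`, `b = 0.0`), and the function name `platt_probability` is hypothetical. Note that $a$ has to be negative for larger scores to give larger probabilities.

```python
import math


def platt_probability(f, a, b):
    """Map a raw SVM score f to 1 / (1 + exp(a*f + b))."""
    return 1.0 / (1.0 + math.exp(a * f + b))


# Hand-picked sigmoid parameters (assumption, not fitted values)
a, b = -2.0, 0.0

scores = [-3.0, -1.0, 0.0, 1.0, 3.0]
probs = [platt_probability(f, a, b) for f in scores]

for f, p in zip(scores, probs):
    print("f(x) = %+.1f  ->  p(y=1|f) = %.4f" % (f, p))
```

With `b = 0` a sample exactly on the hyperplane gets probability 0.5, and the probabilities rise monotonically with the score, which is the "S" shape seen in the plot.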

### Soft margins and slack variables

If there is no clear classification possible using a hyperplane, we need to classify the data as well as possible while incorporating the misclassified samples. To do this, the concept of a soft margin is used. The method introduces non-negative slack variables, $\xi_i$, which measure the degree of misclassification of the data $x_i$. $$y_i(\mathbf{w}\cdot\mathbf{x_i} + b) \ge 1 - \xi_i \quad 1 \le i \le N$$

Introducing a linear penalty function leads to $$\arg\min_{\mathbf{w},\mathbf{\xi}, b } \left(\frac{1}{2} \|\mathbf{w}\|^2 +C \sum_{i=1}^N \xi_i\right)$$

In its dual form, this leads to a slightly modified version of equation $(2)$: \begin{eqnarray} \max_{\bf \alpha} && \sum_{i=1}^{N} \alpha_i - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i y_i \alpha_j y_j k({\bf x_i}, {\bf x_j})\\ \mbox{s.t.} && 0\leq\alpha_i\leq C\\ && \sum_{i=1}^{N} \alpha_i y_i=0 \\ \end{eqnarray}

The result is that a soft-margin SVM may choose a decision boundary with non-zero training error even if the dataset is linearly separable, but it is less likely to overfit.
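
To see how the slack variables and $C$ interact, the following sketch computes $\xi_i = \max(0, 1 - y_i(\mathbf{w}\cdot\mathbf{x_i} + b))$ and the penalised objective for a fixed hyperplane. The data, `w`, `b` and the helper names `slack` and `soft_margin_objective` are illustrative assumptions; no optimisation is performed.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y*(w.x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, X, ys, C):
    """0.5*||w||^2 + C * sum of slacks, for a given (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, y) for x, y in zip(X, ys))


# Toy data: the second point is inside the margin, the last one is misclassified
X = [[2.0, 2.0], [0.5, 0.5], [-2.0, -2.0], [1.0, 1.0]]
ys = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

slacks = [slack(w, b, x, y) for x, y in zip(X, ys)]
print(slacks)
for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, b, X, ys, C))
```

A slack between 0 and 1 marks a point inside the margin but on the correct side, and a slack above 1 marks a misclassified point. The larger $C$ is, the more these violations dominate the objective.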

Here's an example using LibSVM on the data set used above. Highlighted points show support vectors. This should visually show the impact of C and how it controls the number of outliers on the wrong side of the hyperplane.

In [12]:
def plot_sv(C_values):
    plt.figure(figsize=(20,6))
    plt.suptitle('Soft and hard margins with varying C', fontsize=12)
    for i in range(len(C_values)):
        plt.subplot(1, len(C_values), i+1)
        linear_kernel=sg.LinearKernel(feats_train, feats_train)
        svm1=sg.LibSVM(C_values[i], linear_kernel, labels)
        svm1.train()
        vec1=svm1.get_support_vectors()
        X_=[]
        Y_=[]
        new_labels=[]
        for j in vec1:
            X_.append(traindata[0][j])
            Y_.append(traindata[1][j])
            new_labels.append(trainlab[j])
        out1=svm1.apply(sg.RealFeatures(grid))
        z1=out1.get_labels().reshape((size, size))
        plt.jet()
        c=plt.pcolor(x1_, x2_, z1)
        plt.contour(x1_ , x2_, z1, linewidths=1, colors='black', hold=True)
        plt.colorbar(c)
        plt.gray()
        plt.scatter(X_, Y_, c=new_labels, s=150)
        plt.scatter(traindata[0, :], traindata[1,:], c=labels, s=20)
        plt.title('Support vectors for C=%.2f'%C_values[i])
        plt.xlabel('attribute1')
        plt.ylabel('attribute2')

C_values=[0.1, 1000]
plot_sv(C_values)


You can see that a lower value of C causes the classifier to sacrifice linear separability in order to gain stability, in the sense that the influence of any single datapoint is now bounded by C. For a hard-margin SVM, the support vectors are the points which lie "on the margin". In the picture above, C=1000 is pretty close to a hard-margin SVM, and you can see that the highlighted points are the ones that touch the margin. In high dimensions this might lead to overfitting. For a soft-margin SVM, with a lower value of C, it is easier to explain the support vectors in terms of the dual variables (equation $(2)$). Support vectors are the datapoints from the training set which are included in the predictor, i.e., the ones with a non-zero $\alpha_i$ parameter. This includes margin errors and points on the margin of the hyperplane.
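
The reading of support vectors in terms of the dual variables can be sketched as a simple rule on $\alpha_i$, following the usual optimality conditions of the box-constrained dual: $\alpha_i = 0$ means the point is not a support vector, $0 < \alpha_i < C$ means it lies exactly on the margin, and $\alpha_i = C$ means it is a margin error (or, at best, sits on the margin). The `alphas` below are invented for illustration, and `categorise` is a hypothetical helper.

```python
def categorise(alpha, C, tol=1e-8):
    """Classify a training point by its dual variable alpha, with 0 <= alpha <= C."""
    if alpha <= tol:
        return "not a support vector"
    if alpha >= C - tol:
        return "bound support vector"   # margin error, or on the margin at best
    return "margin support vector"      # lies exactly on the margin


C = 0.1
alphas = [0.0, 0.03, 0.1, 0.1, 0.0]  # made-up dual solution

for i, a in enumerate(alphas):
    print(i, a, categorise(a, C))

# Indices that would be highlighted as support vectors
support_indices = [i for i, a in enumerate(alphas) if a > 1e-8]
print(support_indices)
```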

### Binary classification using different kernels

Two-dimensional Gaussians are generated as data for this section.

$x_-\sim{\cal N_2}(0,1)-d$

$x_+\sim{\cal N_2}(0,1)+d$

with corresponding negative and positive labels. We create traindata and testdata, each with num negatively and num positively labelled points, stored in traindata, trainlab and testdata, testlab. For that we use Shogun's Gaussian Mixture Model class (GMM), from which we sample the data points and plot them.

In [13]:
num=50;
dist=1.0;

gmm=sg.GMM(2)
gmm.set_nth_mean(np.array([-dist,-dist]),0)
gmm.set_nth_mean(np.array([dist,dist]),1)
gmm.set_nth_cov(np.array([[1.0,0.0],[0.0,1.0]]),0)
gmm.set_nth_cov(np.array([[1.0,0.0],[0.0,1.0]]),1)

gmm.set_coef(np.array([1.0,0.0]))
xntr=np.array([gmm.sample() for i in range(num)]).T

gmm.set_coef(np.array([0.0,1.0]))
xptr=np.array([gmm.sample() for i in range(num)]).T

traindata=np.concatenate((xntr,xptr), axis=1)
trainlab=np.concatenate((-np.ones(num), np.ones(num)))

#shogun format features
feats_train=sg.RealFeatures(traindata)
labels=sg.BinaryLabels(trainlab)

In [14]:
gaussian_kernel=sg.GaussianKernel(feats_train, feats_train, 10)
#Polynomial kernel of degree 2
poly_kernel=sg.PolyKernel(feats_train, feats_train, 2, True)
linear_kernel=sg.LinearKernel(feats_train, feats_train)

kernels=[gaussian_kernel, poly_kernel, linear_kernel]

In [15]:
#train machine
C=1
svm=sg.LibSVM(C, gaussian_kernel, labels)
_=svm.train()


Now let's plot the contour output on a $-5 \ldots +5$ grid for

1. The Support Vector Machine's decision function $\mbox{sign}(f(x))$
2. The Support Vector Machine's raw output $f(x)$
3. The original Gaussian Mixture Model distribution

In [16]:
size=100
x1=np.linspace(-5, 5, size)
x2=np.linspace(-5, 5, size)
x, y=np.meshgrid(x1, x2)
grid=sg.RealFeatures(np.array((np.ravel(x), np.ravel(y))))
grid_out=svm.apply(grid)
z=grid_out.get_labels().reshape((size, size))

plt.jet()
plt.figure(figsize=(16,5))

z=grid_out.get_values().reshape((size, size))

plt.subplot(121)
plt.title('Classification')
c=plt.pcolor(x, y, z)
plt.contour(x, y, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)

plt.subplot(122)
plt.title('Original distribution')
gmm.set_coef(np.array([1.0,0.0]))
gmm.set_features(grid)
grid_out=gmm.get_likelihood_for_all_examples()
zn=grid_out.reshape((size, size))
gmm.set_coef(np.array([0.0,1.0]))
grid_out=gmm.get_likelihood_for_all_examples()
zp=grid_out.reshape((size, size))
z=zp-zn
c=plt.pcolor(x, y, z)
plt.contour(x, y, z, linewidths=1, colors='black', hold=True)
plt.colorbar(c)


And voila! The SVM decision rule reasonably distinguishes the red from the blue points. Despite being optimized to learn the discriminative function that maximizes the margin, the SVM output qualitatively, if only remotely, resembles the original distribution of the Gaussian mixture model.

Let us visualise the output using different kernels.

In [17]:
def plot_outputs(kernels):
    plt.figure(figsize=(20,5))
    plt.suptitle('Binary Classification using different kernels', fontsize=12)
    for i in range(len(kernels)):
        plt.subplot(1,len(kernels),i+1)
        plt.title(kernels[i].get_name())
        svm.set_kernel(kernels[i])
        svm.train()
        grid_out=svm.apply(grid)
        z=grid_out.get_values().reshape((size, size))
        c=plt.pcolor(x, y, z)
        plt.contour(x, y, z, linewidths=1, colors='black', hold=True)
        plt.colorbar(c)
        plt.scatter(traindata[0,:], traindata[1,:], c=trainlab, s=35)

plot_outputs(kernels)