In [1]:

```
%pylab inline
%matplotlib inline
#To import all Shogun classes
from modshogun import *
```

In the general supervised learning setting, the goal is to learn a mapping from inputs $x_i\in\mathcal{X}$ to outputs $y_i \in \mathcal{Y}$, given a labeled set of input-output pairs $\mathcal{D} = \{(x_i,y_i)\}_{i=1}^{N} \subseteq \mathcal{X} \times \mathcal{Y}$. Here $\mathcal{D}$ is called the training set and $N$ is the number of training examples. In the simplest setting, each training input $x_i$ is a $D$-dimensional vector of numbers, representing, say, the height and weight of a person. These are called $\textbf{features}$, attributes or covariates. In general, however, $x_i$ could be a complex structured object, such as an image.

- When the response variable $y_i$ is categorical and discrete, $y_i \in \{1,...,C\}$ (say male or female), it is a classification problem.
- When it is continuous (say the prices of houses), it is a regression problem.

A real-world dataset, the Pima Indians Diabetes data set, is used now. We load the `LibSVM` format file using Shogun's `LibSVMFile` class. The `LibSVM` format is:

$$\text{label}\ \ \text{attribute1:value1}\ \ \text{attribute2:value2}\ \dots$$

LibSVM uses the so-called "sparse" format, where zero values do not need to be stored.
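This format can be parsed with a few lines of plain Python (a minimal sketch; `parse_libsvm_line` is a hypothetical helper for illustration, not part of Shogun):

```python
def parse_libsvm_line(line):
    """Parse one line of LibSVM format into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    # Entries look like "index:value"; indices absent from the line are
    # implicitly zero -- that is the "sparse" convention.
    features = {}
    for item in parts[1:]:
        idx, val = item.split(':')
        features[int(idx)] = float(val)
    return label, features

label, feats = parse_libsvm_line("-1 1:0.5 3:-0.2")
```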

In [2]:

```
#Load the file
data_file=LibSVMFile('../../../data/uci/diabetes/diabetes_scale.svm')
```

This results in a LibSVMFile object which we will later use to access the data.

To get started, let us see how Shogun handles the attributes of the data using the `CFeatures` class. Shogun supports a wide range of feature representations. We believe it is a good idea to have different forms of data, rather than converting them all into matrices. Among these are:

- String features: Implement a list of strings, not limited to character strings; they could also be sequences of floating point numbers etc., and may have varying dimensions.
- Dense features: Implement dense feature matrices.
- Sparse features: Implement sparse matrices.
- Streaming features: For algorithms working on data streams (which are too large to fit into memory).

`SparseRealFeatures` (sparse features handling `64 bit float` type data) are used to get the data from the file. Since `LibSVM` format files have labels included in the file, the `load_with_labels` method of `SparseRealFeatures` is used. In this case it is interesting to play with two attributes, plasma glucose concentration and body mass index (BMI), and try to learn something about their relationship with the disease. We get hold of the feature matrix using `get_full_feature_matrix`, and row vectors 1 and 5 are extracted. These are the attributes we are interested in.

In [3]:

```
f=SparseRealFeatures()
trainlab=f.load_with_labels(data_file)
mat=f.get_full_feature_matrix()
#extract 2 attributes
glucose_conc=mat[1]
BMI=mat[5]
#generate a numpy array
feats=array(glucose_conc)
feats=vstack((feats, array(BMI)))
print feats, feats.shape
```

`RealFeatures` are used, which are nothing but the above-mentioned dense features of `64 bit float` type. To create them, call `RealFeatures` with the matrix (this should be a 64-bit 2D numpy array) as the argument.

In [4]:

```
#convert to shogun format
feats_train=RealFeatures(feats)
```

Some of the general methods you might find useful are:

- `get_feature_matrix()`: The feature matrix can be accessed using this.
- `get_num_features()`: The total number of attributes can be accessed using this.
- `get_num_vectors()`: To get the total number of samples in the data.
- `get_feature_vector()`: To get all the attribute values (a.k.a. the feature vector) for a particular sample by passing the index of the sample as argument.

In [5]:

```
#Get number of features(attributes of data) and num of vectors(samples)
feat_matrix=feats_train.get_feature_matrix()
num_f=feats_train.get_num_features()
num_s=feats_train.get_num_vectors()
print('Number of attributes: %s and number of samples: %s' %(num_f, num_s))
print('Number of rows of feature matrix: %s and number of columns: %s' %(feat_matrix.shape[0], feat_matrix.shape[1]))
print('First column of feature matrix (Data for first individual):')
print feats_train.get_feature_vector(0)
```

In supervised learning problems, training data is labelled. Shogun provides various types of labels to do this through the `CLabels` class. Some of these are:

- Binary labels: Binary Labels for binary classification which can have values +1 or -1.
- Multiclass labels: Multiclass Labels for multi-class classification which can have values from 0 to (num. of classes-1).
- Regression labels: Real-valued labels used for regression problems and are returned as output of classifiers.
- Structured labels: Class of the labels used in Structured Output (SO) problems

In this particular problem, our data can be of two types: diabetic or non-diabetic, so we need binary labels. This makes it a Binary Classification problem, where the data has to be classified in two groups.

In [6]:

```
#convert to shogun format labels
labels=BinaryLabels(trainlab)
```

The labels can be accessed using `get_labels` and the confidence vector using `get_values`. The total number of labels is available using `get_num_labels`.

In [7]:

```
n=labels.get_num_labels()
print 'Number of labels:', n
```

It is usually better to preprocess data to a standard form rather than handling it in raw form. The reasons are having well-behaved scaling, that many algorithms assume centered data, and that sometimes one wants to de-noise data (with, say, PCA). Preprocessors do not change the domain of the input features. Various types of preprocessing are possible using methods provided by the `CPreprocessor` class. Some of these are:

- Norm one: Normalize vector to have norm 1.
- PruneVarSubMean: Subtract the mean and remove features that have zero variance.
- Dimension Reduction: Lower the dimensionality of given simple features.
- PCA: Principal component analysis.
- Kernel PCA: PCA using kernel methods.
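To illustrate what such a preprocessor computes, here is a plain-NumPy sketch of PCA (independent of Shogun's implementation; `pca` is a hypothetical helper for illustration):

```python
import numpy as np

def pca(X, k):
    """Project the (n_samples, n_features) array X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                # center the data
    cov = np.cov(Xc, rowvar=False)         # feature covariance matrix
    _, vecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    top = vecs[:, ::-1][:, :k]             # eigenvectors of the k largest eigenvalues
    return np.dot(Xc, top)

X = np.random.RandomState(0).randn(50, 3)
Z = pca(X, 2)                              # 50 samples reduced to 2 dimensions
```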

Passing `True` to the constructor makes the class normalise the variance of the variables. It basically divides every dimension by its standard deviation. This is the reason behind removing dimensions with constant values. It is required to initialize the preprocessor by passing the feature object to `init` before doing anything else. The raw and processed data is now plotted.

In [8]:

```
preproc=PruneVarSubMean(True)
preproc.init(feats_train)
feats_train.add_preprocessor(preproc)
feats_train.apply_preprocessor()
# Store preprocessed feature matrix.
preproc_data=feats_train.get_feature_matrix()
```
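The effect of `PruneVarSubMean(True)` as described above can be sketched in plain NumPy (assuming standard mean-removal and variance normalization; this is an illustration, not Shogun's code):

```python
import numpy as np

def prune_var_sub_mean(mat):
    """Center each row (feature) and divide by its standard deviation,
    dropping rows with zero variance. Shogun stores one feature per row,
    one sample per column."""
    mean = mat.mean(axis=1, keepdims=True)
    std = mat.std(axis=1, keepdims=True)
    keep = (std != 0).ravel()              # drop constant features
    return (mat[keep] - mean[keep]) / std[keep]

raw = np.array([[1.0, 2.0, 3.0],
                [5.0, 5.0, 5.0]])          # second feature is constant
processed = prune_var_sub_mean(raw)        # one row remains, zero mean, unit variance
```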

In [9]:

```
# Plot the raw training data.
figure(figsize=(13,6))
pl1=subplot(121)
gray()
_=scatter(feats[0, :], feats[1,:], c=trainlab, s=50)
vlines(0, -1, 1, linestyle='solid', linewidths=2)
hlines(0, -1, 1, linestyle='solid', linewidths=2)
title("Raw Training Data")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl1.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
#Plot preprocessed data.
pl2=subplot(122)
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, s=50)
vlines(0, -5, 5, linestyle='solid', linewidths=2)
hlines(0, -5, 5, linestyle='solid', linewidths=2)
title("Training data after preprocessing")
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
pl2.legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
gray()
```

`CMachine` is Shogun's interface for general learning machines. Basically, one has to `train()` the machine on some training data to be able to learn from it. Then we `apply()` it to test data to get predictions. Some of the machines available are:

- Kernel machine: Kernel-based learning tools.
- Linear machine: Interface for all kinds of linear machines, like classifiers.
- Distance machine: A distance machine is based on an a-priori chosen distance.
- Gaussian process machine: A base class for Gaussian Processes.
- And many more.

In [10]:

```
#parameters for svm
C=0.9
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
#train
svm.train()
size=100
```
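For reference, LibLinear's `L2R_L2LOSS_SVC` solver minimizes the L2-regularized squared hinge loss (the bias term is handled internally by LibLinear):

$$\min_{\mathbf w}\ \frac{1}{2}\mathbf w^\top \mathbf w + C\sum_{i=1}^{N}\max\left(0,\ 1 - y_i\,\mathbf w^\top \mathbf x_i\right)^2$$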

We now `apply` the trained machine on test features to get predictions. For visualising the classification boundary, the whole X-Y grid is used as test data, i.e. we predict the class on every point in the grid.

In [11]:

```
x1=linspace(-5.0, 5.0, size)
x2=linspace(-5.0, 5.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#apply on test grid
predictions = svm.apply(grid)
#get output values (confidences) for each grid point
z=predictions.get_values().reshape((size, size))
#plot
jet()
figure(figsize=(9,6))
title("Classification")
c=pcolor(x, y, z)
_=contour(x, y, z, linewidths=1, colors='black', hold=True)
_=colorbar(c)
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
gray()
```

The weight vector and bias of the trained linear machine are retrieved using `get_w()` and `get_bias()`. These define the decision boundary, which can then be plotted directly.

In [12]:

```
w=svm.get_w()
b=svm.get_bias()
x1=linspace(-2.0, 3.0, 100)
#solve for w.x+b=0
def solve(x1):
    return -((w[0]*x1 + b)/w[1])
x2=map(solve, x1)
#plot
figure(figsize=(7,6))
plot(x1,x2, linewidth=2)
title("Decision boundary using w and bias")
_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)
_=xlabel('Plasma glucose concentration')
_=ylabel('Body mass index')
p1 = Rectangle((0, 0), 1, 1, fc="w")
p2 = Rectangle((0, 0), 1, 1, fc="k")
legend((p1, p2), ["Non-diabetic", "Diabetic"], loc=2)
print 'w :', w
print 'b :', b
```

How do you assess the quality of a prediction? Shogun provides various ways to do this using the `CEvaluation` class. The performance is evaluated by comparing the predicted output and the expected output. Some of the base classes for performance measures are:

- Binary class evaluation: used to evaluate binary classification labels.
- Clustering evaluation: used to evaluate clustering.
- Mean absolute error: used to compute an error of regression model.
- Multiclass accuracy: used to compute accuracy of multiclass classification.

Evaluating on training data should be avoided, since the learner may adjust to very specific random features of the training data which are not important to the general relation. This is called overfitting. Maximising performance on the training examples usually results in algorithms explaining the noise in the data (rather than actual patterns), which leads to bad performance on unseen data. The dataset will now be split into two: we train on one part and evaluate performance on the other using `CAccuracyMeasure`.

In [13]:

```
#split features for training and evaluation
num_train=700
feats=array(glucose_conc)
feats_t=feats[:num_train]
feats_e=feats[num_train:]
feats=array(BMI)
feats_t1=feats[:num_train]
feats_e1=feats[num_train:]
feats_t=vstack((feats_t, feats_t1))
feats_e=vstack((feats_e, feats_e1))
feats_train=RealFeatures(feats_t)
feats_evaluate=RealFeatures(feats_e)
```

Let's see the accuracy by applying the trained machine on the test features.

In [14]:

```
label_t=trainlab[:num_train]
labels=BinaryLabels(label_t)
label_e=trainlab[num_train:]
labels_true=BinaryLabels(label_e)
svm=LibLinear(C, feats_train, labels)
svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)
#train and evaluate
svm.train()
output=svm.apply(feats_evaluate)
#use AccuracyMeasure to get accuracy
acc=AccuracyMeasure()
acc.evaluate(output,labels_true)
accuracy=acc.get_accuracy()*100
print 'Accuracy(%):', accuracy
```
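`AccuracyMeasure` reports the fraction of predicted labels that match the true labels; the same quantity in plain NumPy (an illustrative sketch, not Shogun's code):

```python
import numpy as np

def accuracy(predicted, true):
    """Fraction of predictions that match the true labels."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    # Comparing elementwise gives a boolean array; its mean is the accuracy.
    return np.mean(predicted == true)

acc = accuracy([1, -1, 1, 1], [1, -1, -1, 1])   # 3 of 4 correct
```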

The next task is to estimate the prices of houses in Boston using the Boston Housing dataset provided by the StatLib library. The attributes used are: weighted distances to employment centres, and percentage of lower status of the population. Let us see if we can find a good relationship between the pricing of houses and these attributes. This type of problem is solved using regression analysis.

In [15]:

```
temp_feats=RealFeatures(CSVFile('../../../data/uci/housing/fm_housing.dat'))
labels=RegressionLabels(CSVFile('../../../data/uci/housing/housing_label.dat'))
#rescale to 0...1
preproc=RescaleFeatures()
preproc.init(temp_feats)
temp_feats.add_preprocessor(preproc)
temp_feats.apply_preprocessor(True)
mat = temp_feats.get_feature_matrix()
dist_centres=mat[7]
lower_pop=mat[12]
feats=array(dist_centres)
feats=vstack((feats, array(lower_pop)))
print feats, feats.shape
#convert to shogun format features
feats_train=RealFeatures(feats)
```

In [16]:

```
from mpl_toolkits.mplot3d import Axes3D
size=100
x1=linspace(0, 1.0, size)
x2=linspace(0, 1.0, size)
x, y=meshgrid(x1, x2)
#Generate X-Y grid test data
grid=RealFeatures(array((ravel(x), ravel(y))))
#Train on data(both attributes) and predict
width=1.0
tau=0.5
kernel=GaussianKernel(feats_train, feats_train, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train)
kernel.init(feats_train, grid)
out = krr.apply().get_labels()
```
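Kernel ridge regression admits the closed-form solution $\alpha=(K+\tau I)^{-1}y$, with predictions $k_*^\top\alpha$. A plain-NumPy sketch (illustrative only; the function names are hypothetical, and Shogun's `GaussianKernel` may parametrize the width differently):

```python
import numpy as np

def gaussian_kernel(A, B, width):
    """Gaussian kernel matrix between the columns of A (d, n) and B (d, m)."""
    sq = ((A[:, :, None] - B[:, None, :]) ** 2).sum(axis=0)  # squared distances
    return np.exp(-sq / width)

def krr_fit_predict(X_train, y, X_test, tau, width):
    """Fit kernel ridge regression and predict on X_test."""
    K = gaussian_kernel(X_train, X_train, width)
    alpha = np.linalg.solve(K + tau * np.eye(K.shape[0]), y)  # (K + tau*I)^-1 y
    return gaussian_kernel(X_train, X_test, width).T.dot(alpha)

# Tiny 1-D example: fit y = x on three points and predict at x = 0.5
X = np.array([[0.0, 1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
pred = krr_fit_predict(X, y, np.array([[0.5]]), tau=1e-6, width=1.0)
```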

The `out` variable now contains the predicted house prices over the grid of both attributes. Below is an attempt to establish such a relationship for each attribute individually. Separate feature instances are created for each attribute. You could skip the code and have a look at the plots directly if you just want the essence.

In [17]:

```
#create feature objects for individual attributes.
feats_test=RealFeatures(x1.reshape(1,len(x1)))
feats_t0=array(dist_centres)
feats_train0=RealFeatures(feats_t0.reshape(1,len(feats_t0)))
feats_t1=array(lower_pop)
feats_train1=RealFeatures(feats_t1.reshape(1,len(feats_t1)))
#Regression with first attribute
kernel=GaussianKernel(feats_train0, feats_train0, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train0)
kernel.init(feats_train0, feats_test)
out0 = krr.apply().get_labels()
#Regression with second attribute
kernel=GaussianKernel(feats_train1, feats_train1, width)
krr=KernelRidgeRegression(tau, kernel, labels)
krr.train(feats_train1)
kernel.init(feats_train1, feats_test)
out1 = krr.apply().get_labels()
```

In [18]:

```
#Visualization of regression
fig=figure(figsize=(20,6))
#first plot with only one attribute
fig.add_subplot(131)
title("Regression with 1st attribute")
_=scatter(feats[0, :], labels.get_labels(), c=ones(feats.shape[1]), cmap=gray(), s=20)
_=xlabel('Weighted distances to employment centres ')
_=ylabel('Median value of homes')
_=plot(x1,out0, linewidth=3)
#second plot with only one attribute
fig.add_subplot(132)
title("Regression with 2nd attribute")
_=scatter(feats[1, :], labels.get_labels(), c=ones(feats.shape[1]), cmap=gray(), s=20)
_=xlabel('% lower status of the population')
_=ylabel('Median value of homes')
_=plot(x1,out1, linewidth=3)
#Both attributes and regression output
ax=fig.add_subplot(133, projection='3d')
z=out.reshape((size, size))
gray()
title("Regression")
ax.plot_wireframe(y, x, z, linewidths=2, alpha=0.4)
ax.set_xlabel('% lower status of the population')
ax.set_ylabel('Distances to employment centres ')
ax.set_zlabel('Median value of homes')
ax.view_init(25, 40)
```