# Multi-Label Classification with Shogun Machine Learning Toolbox¶

#### Thanks to Thoralf Klein for taking the time to help me on this project! ;)¶

This notebook presents multi-label classification training using the structured SVM framework in Shogun. We will use the MultilabelModel class for multi-label classification.

We begin with a brief introduction to Multi-Label Structured Prediction [1], followed by the corresponding API in Shogun. Then we implement a toy example (for illustration) before getting to a real one. Finally, we evaluate multi-label classification on well-known datasets [2]. We show that SHOGUN's [3] implementation delivers the same accuracy as scikit-learn, with the same or better training time.

## Introduction¶

### Multi-Label Structured Prediction¶

Multi-Label Structured Prediction combines the aspects of multi-label prediction and structured output. Structured prediction typically involves an input $\mathbf{x}$ (which can itself be structured) and a structured output $\mathbf{y}$. We are given a training set $\{(x^i, y^i)\}_{i=1,...,n} \subset \mathcal{X} \times \mathbb{P}(\mathcal{Y})$, where $\mathcal{Y}$ is a structured output set of potentially very large size (in this case $\mathcal{Y} = \{y_1, y_2, ...., y_q\}$, where $q$ is the total number of possible classes). A joint feature map $\psi(x, y)$ is defined to incorporate structure information into the labels.

The joint feature map $\psi(x, y)$ for MultilabelModel is defined as $\psi(x, y) \rightarrow x \otimes y$ where $\otimes$ is the tensor product.

We formulate the prediction as:
$h(x) = \{y \in \mathcal{Y} : f(x, y) > 0\}$

The compatibility function, $f(x, y)$, acts on individual inputs and outputs, as in single-label prediction, but the prediction step consists of collecting all outputs of positive scores instead of finding the outputs of maximal score.
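The feature map and the positive-score prediction rule above can be sketched in pure NumPy (illustrative only, independent of the Shogun API; the weight vector here is made up):

```python
import numpy as np

def joint_feature_map(x, y_indicator):
    # psi(x, y) = x (tensor) y: one copy of x per class slot of the indicator
    return np.kron(y_indicator, x)

def predict(w, x, n_classes):
    # f(x, y) = <w, psi(x, y)>; collect every class with a positive score,
    # instead of taking only the argmax as in single-label prediction
    labels = []
    for k in range(n_classes):
        e_k = np.eye(n_classes)[k]  # indicator vector for class k
        if w.dot(joint_feature_map(x, e_k)) > 0:
            labels.append(k)
    return labels

# hypothetical 2-class weight vector, laid out as [w_class0 || w_class1]
w = np.array([1.0, -1.0, -1.0, 1.0])
x = np.array([2.0, 1.0])
print(predict(w, x, 2))  # class 0 scores 1 > 0, class 1 scores -1 < 0
```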

### Multi-Label Models¶

In this notebook, we are going to compare the performance of two multi-label models:

• MultilabelModel model : with a constant entry $0$ in the joint feature vector, so that no bias term is modelled.
• MultilabelModel model_with_bias : with a constant entry $1$ in the joint feature vector, to model a bias term.

The joint feature vectors are:

• model$\leftrightarrow \psi(x, y) = [x || 0] \otimes y$.
• model_with_bias$\leftrightarrow \psi(x, y) = [x || 1] \otimes y$.
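A minimal NumPy sketch of the two feature maps (illustrative, not Shogun's internal code): the input is padded with a constant $0$ or $1$ before the tensor product, so the weight attached to the constant slot can act as a per-class bias.

```python
import numpy as np

def psi(x, y_indicator, constant):
    # [x || constant] (tensor) y
    return np.kron(y_indicator, np.append(x, constant))

x = np.array([0.5, 2.0])
y = np.array([0.0, 1.0])  # indicator for class 1

print(psi(x, y, 0))  # model:           the padded slot stays 0, no bias can be learned
print(psi(x, y, 1))  # model_with_bias: the padded slot is 1, its weight acts as a bias
```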

To compare the two models, we evaluate them on datasets with binary labels.

## Experiment 1 : Binary Label Data¶

### Generation of some synthetic data¶

First of all, we create some synthetic data for our toy example. We add a static offset to the data to compare the models with and without a threshold.

In [1]:
from __future__ import print_function

import os
SHOGUN_DATA_DIR = os.getenv('SHOGUN_DATA_DIR', '../../../data')

try:
    from sklearn.datasets import make_classification
except ImportError:
    import pip
    pip.main(['install', '--user', 'scikit-learn'])
    from sklearn.datasets import make_classification

import numpy as np

X, Y = make_classification(n_samples=1000,
                           n_features=2,
                           n_informative=2,
                           n_redundant=0,
                           n_clusters_per_class=2)

# adding some static offset to the data
X = X + 1


### Preparation of data and model¶

To create a multi-label model in Shogun, we first create an instance of MultilabelModel and initialize it with the features and labels. The labels should be MultilabelSOLabels, initialized with n_labels (the number of examples) and n_classes (the total number of classes); the individual labels are then added using the set_sparse_label() method.

In [2]:
from shogun import RealFeatures, MultilabelSOLabels, MultilabelModel

def create_features(X, constant):
    features = RealFeatures(
        np.c_[X, constant * np.ones(X.shape[0])].T)

    return features

def create_labels(Y, n_classes):
    try:
        n_samples = Y.shape[0]
    except AttributeError:
        n_samples = len(Y)

    labels = MultilabelSOLabels(n_samples, n_classes)
    for i, sparse_label in enumerate(Y):
        try:
            sparse_label = sorted(sparse_label)
        except TypeError:
            sparse_label = [sparse_label]
        labels.set_sparse_label(i, np.array(sparse_label, dtype=np.int32))

    return labels

def split_data(X, Y, ratio):
    num_samples = X.shape[0]
    train_samples = int(ratio * num_samples)
    return (X[:train_samples], Y[:train_samples],
            X[train_samples:], Y[train_samples:])

In [3]:
X_train, Y_train, X_test, Y_test = split_data(X, Y, 0.9)

feats_0 = create_features(X_train, 0)
feats_1 = create_features(X_train, 1)
labels = create_labels(Y_train, 2)

model = MultilabelModel(feats_0, labels)
model_with_bias = MultilabelModel(feats_1, labels)


### Training and Evaluation of Structured Machines with/without Threshold¶

In Shogun, several batch and online solvers have been implemented for SO-learning. Let's train the models using the online solver StochasticSOSVM.

In [4]:
from shogun import StochasticSOSVM, DualLibQPBMSOSVM, StructuredAccuracy, LabelsFactory
from time import time

sgd = StochasticSOSVM(model, labels)
sgd_with_bias = StochasticSOSVM(model_with_bias, labels)

start = time()
sgd.train()
print(">>> Time taken for SGD *without* threshold tuning = %f" % (time() - start))
start = time()
sgd_with_bias.train()
print(">>> Time taken for SGD *with* threshold tuning    = %f" % (time() - start))

>>> Time taken for SGD *without* threshold tuning = 0.590497
>>> Time taken for SGD *with* threshold tuning    = 0.600128


### Accuracy¶

For measuring accuracy in multi-label classification, the Jaccard similarity coefficient $\big(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\big)$ is used:
$Accuracy = \frac{1}{p}\sum_{i=1}^{p}\frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$
This is available as MultilabelAccuracy for MultilabelLabels and StructuredAccuracy for MultilabelSOLabels.
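The accuracy formula above can be sketched in plain Python with sets (independent of the Shogun evaluators; the empty-set convention here is our own assumption):

```python
def jaccard_accuracy(true_labels, predicted_labels):
    # mean of |Y_i intersect h(x_i)| / |Y_i union h(x_i)| over all samples
    total = 0.0
    for y_true, y_pred in zip(true_labels, predicted_labels):
        a, b = set(y_true), set(y_pred)
        # if both sets are empty, count the sample as a perfect match
        total += len(a & b) / float(len(a | b)) if (a | b) else 1.0
    return total / len(true_labels)

print(jaccard_accuracy([[0, 1], [1]], [[1], [1]]))  # (1/2 + 1/1) / 2 = 0.75
```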

In [5]:
def evaluate_machine(machine,
                     X_test,
                     Y_test,
                     n_classes,
                     bias):
    if bias:
        feats_test = create_features(X_test, 1)
    else:
        feats_test = create_features(X_test, 0)

    test_labels = create_labels(Y_test, n_classes)

    out_labels = LabelsFactory.to_structured(machine.apply(feats_test))
    evaluator = StructuredAccuracy()
    jaccard_similarity_score = evaluator.evaluate(out_labels, test_labels)

    return jaccard_similarity_score

In [6]:
print(">>> Accuracy of SGD *without* threshold tuning   = %f " % evaluate_machine(sgd, X_test, Y_test, 2, False))
print(">>> Accuracy of SGD *with* threshold tuning      = %f " % evaluate_machine(sgd_with_bias, X_test, Y_test, 2, True))

>>> Accuracy of SGD *without* threshold tuning   = 0.800000
>>> Accuracy of SGD *with* threshold tuning      = 0.850000


### Plotting the Data along with the Boundary¶

In [7]:
import matplotlib.pyplot as plt
%matplotlib inline

def get_parameters(weights):
    return -weights[0]/weights[1], -weights[2]/weights[1]

def scatter_plot(X, y):
    zeros_class = np.where(y == 0)
    ones_class = np.where(y == 1)
    plt.scatter(X[zeros_class, 0], X[zeros_class, 1], c='b', label="Negative Class")
    plt.scatter(X[ones_class, 0], X[ones_class, 1], c='r', label="Positive Class")

def plot_hyperplane(machine_0,
                    machine_1,
                    label_0,
                    label_1,
                    title,
                    X, y):
    scatter_plot(X, y)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 1000)

    m_0, c_0 = get_parameters(machine_0.get_w())
    m_1, c_1 = get_parameters(machine_1.get_w())
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "k--", label=label_0)
    plt.plot(xx, yy_1, "g-", label=label_1)

    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.grid()
    plt.legend(loc="best")
    plt.title(title)
    plt.show()


In [8]:
fig = plt.figure(figsize=(10, 10))
plot_hyperplane(sgd, sgd_with_bias,
                "Boundary for machine *without* bias for class 0",
                "Boundary for machine *with* bias for class 0",
                "Binary Classification using SO-SVM with/without threshold tuning",
                X, Y)


As we can see from the above plot, sgd_with_bias produces a better classification boundary. The boundary of the model without threshold tuning passes through the origin, while the one with threshold tuning passes through $(1, 1)$ (the constant offset we added earlier).

In [9]:
from shogun import SparseMultilabel_obtain_from_generic

def plot_decision_plane(machine,
                        title,
                        X, y, bias):
    plt.figure(figsize=(24, 8))
    plt.suptitle(title)
    plt.subplot(1, 2, 1)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 200)
    yy = np.linspace(y_min, y_max, 200)
    x_mesh, y_mesh = np.meshgrid(xx, yy)

    if bias:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 1)
    else:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 0)
    out_labels = machine.apply(feats)
    z = []
    for i in range(out_labels.get_num_labels()):
        label = SparseMultilabel_obtain_from_generic(out_labels.get_label(i)).get_data()
        if label.shape[0] == 1:
            # predicted a single label
            z.append(label[0])
        elif label.shape[0] == 2:
            # predicted both the classes
            z.append(2)
        elif label.shape[0] == 0:
            # predicted none of the classes
            z.append(3)
    z = np.array(z)
    z = z.reshape(x_mesh.shape)
    c = plt.pcolor(x_mesh, y_mesh, z, cmap=plt.cm.gist_heat)
    scatter_plot(X, y)
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.colorbar(c)
    plt.title("Decision Surface")
    plt.legend(loc="best")

    plt.subplot(1, 2, 2)
    weights = machine.get_w()
    m_0, c_0 = get_parameters(weights[:3])
    m_1, c_1 = get_parameters(weights[3:])
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "r--", label="Boundary for class 0")
    plt.plot(xx, yy_1, "g-", label="Boundary for class 1")
    plt.title("Hyper planes for different classes")
    plt.legend(loc="best")
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))

    plt.show()

In [10]:
plot_decision_plane(sgd, "Model *without* Threshold Tuning", X, Y, False)
plot_decision_plane(sgd_with_bias, "Model *with* Threshold Tuning", X, Y, True)


As we can see from the above plots of the decision surface, the black region corresponds to the negative class (label = $0$), whereas the red region corresponds to the positive class (label = $1$). Along with these, there are some (very small) white and orange regions. The white surface corresponds to points not classified to any label, whereas the orange region corresponds to points classified to both labels. These regions exist because the boundaries for the two classes do not overlap exactly (as illustrated above). So there are some regions where both compatibility functions are positive, $f(x, 0) > 0$ and $f(x, 1) > 0$ (both labels predicted), and some regions where both are negative, $f(x, 0) < 0$ and $f(x, 1) < 0$ (no label predicted).

## Experiment 2 : Multi-Label Data¶

In [11]:
def load_data(file_name):
    input_file = open(file_name)
    lines = input_file.readlines()
    input_file.close()
    n_samples = len(lines)
    n_features = len(lines[0].split()) - 1
    Y = []
    X = []
    for line in lines:
        data = line.split()
        Y.append(list(map(int, data[0].split(","))))
        feats = []
        for feat in data[1:]:
            feats.append(float(feat.split(":")[1]))
        X.append(feats)
    X = np.array(X)
    n_classes = max(max(label) for label in Y) + 1
    return X, Y, n_samples, n_features, n_classes
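For reference, load_data expects LibSVM-style multi-label lines: comma-separated labels first, then index:value feature pairs. A hypothetical two-line sample, parsed the same way as above:

```python
# hypothetical sample in the format load_data parses
sample = ["0,2 1:0.5 2:1.5",
          "1 1:2.0 2:0.1"]

for line in sample:
    parts = line.split()
    labels = [int(l) for l in parts[0].split(",")]          # label set of the sample
    feats = [float(p.split(":")[1]) for p in parts[1:]]     # feature values only
    print(labels, feats)
```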


### Training and Evaluation of Structured Machines with/without Threshold¶

In [12]:
def test_multilabel_data(train_file,
                         test_file):
    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)
    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)

    # create features and labels
    multilabel_feats_0 = create_features(X_train, 0)
    multilabel_feats_1 = create_features(X_train, 1)
    multilabel_labels = create_labels(Y_train, n_classes)

    # create multi-label models
    multilabel_model = MultilabelModel(multilabel_feats_0, multilabel_labels)
    multilabel_model_with_bias = MultilabelModel(multilabel_feats_1, multilabel_labels)

    # initializing machines for SO-learning
    multilabel_sgd = StochasticSOSVM(multilabel_model, multilabel_labels)
    multilabel_sgd_with_bias = StochasticSOSVM(multilabel_model_with_bias, multilabel_labels)

    start = time()
    multilabel_sgd.train()
    t1 = time() - start
    multilabel_sgd_with_bias.train()
    t2 = time() - start - t1

    return (evaluate_machine(multilabel_sgd,
                             X_test, Y_test,
                             n_classes, False), t1,
            evaluate_machine(multilabel_sgd_with_bias,
                             X_test, Y_test,
                             n_classes, True), t2)



### Comparison with scikit-learn's implementation¶

In [13]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import jaccard_similarity_score
from sklearn.preprocessing import LabelBinarizer

def sklearn_implementation(train_file,
                           test_file):
    label_binarizer = LabelBinarizer()

    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)
    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)

    clf = OneVsRestClassifier(SVC(kernel='linear'))
    start = time()
    clf.fit(X_train, label_binarizer.fit_transform(Y_train))
    t1 = time() - start
    return (jaccard_similarity_score(label_binarizer.fit_transform(Y_test),
                                     clf.predict(X_test)), t1)

In [14]:
def print_table(train_file,
                test_file,
                caption):
    acc_0, t1, acc_1, t2 = test_multilabel_data(train_file,
                                                test_file)
    sk_acc, sk_t1 = sklearn_implementation(train_file,
                                           test_file)
    result = '''
\t\t%s
Machine\t\t\t\tAccuracy\tTrain-time\n
SGD *without* threshold tuning \t%f \t%f
SGD *with* threshold tuning \t%f \t%f
scikit-learn's implementation \t%f \t%f
''' % (caption, acc_0, t1, acc_1, t2,
       sk_acc, sk_t1)
    print(result)


### Yeast Multi-Label Data [2]¶

In [15]:
print_table(os.path.join(SHOGUN_DATA_DIR, "multilabel/yeast_train.svm"),
            os.path.join(SHOGUN_DATA_DIR, "multilabel/yeast_test.svm"),
            "Yeast dataset")

/home/buildslave/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py:187: DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation.
DeprecationWarning)

            		Yeast dataset
Machine				Accuracy	Train-time

SGD *without* threshold tuning 	0.339940 	2.076551
SGD *with* threshold tuning 	0.491876 	2.010138
scikit-learn's implementation 	0.497962 	2.348428



### Scene Multi-Label Data [2]¶

In [16]:
print_table(os.path.join(SHOGUN_DATA_DIR, "multilabel/scene_train"),
            os.path.join(SHOGUN_DATA_DIR, "multilabel/scene_test"),
            "Scene dataset")


            		Scene dataset
Machine				Accuracy	Train-time

SGD *without* threshold tuning 	0.548774 	1.711514
SGD *with* threshold tuning 	0.579571 	1.722789
scikit-learn's implementation 	0.576226 	1.029319



As we can see, the accuracy of the machine with threshold tuning is comparable to that of scikit-learn's implementation. A possible explanation: for multi-label classification with scikit-learn, we used the OneVsRestClassifier strategy. This strategy fits one classifier per class and also supports multi-label classification. It is initialized with an estimator, e.g. in our case:


clf = OneVsRestClassifier(SVC(kernel='linear'))

Here the estimator is SVC(kernel="linear"), a support vector machine for classification with a linear kernel. So OneVsRestClassifier trains one estimator per class, and each SVC estimator learns the weights ($w$) as well as the threshold/bias ($b$).

In the Shogun implementation, the structured machines only learn the weights ($w$); there is no threshold or bias. So, to model the threshold, we have to add a constant entry to the joint feature vector.

Thus the machines with a constant entry achieve the same accuracy as the scikit-learn implementation.