Multi-Label Classification with Shogun Machine Learning Toolbox

Abinash Panda (github: abinashpanda)

Thanks to Thoralf Klein for taking the time to help me on this project! ;)

This notebook demonstrates multi-label classification using the structured SVM framework in Shogun. We will be using MultilabelModel for multi-label classification.

We begin with a brief introduction to Multi-Label Structured Prediction [1], followed by the corresponding API in Shogun. We then work through a toy example (for illustration) before moving on to a real one. Finally, we evaluate multi-label classification on well-known datasets [2], showing that Shogun's [3] implementation delivers the same accuracy as scikit-learn with the same or better training time.

Introduction

Multi-Label Structured Prediction

Multi-Label Structured Prediction combines aspects of multi-label prediction and structured output prediction. Structured prediction typically involves an input $\mathbf{x}$ (which itself can be structured) and a structured output $\mathbf{y}$. We are given a training set $\{(x^i, y^i)\}_{i=1,...,n} \subset \mathcal{X} \times \mathbb{P}(\mathcal{Y})$, where $\mathcal{Y}$ is a structured output set of potentially very large size (in this case $\mathcal{Y} = \{y_1, y_2, \ldots, y_q\}$, where $q$ is the total number of possible classes). A joint feature map $\psi(x, y)$ is defined to incorporate structure information into the labels.

The joint feature map $\psi(x, y)$ for MultilabelModel is defined as $\psi(x, y) = x \otimes y$, where $\otimes$ is the tensor product.
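
To make the tensor product concrete, here is a minimal numpy sketch (plain numpy, not the Shogun API): encoding a label set as a binary indicator vector over the $q$ classes, $\psi(x, y)$ is simply the Kronecker product of the two vectors.

import numpy as np

x = np.array([0.5, 2.0])   # an input with 2 features
y = np.array([1, 0, 1])    # indicator vector of the label set {0, 2} over q = 3 classes

# joint feature vector: one block of x-features per class, zeroed for absent classes
psi = np.kron(y, x)
print(psi)                 # -> [0.5, 2.0, 0.0, 0.0, 0.5, 2.0]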

We formulate the prediction as:
$h(x) = \{y \in \mathcal{Y} : f(x, y) > 0\}$

The compatibility function, $f(x, y)$, acts on individual inputs and outputs, as in single-label prediction, but the prediction step consists of collecting all outputs of positive scores instead of finding the outputs of maximal score.
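
A small numpy sketch of this prediction rule (made-up weights, not the Shogun API): with one weight block $w_c$ per class, the compatibility score is $f(x, c) = \langle w_c, x \rangle$, and the prediction collects every class whose score is positive.

import numpy as np

W = np.array([[ 1.0, -0.5],   # hypothetical weight block for class 0
              [-0.8,  0.2],   # hypothetical weight block for class 1
              [ 0.1,  0.3]])  # hypothetical weight block for class 2
x = np.array([0.6, 0.9])

scores = W.dot(x)                     # f(x, c) for every class c
prediction = np.where(scores > 0)[0]  # h(x) = {c : f(x, c) > 0}
print(prediction)                     # -> [0 2]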

Multi-Label Models

In this notebook, we are going to compare the performance of two multi-label models:

  • MultilabelModel model : with a constant entry $0$ in the joint feature vector, so that no bias term is modelled.
  • MultilabelModel model_with_bias : with a constant entry $1$ in the joint feature vector, to model a bias term.

The joint feature vectors are:

  • model $\leftrightarrow \psi(x, y) = [x || 0] \otimes y$.
  • model_with_bias $\leftrightarrow \psi(x, y) = [x || 1] \otimes y$.
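
A one-line expansion shows why this constant entry acts as a learned threshold. Writing the per-class block of the weight vector as $w_c = [v_c || b_c]$ (notation introduced here for illustration),
$f(x, c) = \langle w_c, [x || 1] \rangle = \langle v_c, x \rangle + b_c$
so $b_c$ behaves like a per-class bias, whereas with the constant $0$ the last weight component never contributes and every class boundary is forced through the origin.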

To compare the two models, we first run them on a dataset with binary labels.

Experiment 1 : Binary Label Data

Generation of some synthetic data

First of all, we create some synthetic data for our toy example. We add a static offset to the data so that we can compare the models with and without a threshold.

In []:
from __future__ import print_function

try:
    from sklearn.datasets import make_classification
except ImportError:
    import pip
    pip.main(['install', '--user', 'scikit-learn'])
    from sklearn.datasets import make_classification
    
import numpy as np

X, Y = make_classification(n_samples=1000,
                           n_features=2,
                           n_informative=2,
                           n_redundant=0,
                           n_clusters_per_class=2)

# adding some static offset to the data
X = X + 1

Preparation of data and model

To create a multi-label model in Shogun, we first create an instance of MultilabelModel and initialize it with the features and labels. The labels should be MultilabelSOLabels, initialized with n_labels (the number of examples) and n_classes (the total number of classes); individual labels are then set using the set_sparse_label() method.

In []:
from modshogun import RealFeatures, MultilabelSOLabels, MultilabelModel

def create_features(X, constant):
    # append a constant column (0 or 1) to X and wrap it in RealFeatures
    # (Shogun expects one example per column, hence the transpose)
    features = RealFeatures(
                np.c_[X, constant * np.ones(X.shape[0])].T)

    return features

def create_labels(Y, n_classes):
    try:
        n_samples = Y.shape[0]
    except AttributeError:
        n_samples = len(Y)
        
    labels = MultilabelSOLabels(n_samples, n_classes)
    for i, sparse_label in enumerate(Y):
        try:
            sparse_label = sorted(sparse_label)
        except TypeError:
            sparse_label = [sparse_label]
        labels.set_sparse_label(i, np.array(sparse_label, dtype=np.int32))
    
    return labels

def split_data(X, Y, ratio):
    num_samples = X.shape[0]
    train_samples = int(ratio * num_samples)
    return (X[:train_samples], Y[:train_samples],
            X[train_samples:], Y[train_samples:])
In []:
X_train, Y_train, X_test, Y_test = split_data(X, Y, 0.9)

feats_0 = create_features(X_train, 0)
feats_1 = create_features(X_train, 1)
labels = create_labels(Y_train, 2)

model = MultilabelModel(feats_0, labels)
model_with_bias = MultilabelModel(feats_1, labels)

Training and Evaluation of Structured Machines with/without Threshold

In Shogun, several batch and online solvers have been implemented for SO-learning. Let's train both models using the online solver StochasticSOSVM.

In []:
from modshogun import StochasticSOSVM, DualLibQPBMSOSVM, StructuredAccuracy, LabelsFactory
from time import time

sgd = StochasticSOSVM(model, labels)
sgd_with_bias = StochasticSOSVM(model_with_bias, labels)

start = time()
sgd.train()
print(">>> Time taken for SGD *without* threshold tuning = %f" % (time() - start))
start = time()
sgd_with_bias.train()
print(">>> Time taken for SGD *with* threshold tuning    = %f" % (time() - start))
>>> Time taken for SGD *without* threshold tuning = 0.427978
>>> Time taken for SGD *with* threshold tuning    = 0.330294

Accuracy

For measuring accuracy in multi-label classification, the Jaccard similarity coefficient $\big(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\big)$ is used:
$Accuracy = \frac{1}{p}\sum_{i=1}^{p}\frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|}$
This is available as MultilabelAccuracy for MultilabelLabels and StructuredAccuracy for MultilabelSOLabels.
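
As a quick sanity check of the metric (plain Python, with made-up label sets): if the true label set of an example is $\{0, 2\}$ and the machine predicts $\{0, 1, 2\}$, that example contributes $2/3$ to the average.

true_labels = {0, 2}
predicted = {0, 1, 2}
score = len(true_labels & predicted) / float(len(true_labels | predicted))
print(score)   # -> 0.666...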

In []:
def evaluate_machine(machine,
                     X_test,
                     Y_test,
                     n_classes,
                     bias):
    if bias:
        feats_test = create_features(X_test, 1)
    else:
        feats_test = create_features(X_test, 0)
    
    test_labels = create_labels(Y_test, n_classes)
    
    out_labels = LabelsFactory.to_structured(machine.apply(feats_test))
    evaluator = StructuredAccuracy()
    jaccard_similarity_score = evaluator.evaluate(out_labels, test_labels)
    
    return jaccard_similarity_score 
In []:
print(">>> Accuracy of SGD *without* threshold tuning   = %f " % evaluate_machine(sgd, X_test, Y_test, 2, False))
print(">>> Accuracy of SGD *with* threshold tuning      = %f " %evaluate_machine(sgd_with_bias, X_test, Y_test, 2, True))
>>> Accuracy of SGD *without* threshold tuning   = 0.830000 
>>> Accuracy of SGD *with* threshold tuning      = 0.920000 

Plotting the Data along with the Boundary

In []:
import matplotlib.pyplot as plt
%matplotlib inline

def get_parameters(weights):
    # for a per-class weight block (w0, w1, b), return the slope and intercept
    # of the decision line w0*x + w1*y + b = 0
    return -weights[0]/weights[1], -weights[2]/weights[1]

def scatter_plot(X, y):
    zeros_class = np.where(y == 0)
    ones_class = np.where(y == 1)
    plt.scatter(X[zeros_class, 0], X[zeros_class, 1], c='b', label="Negative Class")
    plt.scatter(X[ones_class, 0], X[ones_class, 1], c='r', label="Positive Class")
    
def plot_hyperplane(machine_0,
                    machine_1,
                    label_0,
                    label_1,
                    title,
                    X, y):
    scatter_plot(X, y)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 1000)
    
    m_0, c_0 = get_parameters(machine_0.get_w()) 
    m_1, c_1 = get_parameters(machine_1.get_w())
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "k--", label=label_0)
    plt.plot(xx, yy_1, "g-", label=label_1)
    
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.grid()
    plt.legend(loc="best")
    plt.title(title)
    plt.show()
In []:
fig = plt.figure(figsize=(10, 10))
plot_hyperplane(sgd, sgd_with_bias,
                "Boundary for machine *without* bias for class 0",
                "Boundary for machine *with* bias for class 0",
                "Binary Classification using SO-SVM with/without threshold tuning",
                X, Y)

As we can see from the above plot, sgd_with_bias produces a better classification boundary. The boundary of the model without threshold tuning passes through the origin, while the boundary of the model with threshold tuning passes through $(1, 1)$ (the constant offset we added earlier).

In []:
from modshogun import SparseMultilabel_obtain_from_generic

def plot_decision_plane(machine,
                        title,
                        X, y, bias):
    plt.figure(figsize=(24, 8))
    plt.suptitle(title)
    plt.subplot(1, 2, 1)
    x_min, x_max = np.min(X[:, 0]) - 0.5, np.max(X[:, 0]) + 0.5
    y_min, y_max = np.min(X[:, 1]) - 0.5, np.max(X[:, 1]) + 0.5
    xx = np.linspace(x_min, x_max, 200)
    yy = np.linspace(y_min, y_max, 200)
    x_mesh, y_mesh = np.meshgrid(xx, yy)

    if bias:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 1)
    else:
        feats = create_features(np.c_[x_mesh.ravel(), y_mesh.ravel()], 0)
    out_labels = machine.apply(feats)
    z = []
    for i in range(out_labels.get_num_labels()):
        label = SparseMultilabel_obtain_from_generic(out_labels.get_label(i)).get_data()
        if label.shape[0] == 1:
            # predicted a single label
            z.append(label[0])
        elif label.shape[0] == 2:
            # predicted both the classes
            z.append(2)
        elif label.shape[0] == 0:
            # predicted none of the class
            z.append(3)
    z = np.array(z)
    z = z.reshape(x_mesh.shape)
    c = plt.pcolor(x_mesh, y_mesh, z, cmap=plt.cm.gist_heat)
    scatter_plot(X, y)
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    plt.colorbar(c)
    plt.title("Decision Surface")
    plt.legend(loc="best")

    plt.subplot(1, 2, 2)
    weights = machine.get_w()
    m_0, c_0 = get_parameters(weights[:3])
    m_1, c_1 = get_parameters(weights[3:])
    yy_0 = m_0 * xx + c_0
    yy_1 = m_1 * xx + c_1
    plt.plot(xx, yy_0, "r--", label="Boundary for class 0")
    plt.plot(xx, yy_1, "g-", label="Boundary for class 1")
    plt.title("Hyper planes for different classes")
    plt.legend(loc="best")
    plt.xlim((x_min, x_max))
    plt.ylim((y_min, y_max))
    
    plt.show()
In []:
plot_decision_plane(sgd, "Model *without* Threshold Tuning", X, Y, False)
plot_decision_plane(sgd_with_bias, "Model *with* Threshold Tuning", X, Y, True)

As we can see from the above plots of the decision surface, the black region corresponds to the negative class (label $0$), whereas the red region corresponds to the positive class (label $1$). In addition, there are some (very small) white and orange regions. The white regions correspond to points not classified into any label, whereas the orange regions correspond to points classified into both labels. These regions exist because the boundaries of the two classes do not overlap exactly (as illustrated above). So there are some regions where both compatibility functions are positive, $f(x, 0) > 0$ and $f(x, 1) > 0$ (both labels predicted), and some regions where both are negative, $f(x, 0) < 0$ and $f(x, 1) < 0$ (no label predicted).

Experiment 2 : Multi-Label Data

Loading of data from LibSVM File

In []:
def load_data(file_name):
    # parse a multi-label LibSVM-style file: comma-separated labels,
    # then index:value feature pairs on every line
    with open(file_name) as input_file:
        lines = input_file.readlines()
    n_samples = len(lines)
    n_features = len(lines[0].split()) - 1
    Y = []
    X = []
    for line in lines:
        data = line.split()
        # list() so the labels can be iterated more than once (Python 3 compatibility)
        Y.append(list(map(int, data[0].split(","))))
        feats = []
        for feat in data[1:]:
            feats.append(float(feat.split(":")[1]))
        X.append(feats)
    X = np.array(X)
    n_classes = max(max(label) for label in Y) + 1
    return X, Y, n_samples, n_features, n_classes
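
For reference, the loader above expects every line to start with the comma-separated (zero-based) label set, followed by index:value feature pairs; a made-up example line would look like:

0,3 1:0.21 2:0.47 3:0.05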

Training and Evaluation of Structured Machines with/without Threshold

In []:
def test_multilabel_data(train_file,
                         test_file):
    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)

    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)

    # create features and labels
    multilabel_feats_0 = create_features(X_train, 0)
    multilabel_feats_1 = create_features(X_train, 1)
    multilabel_labels = create_labels(Y_train, n_classes)

    # create multi-label model
    multilabel_model = MultilabelModel(multilabel_feats_0, multilabel_labels)
    multilabel_model_with_bias = MultilabelModel(multilabel_feats_1, multilabel_labels)
    
    # initializing machines for SO-learning
    multilabel_sgd = StochasticSOSVM(multilabel_model, multilabel_labels)
    multilabel_sgd_with_bias = StochasticSOSVM(multilabel_model_with_bias, multilabel_labels)
    
    start = time()
    multilabel_sgd.train()
    t1 = time() - start
    multilabel_sgd_with_bias.train()
    t2 = time() - start - t1
    
    return (evaluate_machine(multilabel_sgd,
                             X_test, Y_test,
                             n_classes, False), t1,
            evaluate_machine(multilabel_sgd_with_bias,
                             X_test, Y_test,
                             n_classes, True), t2)
            

Comparison with scikit-learn's implementation

In []:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import jaccard_similarity_score
from sklearn.preprocessing import LabelBinarizer

def sklearn_implementation(train_file,
                           test_file):
    label_binarizer = LabelBinarizer()

    X_train, Y_train, n_samples, n_features, n_classes = load_data(train_file)
    X_test, Y_test, n_samples, n_features, n_classes = load_data(test_file)

    clf = OneVsRestClassifier(SVC(kernel='linear'))
    start = time()
    clf.fit(X_train, label_binarizer.fit_transform(Y_train))
    t1 = time() - start
    return (jaccard_similarity_score(label_binarizer.fit_transform(Y_test),
                                     clf.predict(X_test)), t1)
In []:
def print_table(train_file,
                test_file,
                caption):
    acc_0, t1, acc_1, t2 = test_multilabel_data(train_file,
                                                test_file)
    sk_acc, sk_t1 = sklearn_implementation(train_file,
                                           test_file)
    result = '''
            \t\t%s
            Machine\t\t\t\tAccuracy\tTrain-time\n
            SGD *without* threshold tuning \t%f \t%f
            SGD *with* threshold tuning \t%f \t%f
            scikit-learn's implementation \t%f \t%f
           ''' % (caption, acc_0, t1, acc_1, t2,
               sk_acc, sk_t1)
    print(result)
In []:
print_table("../../../data/multilabel/yeast_train.svm",
            "../../../data/multilabel/yeast_test.svm",
            "Yeast dataset")


            		Yeast dataset
            Machine				Accuracy	Train-time

            SGD *without* threshold tuning 	0.339940 	1.701098
            SGD *with* threshold tuning 	0.491876 	1.701820
            scikit-learn's implementation 	0.497962 	3.643684
           


In []:
print_table("../../../data/multilabel/scene_train",
            "../../../data/multilabel/scene_test",
            "Scene dataset")


            		Scene dataset
            Machine				Accuracy	Train-time

            SGD *without* threshold tuning 	0.548774 	1.533526
            SGD *with* threshold tuning 	0.579571 	1.471535
            scikit-learn's implementation 	0.576226 	1.899514
           


As we can see, the accuracy of the machine with threshold tuning is comparable to that of scikit-learn's implementation. A possible explanation: for multi-label classification with scikit-learn we used the OneVsRestClassifier strategy, which fits one classifier per class and also supports multi-label classification. It is initialized with an estimator, e.g. in our case:


clf = OneVsRestClassifier(SVC(kernel='linear'))
the estimator is SVC(kernel="linear"), a support vector machine for classification with a linear kernel. The OneVsRestClassifier therefore trains a number of estimators (one per class), and each SVC estimator learns the weights ($w$) as well as the threshold/bias ($b$).

In the Shogun implementation, the structured machines only learn the weights ($w$); there is no separate threshold or bias. So, to model the threshold, we have to add a constant entry to the joint feature vector.
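
For instance, following the per-class weight layout already used in plot_decision_plane above (one block of n_features + 1 entries per class, the last entry multiplying the constant), the learned thresholds of the toy model could be read off its weight vector; a hypothetical inspection snippet:

w = sgd_with_bias.get_w()                    # toy model: 2 classes, blocks of 2 features + 1 constant entry
biases = [w[c * 3 + 2] for c in range(2)]    # last entry of each per-class block
print(biases)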

Thus the machine with the constant entry achieves essentially the same accuracy as the scikit-learn implementation.

References