*distance metric*) as input and returns the mean of each of the k clusters, along with labels indicating the cluster membership of each observation. Let us construct a simple example to understand how this is done in Shogun using the CKMeans class.
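Before diving into Shogun's API, the core algorithm can be sketched in a few lines of plain NumPy. This is a toy illustration of the assign/update loop with Euclidean distance, not Shogun's implementation; all names here are made up for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy Lloyd's KMeans: X is (n_samples, n_features).
    Returns the k cluster means and a label per observation."""
    rng = np.random.default_rng(seed)
    # random initialization: pick k distinct data points as starting means
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: label each point with its nearest mean (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each mean to the centroid of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            if (labels == j).any():
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels
```

Shogun wraps this loop (and more) behind the KMeans class used below.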

Let us start by creating a toy dataset.

In [1]:

```
from numpy import concatenate, array
from numpy.random import randn
num = 200
d1 = concatenate((randn(1,num),10.*randn(1,num)),0)
d2 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[0.]])
d3 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[0.],[100.]])
d4 = concatenate((randn(1,num),10.*randn(1,num)),0)+array([[10.],[100.]])
rectangle = concatenate((d1,d2,d3,d4),1)
totalPoints = 800
```

In [2]:

```
import matplotlib.pyplot as pyplot
%matplotlib inline
figure,axis = pyplot.subplots(1,1)
axis.plot(rectangle[0], rectangle[1], 'o', color='r', markersize=5)
axis.set_xlim(-5,15)
axis.set_ylim(-50,150)
axis.set_title('Toy data : Rectangle')
pyplot.show()
```

In [3]:

```
from modshogun import *
train_features = RealFeatures(rectangle)
```

In [4]:

```
# number of clusters
k = 2
# distance metric over feature matrix - Euclidean distance
distance = EuclideanDistance(train_features, train_features)
```

Next, we create a KMeans object with our desired inputs/parameters and train:

In [5]:

```
# KMeans object created
kmeans = KMeans(k, distance)
# KMeans training
kmeans.train()
```

Now that training has been done, let us get the cluster centers and the label for each data point:

In [6]:

```
# cluster centers
centers = kmeans.get_cluster_centers()
# Labels for data points
result = kmeans.apply()
```

Finally let us plot the centers and the data points (in different colours for different clusters):

In [7]:

```
def plotResult(title='KMeans Plot'):
    figure,axis = pyplot.subplots(1,1)
    for i in xrange(totalPoints):
        if result[i]==0.0:
            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='g', markersize=3)
        else:
            axis.plot(rectangle[0,i], rectangle[1,i], 'o', color='y', markersize=3)
    axis.plot(centers[0,0], centers[1,0], 'o', color='g', markersize=10)
    axis.plot(centers[0,1], centers[1,1], 'o', color='y', markersize=10)
    axis.set_xlim(-5,15)
    axis.set_ylim(-50,150)
    axis.set_title(title)
    pyplot.show()

plotResult('KMeans Results')
```

**Note:** You might not always get the perfect result. This is an inherent limitation of the KMeans algorithm: it converges to a local optimum that depends on the initialization. In subsequent sections, we will discuss techniques which allow us to counter this.
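One generic remedy, independent of Shogun, is to restart KMeans from several random initializations and keep the run with the smallest within-cluster sum of squared distances (the "inertia"). A minimal plain-NumPy sketch of the idea, with illustrative names:

```python
import numpy as np

def kmeans_once(X, k, rng):
    """One run of Lloyd's KMeans from a random initialization."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(50):
        # assign every point to its nearest center, then recompute centers
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centers[labels]) ** 2).sum()
    return inertia, centers, labels

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Keep the restart with the smallest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])
```

Each restart may land in a different local optimum; picking the lowest-inertia run usually recovers the intended clustering on well-separated data.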

Now that we have worked through a simple KMeans example, it is time to understand certain specifics of the KMeans implementation and the options Shogun provides to its users.

The KMeans algorithm requires that the cluster centers be initialized with some values. Shogun offers 3 ways to initialize the centers:

- Random initialization (default)
- Initialization by hand
- Initialization using KMeans++ algorithm
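For intuition, the KMeans++ seeding rule can be sketched in plain NumPy: the first center is drawn uniformly from the data, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center already chosen. This is an illustration of the idea, not Shogun's implementation:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """KMeans++ seeding sketch: X is (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # first center: uniform over the data points
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # draw the next center proportionally to that squared distance
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Because far-away points are far more likely to be picked, the seeds tend to spread across the true clusters, which is why KMeans++ reduces the bad-initialization problem noted above.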

In [8]:

```
from numpy import array
initial_centers = array([[0.,10.],[50.,50.]])
# initial centers passed
kmeans = KMeans(k, distance, initial_centers)
```

Now, let us get the results by repeating the rest of the steps:

In [9]:

```
# KMeans training
kmeans.train(train_features)
# cluster centers
centers = kmeans.get_cluster_centers()
# Labels for data points
result = kmeans.apply()
# plot the results
plotResult('Hand initialized KMeans Results 1')
```

The other way to initialize centers by hand is as follows:

In [10]:

```
new_initial_centers = array([[5.,5.],[0.,100.]])
# set new initial centers
kmeans.set_initial_centers(new_initial_centers)
```

Let's complete the rest of the code to get results.

In [11]:

```
# KMeans training
kmeans.train(train_features)
# cluster centers
centers = kmeans.get_cluster_centers()
# Labels for data points
result = kmeans.apply()
# plot the results
plotResult('Hand initialized KMeans Results 2')
```

Note the difference that the initial cluster centers can have on the final result.

To initialize the centers using the KMeans++ algorithm instead, pass the flag as *true* during KMeans object creation, as follows:

In [12]:

```
# set flag for using KMeans++
kmeans = KMeans(k, distance, True)
```

The other way to initialize using KMeans++ is as follows:

In [13]:

```
# set KMeans++ flag
kmeans.set_use_kmeanspp(True)
```

Completing the rest of the steps to get the result:

In [14]:

```
# KMeans training
kmeans.train(train_features)
# cluster centers
centers = kmeans.get_cluster_centers()
# Labels for data points
result = kmeans.apply()
# plot the results
plotResult('KMeans with KMeans++ Results')
```

To switch back to random initialization, you may use:

In [15]:

```
#unset KMeans++ flag
kmeans.set_use_kmeanspp(False)
```

Shogun offers 2 training methods for KMeans clustering:

- Lloyd's (batch) training
- mini-batch training

Lloyd's training method is used by Shogun by default, unless the user switches to the mini-batch training method.

In [16]:

```
# set training method to mini-batch
kmeans = KMeans(k, distance, KMM_MINI_BATCH)
```

One can also switch to mini-batch KMeans by making use of the following method:

In [17]:

```
# set training method to mini-batch
kmeans.set_train_method(KMM_MINI_BATCH)
```

Mini-batch KMeans additionally takes a batch size and a number of iterations, which can be set as follows:

In [18]:

```
# set both parameters together: batch size = 2, number of iterations = 100
kmeans.set_mbKMeans_params(2,100)
# OR
# set batch size = 2
kmeans.set_mbKMeans_batch_size(2)
# set number of iterations = 100
kmeans.set_mbKMeans_iter(100)
```
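For intuition, the mini-batch update rule of Sculley [1] can be sketched in plain NumPy: each iteration assigns a small random batch to the nearest centers and nudges only those centers toward the batch points, with a per-center learning rate of 1/count. This is an illustration of the idea, not Shogun's implementation; the `init` argument is added here purely to make the example deterministic:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=10, n_iter=100, init=None, seed=0):
    """Mini-batch KMeans sketch: X is (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.array(init, dtype=float)
    counts = np.zeros(k)  # points seen so far per center
    for _ in range(n_iter):
        batch = X[rng.integers(len(X), size=batch_size)]
        # assign each batch point to its nearest center
        labels = np.linalg.norm(batch[:, None] - centers[None], axis=2).argmin(axis=1)
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate decays over time
            centers[j] = (1 - eta) * centers[j] + eta * x
    return centers
```

Because each step touches only a small batch, this scales to datasets far larger than memory-bound batch Lloyd's, at the cost of a noisier convergence path.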

Completing the code to get results:

In [19]:

```
# KMeans training
kmeans.train(train_features)
# cluster centers
centers = kmeans.get_cluster_centers()
# Labels for data points
result = kmeans.apply()
# plot the results
plotResult('Mini-batch KMeans Results')
```

One can switch back to Lloyd's KMeans in the following way:

In [20]:

```
# set training method to Lloyd's
kmeans.set_train_method(KMM_LLOYD)
```

In this section we see how useful KMeans can be in classifying the different varieties of the Iris plant. For this purpose, we make use of Fisher's Iris dataset, borrowed from the UCI Machine Learning Repository. There are 3 varieties of Iris plants:

- Iris Setosa
- Iris Versicolour
- Iris Virginica

Each sample in the dataset is described by 4 attributes:

- sepal length
- sepal width
- petal length
- petal width

In [21]:

```
f = open('../../../data/uci/iris/iris.data')
features = []
# read data from file, skipping any blank lines
for line in f:
    words = line.rstrip().split(',')
    if len(words) >= 4:
        features.append([float(i) for i in words[0:4]])
f.close()
# create observation matrix
obsmatrix = array(features).T
# plot the data
figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(obsmatrix[2,0:50], obsmatrix[3,0:50], 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(obsmatrix[2,50:100], obsmatrix[3,50:100], 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(obsmatrix[2,100:150], obsmatrix[3,100:150], 'o', color='blue', markersize=5)
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('3 varieties of Iris plants')
pyplot.show()
```

In [22]:

```
def apply_kmeans_iris(data):
    # wrap data into Shogun features
    train_features = RealFeatures(data)
    # number of cluster centers = 3
    k = 3
    # distance metric over features - Euclidean
    distance = EuclideanDistance(train_features, train_features)
    # initialize KMeans object
    kmeans = KMeans(k, distance)
    # use KMeans++ to initialize centers [play around: change it to False and compare results]
    kmeans.set_use_kmeanspp(True)
    # training method is Lloyd's by default [play around: switch to mini-batch by uncommenting the following lines]
    #kmeans.set_train_method(KMM_MINI_BATCH)
    #kmeans.set_mbKMeans_params(20,30)
    # KMeans training
    kmeans.train(train_features)
    # labels for data points
    result = kmeans.apply()
    return result

result = apply_kmeans_iris(obsmatrix)
```

In [23]:

```
# plot the clusters over the original points in 2 dimensions
figure,axis = pyplot.subplots(1,1)
for i in xrange(150):
    if result[i]==0.0:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'o',color='r', markersize=5)
    elif result[i]==1.0:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'o',color='g', markersize=5)
    else:
        axis.plot(obsmatrix[2,i],obsmatrix[3,i],'o',color='b', markersize=5)
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('Iris plants clustered based on attributes')
pyplot.show()
```

In [24]:

```
from numpy import ones, zeros
# first 50 are Iris Setosa labelled 0, next 50 are Iris Versicolour labelled 1 and so on
labels = concatenate((zeros(50),ones(50),2.*ones(50)))
# bind labels assigned to Shogun multiclass labels
ground_truth = MulticlassLabels(array(labels,dtype='float64'))
```

Now we can compute the clustering accuracy, making use of the ClusteringAccuracy class in Shogun:

In [25]:

```
from numpy import nonzero
def analyzeResult(result):
    # Shogun object for clustering accuracy
    AccuracyEval = ClusteringAccuracy()
    # permutes the labels of result (keeping clusters intact) to produce the best match with ground truth
    AccuracyEval.best_map(result, ground_truth)
    # evaluates clustering accuracy
    accuracy = AccuracyEval.evaluate(result, ground_truth)
    # find out which sample points differ from the actual labels (ground truth)
    compare = result.get_labels()-labels
    diff = nonzero(compare)
    return (diff,accuracy)

(diff,accuracy_4d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_4d)
# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)
axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
axis.set_title('Difference')
pyplot.show()
```
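For intuition, what a best-map evaluation does can be sketched in plain NumPy: cluster ids are arbitrary, so we try every relabelling of the predicted ids and report the best agreement with the ground truth. This is a brute-force illustration, not Shogun's code; real implementations typically solve the matching with the Hungarian algorithm:

```python
import numpy as np
from itertools import permutations

def best_map_accuracy(pred, truth):
    """Best accuracy over all relabellings of the predicted cluster ids."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    ids = np.unique(pred)
    best = 0.0
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))  # one candidate relabelling
        relabelled = np.array([mapping[p] for p in pred])
        best = max(best, float(np.mean(relabelled == truth)))
    return best
```

The brute-force search is exponential in the number of clusters, so it is only practical for small k like the 3 Iris classes here.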

*curse of dimensionality*. Dimension reduction therefore becomes an important preprocessing step. Shogun offers a variety of dimension-reduction techniques to choose from. Since our data is not very high-dimensional, PCA is a good choice. We have already seen the accuracy of KMeans when all four dimensions are used; in the following exercise we shall see how the accuracy varies as one chooses lower-dimensional representations of the data.
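The PCA step itself is easy to sketch in plain NumPy, assuming the notebook's feature-by-sample layout (this is an illustration, not Shogun's PCA class): center the data, eigendecompose the covariance matrix, and project onto the leading eigenvectors.

```python
import numpy as np

def pca_project(X, target_dims):
    """PCA sketch: X is (n_features, n_samples); returns the
    (target_dims, n_samples) projection onto the top components."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
    cov = Xc @ Xc.T / (X.shape[1] - 1)       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :target_dims]  # top eigenvectors
    return top.T @ Xc
```

The projection keeps the directions of largest variance, which is why low-dimensional KMeans below can still do well.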

Let us first apply PCA to reduce the training features to 1 dimension.

In [26]:

```
from numpy import dot
def apply_pca_to_data(target_dims):
    train_features = RealFeatures(obsmatrix)
    # subtract the mean from the features
    submean = PruneVarSubMean(False)
    submean.init(train_features)
    submean.apply_to_feature_matrix(train_features)
    # set up PCA with the desired target dimension
    preprocessor = PCA()
    preprocessor.set_target_dim(target_dims)
    preprocessor.init(train_features)
    # project the feature matrix onto the PCA basis
    pca_transform = preprocessor.get_transformation_matrix()
    new_features = dot(pca_transform.T, train_features.get_feature_matrix())
    return new_features

oneD_matrix = apply_pca_to_data(1)
```

Next, let us get an idea of the data in 1-D by plotting it.

In [27]:

```
figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(oneD_matrix[0,0:50], zeros(50), 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(oneD_matrix[0,50:100], zeros(50), 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(oneD_matrix[0,100:150], zeros(50), 'o', color='blue', markersize=5)
axis.set_xlim(-5,5)
axis.set_ylim(-1,1)
axis.set_title('3 varieties of Iris plants')
pyplot.show()
```

Let us now apply KMeans to the 1-D data to get clusters.

In [28]:

```
result = apply_kmeans_iris(oneD_matrix)
```

Now that we have the results, the inevitable step is to check how good these results are.

In [29]:

```
(diff,accuracy_1d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_1d)
# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(oneD_matrix[0,:],zeros(150),'x',color='black', markersize=5)
axis.plot(oneD_matrix[0,diff],zeros(len(diff)),'x',color='r', markersize=7)
axis.set_xlim(-5,5)
axis.set_ylim(-1,1)
axis.set_title('Difference')
pyplot.show()
```

We follow the same steps as above and get the clustering accuracy.

STEP 1 : Apply PCA and plot the data (plotting is optional)

In [30]:

```
twoD_matrix = apply_pca_to_data(2)
figure,axis = pyplot.subplots(1,1)
# First 50 data belong to Iris Setosa, plotted in green
axis.plot(twoD_matrix[0,0:50], twoD_matrix[1,0:50], 'o', color='green', markersize=5)
# Next 50 data belong to Iris Versicolour, plotted in red
axis.plot(twoD_matrix[0,50:100], twoD_matrix[1,50:100], 'o', color='red', markersize=5)
# Last 50 data belong to Iris Virginica, plotted in blue
axis.plot(twoD_matrix[0,100:150], twoD_matrix[1,100:150], 'o', color='blue', markersize=5)
axis.set_title('3 varieties of Iris plants')
pyplot.show()
```

STEP 2 : Apply KMeans to obtain clusters

In [31]:

```
result = apply_kmeans_iris(twoD_matrix)
```

STEP 3: Get the accuracy of the results

In [32]:

```
(diff,accuracy_2d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_2d)
# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(twoD_matrix[0,:],twoD_matrix[1,:],'x',color='black', markersize=5)
axis.plot(twoD_matrix[0,diff],twoD_matrix[1,diff],'x',color='r', markersize=7)
axis.set_title('Difference')
pyplot.show()
```

Again, we follow the same steps, but skip plotting data.

STEP 1: Apply PCA to data

In [33]:

```
threeD_matrix = apply_pca_to_data(3)
```

STEP 2: Apply KMeans to 3-D representation of data

In [34]:

```
result = apply_kmeans_iris(threeD_matrix)
```

STEP 3: Get the accuracy of the results

In [35]:

```
(diff,accuracy_3d) = analyzeResult(result)
print 'Accuracy : ' + str(accuracy_3d)
# plot the difference between ground truth and predicted clusters
figure,axis = pyplot.subplots(1,1)
axis.plot(obsmatrix[2,:],obsmatrix[3,:],'x',color='black', markersize=5)
axis.plot(obsmatrix[2,diff],obsmatrix[3,diff],'x',color='r', markersize=7)
axis.set_title('Difference')
axis.set_xlim(-1,8)
axis.set_ylim(-1,3)
pyplot.show()
```

Finally, let us plot clustering accuracy vs. number of dimensions to consolidate our results.

In [36]:

```
from scipy.interpolate import interp1d
from numpy import linspace
x = array([1, 2, 3, 4])
y = array([accuracy_1d, accuracy_2d, accuracy_3d, accuracy_4d])
f = interp1d(x, y)
xnew = linspace(1,4,10)
pyplot.plot(x,y,'o',xnew,f(xnew),'-')
pyplot.xlim([0,5])
pyplot.xlabel('no. of dims')
pyplot.ylabel('Clustering Accuracy')
pyplot.title('PCA Results')
pyplot.show()
```

[1] D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web, pages 1177–1178. ACM, 2010

[2] Bishop, C. M., & others. (2006). Pattern recognition and machine learning. Springer New York.

[3] Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science