\(k\)-means clustering partitions \(n\) observations into \(k\leq n\) clusters \(\mathbf{S} = \{S_1, \dots, S_k\}\), such that each observation belongs to the cluster with the nearest mean, which serves as the prototype of that cluster.

In other words, it seeks the partition that minimizes the within-cluster sum of squared distances:

\[\argmin_\mathbf{S} \sum_{i=1}^{k}\sum_{\boldsymbol{x}\in S_i}\left \|\boldsymbol{x} - \boldsymbol{\mu}_i \right \|^{2}\]

where \(\boldsymbol{\mu}_i\) is the mean of the points in \(S_i\).

See Chapter 20 in [Bar12] for a detailed introduction.
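
To make the objective concrete, the following minimal sketch evaluates it for a fixed partition in plain Python. It does not use Shogun; the function names `squared_distance`, `cluster_mean` and `kmeans_objective`, and the toy data, are assumptions made for illustration only.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cluster_mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans_objective(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        mu = cluster_mean(points)
        total += sum(squared_distance(x, mu) for x in points)
    return total


# Two hypothetical clusters of 2-D points (illustrative data only).
clusters = [
    [[0.0, 0.0], [2.0, 0.0]],
    [[10.0, 10.0], [10.0, 14.0]],
]
objective = kmeans_objective(clusters)
```

Here the first cluster contributes 1 + 1 and the second 4 + 4, so `objective` evaluates to 10. \(k\)-means searches for the partition that makes this value as small as possible.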


Imagine we have a file with training data. We create CDenseFeatures (here with 64-bit floats, a.k.a. RealFeatures) as follows:

features_train = RealFeatures(f_feats_train)
features_train = RealFeatures(f_feats_train);
RealFeatures features_train = new RealFeatures(f_feats_train);
features_train = Shogun::RealFeatures.new f_feats_train
features_train <- RealFeatures(f_feats_train)
features_train = shogun.RealFeatures(f_feats_train)
RealFeatures features_train = new RealFeatures(f_feats_train);
auto features_train = some<CDenseFeatures<float64_t>>(f_feats_train);

To run CKMeans, we need to choose a distance, for example CEuclideanDistance or another subclass of CDistance. The distance is initialized with the data we want to cluster.

distance = EuclideanDistance(features_train, features_train)
distance = EuclideanDistance(features_train, features_train);
EuclideanDistance distance = new EuclideanDistance(features_train, features_train);
distance = Shogun::EuclideanDistance.new features_train, features_train
distance <- EuclideanDistance(features_train, features_train)
distance = shogun.EuclideanDistance(features_train, features_train)
EuclideanDistance distance = new EuclideanDistance(features_train, features_train);
auto distance = some<CEuclideanDistance>(features_train, features_train);

Once we have chosen a distance, we create an instance of CKMeans. We explicitly set \(k\), the number of clusters we expect, to 2 and pass it to CKMeans together with the distance. In this example, we apply Lloyd’s method for \(k\)-means clustering.

kmeans = KMeans(2, distance)
kmeans = KMeans(2, distance);
KMeans kmeans = new KMeans(2, distance);
kmeans = Shogun::KMeans.new 2, distance
kmeans <- KMeans(2, distance)
kmeans = shogun.KMeans(2, distance)
KMeans kmeans = new KMeans(2, distance);
auto kmeans = some<CKMeans>(2, distance);
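
Lloyd’s method alternates two steps until the assignments stop changing: assign every point to its nearest center, then move every center to the mean of its assigned points. The sketch below shows this in plain Python. It is not Shogun's implementation; the function `lloyd_kmeans`, its parameters, the random initialization from data points and the toy data are all assumptions made for illustration.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assignment


# Two well-separated hypothetical groups of 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, assignment = lloyd_kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other. In general Lloyd’s method only finds a local minimum of the objective, so the result can depend on the initial centers.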

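Then we train the model:

kmeans.train()
kmeans.train();
kmeans.train();
kmeans.train 
kmeans$train()
kmeans:train()
kmeans.train();
kmeans->train();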

We can extract the center and radius of each cluster:

c = kmeans.get_cluster_centers()
r = kmeans.get_radiuses()
c = kmeans.get_cluster_centers();
r = kmeans.get_radiuses();
DoubleMatrix c = kmeans.get_cluster_centers();
DoubleMatrix r = kmeans.get_radiuses();
c = kmeans.get_cluster_centers 
r = kmeans.get_radiuses 
c <- kmeans$get_cluster_centers()
r <- kmeans$get_radiuses()
c = kmeans:get_cluster_centers()
r = kmeans:get_radiuses()
double[,] c = kmeans.get_cluster_centers();
double[] r = kmeans.get_radiuses();
auto c = kmeans->get_cluster_centers();
auto r = kmeans->get_radiuses();

Shogun also supports mini-batch \(k\)-means clustering through CKMeansMiniBatch. We create an instance in the same way as for CKMeans and then set the batch size and the number of iterations.

kmeans_mb = KMeansMiniBatch(2, distance)
kmeans_mb.set_mb_params(4, 1000)
kmeans_mb = KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);
KMeansMiniBatch kmeans_mb = new KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);
kmeans_mb = Shogun::KMeansMiniBatch.new 2, distance
kmeans_mb.set_mb_params 4, 1000
kmeans_mb <- KMeansMiniBatch(2, distance)
kmeans_mb$set_mb_params(4, 1000)
kmeans_mb = shogun.KMeansMiniBatch(2, distance)
kmeans_mb:set_mb_params(4, 1000)
KMeansMiniBatch kmeans_mb = new KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);
auto kmeans_mb = some<CKMeansMiniBatch>(2, distance);
kmeans_mb->set_mb_params(4, 1000);

Then we train the model and extract the centers and radii as described above.
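
To illustrate what the batch size and iteration count control, here is a minimal plain-Python sketch of one common mini-batch scheme, in which each center keeps a running mean of the points assigned to it. This is an assumption about the general technique, not a description of what CKMeansMiniBatch does internally; the function `minibatch_kmeans`, its parameters and the toy data are hypothetical.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def minibatch_kmeans(points, k, batch_size, n_iter, seed=0):
    """Update centers from small random batches instead of the full data set."""
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each center has absorbed so far
    for _ in range(n_iter):
        # Draw a mini-batch uniformly at random (with replacement).
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Find the nearest center for every batch point before moving any center.
        nearest = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate, shrinks over time
            # Move the center towards the point; this keeps a running mean.
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


# Same hypothetical toy data as a full-batch run would use.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
# Batch size 4 and 1000 iterations mirror the values passed to set_mb_params above.
centers_mb = minibatch_kmeans(points, k=2, batch_size=4, n_iter=1000)
```

Each iteration touches only `batch_size` points, so the cost per iteration does not depend on the size of the data set. The price is that the centers are noisy estimates rather than exact cluster means.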


Wikipedia: K-means_clustering

Wikipedia: Lloyd’s_algorithm


[Bar12] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.