K-means

$$K$$-means clustering aims to partition $$n$$ observations into $$k\leq n$$ clusters (sets $$\mathbf{S} = \{S_1, \ldots, S_k\}$$), such that each observation belongs to the cluster with the nearest mean. That mean serves as the prototype of the cluster.

In other words, its objective is to find:

$\argmin_\mathbf{S} \sum_{i=1}^{k}\sum_{\mathbf{x}\in S_i}\left \|\mathbf{x} - \boldsymbol{\mu}_i \right \|^{2}$

where $$\boldsymbol{\mu}_i$$ is the mean of the points in $$S_i$$.
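To make the objective concrete, here is a minimal sketch in plain Python (no Shogun calls) that evaluates the within-cluster sum of squared distances for a given partition. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def squared_distance(x, mu):
    """Squared Euclidean distance ||x - mu||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))


def kmeans_objective(clusters):
    """Sum over clusters S_i of the squared distances to the cluster mean mu_i."""
    total = 0.0
    for cluster in clusters:
        mu = mean(cluster)
        total += sum(squared_distance(x, mu) for x in cluster)
    return total


# Hypothetical toy partition of four points in the plane into k = 2 clusters.
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],  # mean is (0, 1)
    [[4.0, 0.0], [6.0, 0.0]],  # mean is (5, 0)
]
objective = kmeans_objective(clusters)
```

Each point in this toy partition lies at distance 1 from its cluster mean, so the objective is the sum of four unit contributions. K-means searches over partitions for one that makes this quantity as small as possible.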

See Chapter 20 in [Bar12] for a detailed introduction.

Example

Imagine we have files with training and test data. We create CDenseFeatures (here with 64-bit floats, also known as RealFeatures) as follows:

features_train = RealFeatures(f_feats_train)

features_train = RealFeatures(f_feats_train);

RealFeatures features_train = new RealFeatures(f_feats_train);

features_train = Shogun::RealFeatures.new f_feats_train

features_train <- RealFeatures(f_feats_train)

features_train = shogun.RealFeatures(f_feats_train)

RealFeatures features_train = new RealFeatures(f_feats_train);

auto features_train = some<CDenseFeatures<float64_t>>(f_feats_train);


In order to run CKMeans, we need to choose a distance, for example CEuclideanDistance or another sub-class of CDistance. The distance is initialized with the data we want to cluster.

distance = EuclideanDistance(features_train, features_train)

distance = EuclideanDistance(features_train, features_train);

EuclideanDistance distance = new EuclideanDistance(features_train, features_train);

distance = Shogun::EuclideanDistance.new features_train, features_train

distance <- EuclideanDistance(features_train, features_train)

distance = shogun.EuclideanDistance(features_train, features_train)

EuclideanDistance distance = new EuclideanDistance(features_train, features_train);

auto distance = some<CEuclideanDistance>(features_train, features_train);


Once we have chosen a distance, we create an instance of CKMeans. We explicitly set $$k$$, the number of clusters we expect, to 2 and pass it to CKMeans together with the distance. In this example, we apply Lloyd’s method for $$k$$-means clustering.

kmeans = KMeans(2, distance)

kmeans = KMeans(2, distance);

KMeans kmeans = new KMeans(2, distance);

kmeans = Shogun::KMeans.new 2, distance

kmeans <- KMeans(2, distance)

kmeans = shogun.KMeans(2, distance)

KMeans kmeans = new KMeans(2, distance);

auto kmeans = some<CKMeans>(2, distance);
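For readers unfamiliar with Lloyd’s method, the following is a minimal sketch in plain Python of the alternating assign/update loop. It is not Shogun’s implementation: the function names, the random initialization from the data, the stopping rule, and the toy points are all assumptions made for illustration.

```python
import random


def squared_distance(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest(x, centers):
    """Index of the center closest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Alternate between assigning points and recomputing means until stable."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [nearest(x, centers) for x in points]
        if new_labels == labels:
            break  # assignments no longer change, so the means will not either
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = lloyd_kmeans(points, k=2)
```

On well-separated data like this, the loop settles after a few iterations with one center per group. In general Lloyd’s method only finds a local minimum of the objective, so the result can depend on the initialization.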


Then we train the model:

kmeans.train()

kmeans.train();

kmeans.train();

kmeans.train

kmeans$train()

kmeans:train()

kmeans.train();

kmeans->train();


We can extract centers and radius of each cluster:

c = kmeans.get_cluster_centers()
r = kmeans.get_radiuses()

c = kmeans.get_cluster_centers();
r = kmeans.get_radiuses();

DoubleMatrix c = kmeans.get_cluster_centers();
DoubleMatrix r = kmeans.get_radiuses();

c = kmeans.get_cluster_centers
r = kmeans.get_radiuses

c <- kmeans$get_cluster_centers()
r <- kmeans$get_radiuses()

c = kmeans:get_cluster_centers()
r = kmeans:get_radiuses()

double[,] c = kmeans.get_cluster_centers();
double[] r = kmeans.get_radiuses();

auto c = kmeans->get_cluster_centers();
auto r = kmeans->get_radiuses();


CKMeans also supports mini batch $$k$$-means clustering. We can create an instance of CKMeans classifier with mini batch $$k$$-means method by providing the batch size and iteration number.

kmeans_mb = KMeansMiniBatch(2, distance)
kmeans_mb.set_mb_params(4, 1000)

kmeans_mb = KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);

KMeansMiniBatch kmeans_mb = new KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);

kmeans_mb = Shogun::KMeansMiniBatch.new 2, distance
kmeans_mb.set_mb_params 4, 1000

kmeans_mb <- KMeansMiniBatch(2, distance)
kmeans_mb$set_mb_params(4, 1000)

kmeans_mb = shogun.KMeansMiniBatch(2, distance)
kmeans_mb:set_mb_params(4, 1000)

KMeansMiniBatch kmeans_mb = new KMeansMiniBatch(2, distance);
kmeans_mb.set_mb_params(4, 1000);

auto kmeans_mb = some<CKMeansMiniBatch>(2, distance);
kmeans_mb->set_mb_params(4, 1000);
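As a rough illustration of what the batch size and iteration number control, here is a minimal sketch of a mini-batch $$k$$-means update in plain Python, in the style of Sculley's web-scale k-means with a per-center learning rate. Whether Shogun's CKMeansMiniBatch uses exactly this update is an assumption. The function name, the initialization, and the toy data are hypothetical, and the arguments `batch_size=4` and `n_iter=1000` are assumed to play the role of the two values passed to `set_mb_params`.

```python
import random


def squared_distance(x, c):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def minibatch_kmeans(points, k, batch_size, n_iter, seed=0):
    """Update centers from small random batches instead of the full data set."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each center has absorbed so far
    for _ in range(n_iter):
        # Draw a batch uniformly at random, with replacement.
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch first, with the centers held fixed.
        assigned = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in batch
        ]
        # Move each chosen center towards its point with step size 1 / count,
        # which keeps the center equal to the mean of the points absorbed so far.
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers


# Hypothetical toy data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers_mb = minibatch_kmeans(points, k=2, batch_size=4, n_iter=1000)
```

Each iteration touches only `batch_size` points, so the cost per step does not grow with the size of the data set. The price is that the centers are noisy estimates of the cluster means rather than exact ones.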


As with CKMeans above, we then train the model and extract the cluster centers and radiuses.

References

Wikipedia: K-means clustering

Wikipedia: Lloyd’s algorithm

[Bar12] D. Barber. Bayesian reasoning and machine learning. Cambridge University Press, 2012.