=======
K-means
=======
:math:`k`-means clustering aims to partition :math:`n` observations into :math:`k\leq n` clusters :math:`\mathbf{S}=\{S_1,\dots,S_k\}`,
in which each observation belongs to the cluster with the nearest mean. The mean serves as the prototype of its cluster.

In other words, its objective is to find the partition that minimizes the within-cluster sum of squared distances:

.. math::
   \argmin_\mathbf{S} \sum_{i=1}^{k}\sum_{\mathbf{x}\in S_i}\left \|\mathbf{x} - \boldsymbol{\mu}_i  \right \|^{2}

where :math:`\boldsymbol{\mu}_i` is the mean of points in :math:`S_i`.

See Chapter 20 in :cite:`barber2012bayesian` for a detailed introduction.
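
The objective above can be evaluated directly for a given assignment of points to clusters. The following is a minimal NumPy sketch, not Shogun code; the function name ``kmeans_objective`` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def kmeans_objective(X, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]       # points assigned to cluster S_i
        if len(members) == 0:
            continue                   # an empty cluster contributes nothing
        mu = members.mean(axis=0)      # cluster mean mu_i
        total += ((members - mu) ** 2).sum()
    return total

# Toy data (hypothetical): two well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
good = kmeans_objective(X, np.array([0, 0, 1, 1]), k=2)  # groups nearby points
bad = kmeans_objective(X, np.array([0, 1, 0, 1]), k=2)   # mixes distant points
print(good, bad)
```

A partition that groups nearby points yields a smaller objective than one that mixes distant points, which is what the minimization rewards.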

-------
Example
-------
Imagine we have files with training and test data. We create :sgclass:`CDenseFeatures` (here 64-bit floats, aka RealFeatures) as

.. sgexample:: kmeans.sg:create_features

To run :sgclass:`CKMeans`, we need to choose a distance, for example :sgclass:`CEuclideanDistance` or another subclass of :sgclass:`CDistance`. The distance is initialized with the data we want to cluster.

.. sgexample:: kmeans.sg:choose_distance

Once we have chosen a distance, we create an instance of :sgclass:`CKMeans`.
We explicitly set :math:`k`, the number of clusters we expect, to 3 and pass it to :sgclass:`CKMeans`. In this example, we apply Lloyd's method for :math:`k`-means clustering.

.. sgexample:: kmeans.sg:create_instance_lloyd

Then we train the model:

.. sgexample:: kmeans.sg:train_dataset
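
For intuition about what training with Lloyd's method involves, the following is a minimal NumPy sketch of the general technique: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. It is not Shogun's implementation; the names ``assign`` and ``lloyd_kmeans``, the choice of initial centers, and the stopping rule are assumptions made for illustration.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd_kmeans(X, initial_centers, max_iter=100):
    centers = np.array(initial_centers, dtype=float)
    for _ in range(max_iter):
        labels = assign(X, centers)            # assignment step
        new_centers = centers.copy()
        for i in range(len(centers)):          # update step
            members = X[labels == i]
            if len(members) > 0:               # leave empty clusters where they are
                new_centers[i] = members.mean(axis=0)
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return centers, assign(X, centers)

# Toy data (hypothetical): three pairs of points, clustered with k = 3.
X = np.array([[0.0, 0.0], [0.0, 2.0],
              [10.0, 0.0], [10.0, 4.0],
              [20.0, 0.0], [20.0, 6.0]])
centers, labels = lloyd_kmeans(X, initial_centers=X[[0, 2, 4]])
print(centers)
print(labels)
```

Lloyd's method converges to a local minimum of the objective, so the result generally depends on the initial centers.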

We can extract the center and radius of each cluster:

.. sgexample:: kmeans.sg:extract_centers_and_radius


:sgclass:`CKMeans` also supports mini-batch :math:`k`-means clustering.
We can create an instance of :sgclass:`CKMeans` that uses the mini-batch :math:`k`-means method by providing the batch size and the number of iterations.

.. sgexample:: kmeans.sg:create_instance_mb

Then train the model and extract the centers and radii as described above.
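
To illustrate how the batch size and the iteration number enter the mini-batch method, the following NumPy sketch implements a common formulation of mini-batch :math:`k`-means: each iteration draws a random batch, assigns its points to the nearest centers, and nudges each center toward its assigned points with a per-center step size that shrinks as the center sees more data. This is a sketch of the general technique and may differ from what :sgclass:`CKMeans` does internally; the function name ``minibatch_kmeans``, its parameters, and the synthetic data are assumptions.

```python
import numpy as np

def minibatch_kmeans(X, initial_centers, batch_size, n_iter, seed=0):
    """Mini-batch k-means sketch; requires batch_size <= len(X)."""
    rng = np.random.default_rng(seed)
    centers = np.array(initial_centers, dtype=float)
    counts = np.zeros(len(centers))   # number of points each center has absorbed
    for _ in range(n_iter):
        # Draw a random batch without replacement.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign every batch point to its nearest center before updating any center.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]     # per-center step size
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers

# Synthetic data (hypothetical): two tight blobs around (0, 0) and (10, 10).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 0.1, size=(50, 2)),
               data_rng.normal(10.0, 0.1, size=(50, 2))])
centers = minibatch_kmeans(X, initial_centers=[[1.0, 1.0], [9.0, 9.0]],
                           batch_size=10, n_iter=20)
print(centers)
```

Because each iteration touches only ``batch_size`` points, the cost per iteration is independent of the dataset size, at the price of a noisier estimate of the centers.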

----------
References
----------
:wiki:`K-means_clustering`

:wiki:`Lloyd's_algorithm`

.. bibliography:: ../../references.bib
    :filter: docname in docnames