K-means is one of the simplest algorithms for unsupervised learning. It works iteratively to cluster similar values in a dataset of one to many dimensions ($X\in\mathbb{R}^{m \times n}$). The algorithmic steps are simple, relying only on arithmetic means and distances, so it’s easy to understand, but it can also be quite powerful. Because it relies on random sampling to initiate the algorithm it can be quite slow, however, as many replications are needed to get a robust result.
K-means features as an exercise in Coursera’s machine learning course, which I am working through. In this post I produce a simple implementation of the K means algorithm in R, and because I’m trying to improve my package writing, I wrap it up into a simple package called knn.
The problem
K-means can be divided into two steps. In a two-dimensional example, given a dataset $X\in\mathbb{R}^{m \times 2}$, the first step is to calculate the closest centroid for each training example $x^{(i)}$. In my implementation, the centroids $\{1,\dots,K\}$ can be assigned manually, or are automatically generated by selecting $K$ random training examples and setting these as the centroids.
Mathematically, for each training example $x^{(i)}$ we choose the cluster assignment $c^{(i)}$ that minimises the distance to the centroids $\mu_k$:

$$c^{(i)} := \arg\min_{k} \left\lVert x^{(i)} - \mu_k \right\rVert^2$$

i.e. we calculate the Euclidean norm of the vector $x^{(i)} - \mu_k$ for each centroid and keep the smallest.
This is very simply achieved through the Pythagorean theorem, and is already implemented in R with norm().
Generate dummy data
The first step is to generate dummy two-dimensional data which will demonstrate the principles well. It’s expedient to functionalise this.
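Something along these lines would do. This is just a sketch: the function name gen_data() and its defaults are my own choices, not necessarily those used in the post.

```r
# A sketch of a dummy-data generator: n points drawn from a normal
# distribution around each of three centroids. The name gen_data() and
# its defaults are illustrative only.
gen_data <- function(n = 20,
                     centroids = matrix(c(2, 8, 5, 2, 8, 8), ncol = 2),
                     sd = 0.5) {
  out <- do.call(rbind, lapply(seq_len(nrow(centroids)), function(k) {
    cbind(rnorm(n, mean = centroids[k, 1], sd = sd),
          rnorm(n, mean = centroids[k, 2], sd = sd))
  }))
  colnames(out) <- c("x", "y")
  out
}

set.seed(1)
X <- gen_data()
```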
Which gives us a small dataset with some clear clusters to work with.
Calculating norms
Calculating norms can be done with base R using norm().
But the norm() function isn’t too useful in the present case, as we really want to vectorise this process, so I use my own function rowNorms():
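A minimal sketch of what such a function can look like (the rowNorms() shipped in the knn package may differ in detail):

```r
# Row-wise Euclidean norms: returns one norm per row of the matrix x.
# A sketch of rowNorms(); the knn package version may differ.
rowNorms <- function(x) {
  sqrt(rowSums(x^2))
}
```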
rowNorms() works across a matrix and will return a vector of norms, whilst norm() will return the norm of the whole matrix. This function is implemented in the knn package.
Calculate distances
Now that I can produce a vector of Euclidean norms, I can put this into practice by calculating the distance between each point $x^{(i)}$ and the initial centroids. In the first instance I will specify some centroids, although, as noted, the more usual practice is to randomly initiate these from $X$.
So to calculate the distance between some pre-specified centroids and $X$:
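As an illustration, the distance from every point to one pre-specified centroid could be calculated like this (a sketch using the rowNorms() defined above; the centroid values are placeholders):

```r
# Three pre-specified centroids (one per row) in two dimensions;
# the values here are placeholders for illustration.
centroids <- matrix(c(2, 2,
                      8, 8,
                      5, 8),
                    ncol = 2, byrow = TRUE)

# Distance from every row of X to the first centroid.
d1 <- rowNorms(sweep(X, 2, centroids[1, ], "-"))
```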
Rather than repeating this for all three initial centroids, I’ve wrapped this up into a function.
Here knn::find_group() calculates the closest centroid for each training example and returns this as a vector of values $\{1,\dots,K\}$, where $K$ is the number of centroids.
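A sketch of how such a function might be put together (the find_group() in the knn package may be implemented differently):

```r
# For each row of x, return the index (1..K) of the nearest centroid.
# A sketch of find_group(); the knn package version may differ.
find_group <- function(x, centroids) {
  # m x K matrix of distances from every point to every centroid.
  d <- sapply(seq_len(nrow(centroids)), function(k) {
    rowNorms(sweep(x, 2, centroids[k, ], "-"))
  })
  # Index of the smallest distance in each row.
  apply(d, 1, which.min)
}
```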
Calculate centroids
The next step of the process is to calculate new centroids from the groups that were established in the previous step.
Again, this is quite a simple thing to code.
I achieved it by sequentially subsetting $X$ by $k$, and outputting the new centroids in a matrix $Y\in\mathbb{R}^{K \times n}$.
I’ve wrapped this step into a function called centroid_mean().
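The idea can be sketched as follows (the centroid_mean() in the knn package may differ):

```r
# Mean position of the points assigned to each group, returned as a
# K x n matrix of new centroids. A sketch of centroid_mean(); the knn
# package version may differ.
centroid_mean <- function(x, groups) {
  K <- max(groups)
  t(sapply(seq_len(K), function(k) colMeans(x[groups == k, , drop = FALSE])))
}
```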
With the centroids calculated, now is a good chance to plot again, and check the sanity of the new centroids.
I’ve written a function plot_knn() to do this (it would make sense to roll this up into a plot method one day…).
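Something along the lines of this base-graphics sketch would work (the actual plot_knn() may look quite different):

```r
# Points coloured by group, with centroids overlaid as crosses.
# A rough sketch of plot_knn(); the knn package version may differ.
plot_knn <- function(x, groups, centroids) {
  plot(x, col = groups, pch = 19, xlab = "x", ylab = "y")
  points(centroids, col = seq_len(nrow(centroids)), pch = 4, cex = 2, lwd = 3)
}
```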
Start with the original centroids…
And now with the new centroids…
Not bad, so it looks like things are moving in the right direction, and with one further iteration it looks like we are pretty close to the original centroids.
I could keep on going manually, but I have already implemented the whole process in the function knn(). So I’ll try that instead, this time randomly initiating the centroids.
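At its core, this is just the two steps above repeated until the centroids stop moving. A minimal sketch of such a loop (the knn() function in the package has more arguments and a more careful stopping rule; knn_sketch() is my placeholder name):

```r
# Iterate the two k-means steps until the centroids stop moving, or
# until max_iter is reached. A sketch only; the package's knn() differs.
knn_sketch <- function(x, K = 3, max_iter = 10) {
  # Randomly initiate the centroids from K rows of x.
  centroids <- x[sample(nrow(x), K), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    groups <- find_group(x, centroids)
    new_centroids <- centroid_mean(x, groups)
    if (isTRUE(all.equal(unname(new_centroids), unname(centroids)))) break
    centroids <- new_centroids
  }
  list(centroids = centroids, groups = groups)
}
```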
Great, so what are the final centroids?
…which, as you’ll remember, were the starting values about which I generated the normal distributions. Of course, it wouldn’t normally be possible to assess the success of the algorithm so easily; note also that the order of the centroids has changed, although this doesn’t matter.
A harder test
I’ll try again with a more difficult example. Here I am still relying on three normally distributed clusters about the same centroids, but I specify a much greater standard deviation, so it is much more difficult to separate the clusters by eye alone. I’ll specify $m = 100$ so that the algorithm has more data to work with.
Hard to see three distinct clusters… now let’s see what the algorithm makes of it.
Again, I use the knn() function, which takes the leg work out of the programming.
In the example below I also set a seed, which will make the results reproducible. As I noted in the preamble, the solutions from K-means are dependent on the initial centroids, so it is usual practice to repeat the process many times and then keep the solution with the lowest cost (the within-cluster sum of squares).
I’ve not yet implemented this repetition in knn (and probably won’t go that far), so for now I will use set.seed() to demonstrate the ideal and not-so-ideal outcomes.
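For instance (using the placeholder knn_sketch() from above; the seed value here is arbitrary and purely for illustration):

```r
set.seed(1234)  # arbitrary seed, for illustration only
res <- knn_sketch(X, K = 3)
res$centroids
```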
So we’re a little bit out this time after three steps, but not too far. This could probably be improved with a more precise stopping criterion, which at present has a very simplistic implementation.
So what happens if I change the seed?
So the solution given by the algorithm has changed, and this highlights one of the issues with K-means: it can be very dependent on the initial location of centroids.
This is why it is usual to repeat the process many times, then keep the best of the resulting solutions.
I haven’t gone that far with my simple implementation, but the built-in implementation in base R (kmeans()) does include an argument nstart, which runs the algorithm from $n$ random starting sets and returns the best result.
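For example, with base R (the number of clusters and starts here are illustrative):

```r
# Base R k-means with 20 random starts; the solution with the lowest
# total within-cluster sum of squares is returned.
fit <- kmeans(X, centers = 3, nstart = 20)
fit$centers
```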
So how well does kmeans() perform in the current scenario? Well, not vastly different, it turns out.
What about the improvement with additional replications?
To two decimal places, there is only a small difference between the final centroids and those produced by a single repetition using kmeans().
Of course I should note that kmeans() uses a more complicated (and more accurate) algorithm than the simplified implementation I have used in knn.
A more interesting application
One of the ways that K-means can be used is for image compression. Images are just arrays of values which code for intensity. A grayscale image requires just a single matrix, while a colour image typically requires three matrices, one each for red, green, and blue. By clustering similar colour values in an image, we can compress it by reducing the number of colours required to render it. Note that PNG images may also have a fourth channel which codes for alpha, or transparency.
I use the jpeg package to load an image, in an example inspired by both Markus Gesmann and the Coursera machine learning course, but using my own implementation of K-means.
First I load in a holiday snap in glorious full colour…
Now the same image again after clustering the colours into sixteen clusters.
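The pipeline behind this is roughly as follows (a sketch only: the file name is a placeholder, and the knn_sketch() helper from above stands in for the knn package’s own functions):

```r
library(jpeg)

# Read the image: an h x w x 3 array of values in [0, 1]
# (assuming a colour JPEG). "holiday.jpg" is a placeholder file name.
img <- readJPEG("holiday.jpg")

# Reshape to a (h * w) x 3 matrix: one row per pixel, one column per channel.
px <- matrix(img, ncol = 3)

# Cluster the pixel colours into 16 groups, then replace each pixel
# with the centroid of its cluster.
fit <- knn_sketch(px, K = 16, max_iter = 10)
compressed <- fit$centroids[fit$groups, ]

# Back to the original dimensions for plotting or writing out.
img_out <- array(compressed, dim = dim(img))
```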
So it works! And whilst I have not attempted to calculate the resulting compression ratio, the reduction in the number of possible colour values should lead to a pretty considerable reduction in file size.
I’m not going to develop this package any further, because far better implementations are already out there, but it was a good exercise to learn more about K-means, and package writing in R.