K-means Clustering
You’ve just loaded a fresh dataset. It’s still so hot from transformation that your laptop fans haven’t stopped whirring. After performing some basic inspection, your eyes begin to glaze over and your face starts to go numb. Is that your dead grandmother, Maureen? Wait, Maureen isn’t dead, and she isn’t your grandma, she works in accounting. If you want to keep exploring the data, you’re going to need some patterns to hold on to before you pass out: market segments in consumer data, tissues in medical images, communities in social networks, topics in a text corpus. If only you could steady yourself on some underlying subset structure lurking in your data, grouping related samples into categories according to the attribute values they take.
It’s cluster hunting season.
Clusters are the truffles in your dataset, waiting to be found and finely grated over your analysis. And like any good fungus hunter knows, the first task is getting the right hog: one with a keen sense of smell, a lust tempered by perspicacity, a snout for rooting but a maw you can muzzle. The second task is not getting shot in the face by a paranoid French farmer 1. Ok, I may have lost the thread of the analogy there, but the hogs we’re talking about are clustering methods. Very often, the first choice in cluster hogs is the granddaddy of ‘em all — k-means clustering.
Not only is k-means one of the oldest forms of cluster analysis, it is also one of the most widely used. And it’s the first approach to unsupervised learning many budding datamancers encounter. K-means’ enduring popularity both in practice and pedagogy stems from its elegance and simplicity. In my experience, this unfortunately also means that k-means is one of the most frequently misunderstood, poorly explained, and outright abused methods in machine learning. Granted, my personal experience, like anyone’s, is statistically biased — no individual is an arbitrarily random sampler — but I have seen enough incorrect implementations and misleading characterizations to warrant address here. K-means may be elegant in its simplicity, but that doesn’t excuse simplicity in the minds of those who use it.
In this article I aim to impart an appreciation for the nuances of k-means cluster analysis: what it is, how it works, and when it breaks.
One of the major sources of confusion about k-means can be traced to conflating the problem with the algorithm used to solve it.
The K-Means Problem
Let’s go back to our hypothetical dataset. Hopefully your imaginary laptop’s
fans have spun down by now. To make things more concrete, we’ll say that we
have \(n\) data points \(x_{1}, \dots, x_{n}\), each described by \(d\) real-valued features, so that each \(x_{i} \in \mathbb{R}^{d}\).
Now suppose there is some sort of meaningful, useful subset structure lurking in the data, just waiting to be discovered. Before we can find it, we need to decide how we will operationalize the distinguishing features of the subsets; in other words, what measurable characteristics will we use to differentiate one cluster from another?
Means to an End
From a statistical point of view, an obvious choice to characterize subpopulations is the average, or mean, of each subpopulation. Sample means are easy to compute - just add and divide.
Furthermore, depending on the true underlying cluster structure and what we plan to do with it, using only the means to characterize each cluster could be sufficient in a statistical sense - the information bottleneck 2 imposed by using these single parameter cluster summaries could be immaterial to our purposes.
An elegant weapon for a more civilized age. But now we have new issues to address:
- How many means do we need to estimate?
- How are we going to estimate those means?
The ‘k’ in ‘k-means’ answers the first question. We assume we know how many clusters there are a priori. Of course, this assumption is artificial, as in practice if we knew how many clusters there were beforehand we’d already know a lot about the underlying structure of our data.
Estimating k when it is unknown is a separate issue that we’ll come back to later.
In order to deal with the second issue, we’ll use a standard property of the
mean: it is the unique minimizer of the square error to all the sample points.
Formally, for a random sample of \(n\) points \(x_{1}, \dots, x_{n}\), the sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}\) is the unique solution to

$$\underset{\mu}{\operatorname{argmin}} \; \sum_{i=1}^{n} \Vert x_{i} - \mu \Vert^{2}.$$
We want to apply this property of means not to the whole data set, but to
each of our \(k\) clusters. For a cluster \(C_{j}\) with mean \(\mu_{j}\), the quantity we would like to make small is the within cluster sum of square errors

$$\sum_{x \in C_{j}} \Vert x - \mu_{j} \Vert^{2}.$$

The main thing to notice here is that now our sum only runs over points in the cluster \(C_{j}\), not over the whole data set. We now have an objective function for each cluster. Since we don't know the true mean of each cluster, we estimate it with the average of the data points assigned to that cluster.
But this isn’t good enough - in order to estimate a cluster’s mean using the
average of all data points belonging only to that cluster, we need to first know
which data points belong to that cluster. Of course, if we knew which points
belonged to each cluster, we wouldn’t need to do any damn cluster analysis in
the first place! In order to address this dilemma, we incorporate the cluster
assignment information into our objective function: assign points to clusters
so that we minimize the total within cluster sum of square errors over all possible
cluster assignments:

$$\underset{C_{1}, \dots, C_{k}}{\operatorname{argmin}} \; \sum_{j=1}^{k} \sum_{x \in C_{j}} \Vert x - \mu_{j} \Vert^{2}, \tag{1}\label{1}$$

where \(C_{j}\) denotes the \(j\)-th cluster and \(\mu_{j}\) its mean.
Sounds simple enough, right? All we have to do is try out all the different ways
to split our data into \(k\) groups, compute the total within cluster sum of square errors \eqref{1} for each split, and keep whichever split scores lowest.
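To make the objective concrete before we worry about searching over splits, here is a small Python sketch (the function and argument names are mine, not standard) that scores one candidate split of the data:

```python
import numpy as np

def total_within_cluster_ss(X, assignment, k):
    """Total within cluster sum of square errors for one candidate split.

    X          -- (n, d) array of data points
    assignment -- length-n integer array of cluster labels in {0, ..., k-1}
    """
    X, assignment = np.asarray(X), np.asarray(assignment)
    total = 0.0
    for j in range(k):
        members = X[assignment == j]
        if len(members) == 0:
            continue                        # an empty cluster contributes nothing
        centroid = members.mean(axis=0)     # the mean minimizes the square error
        total += ((members - centroid) ** 2).sum()
    return total
```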
The Size of the Problem
Well I hope you’ve got a lot of free time, because the number of possible
partitions is huge. How huge? Mathematically speaking, it makes Dwayne Johnson
look like Kevin Hart - when he was five. The Stirling numbers of the second
kind count the number of ways to partition a set of \(n\) elements into \(k\) nonempty subsets:

$$S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{j} \binom{k}{j} (k - j)^{n}.$$

See those factorials and binomial coefficients? Turns out we're dealing with exponential growth. For fixed \(k\), \(S(n, k)\) grows on the order of \(k^{n} / k!\).
For a concrete example, if you have just 300 data points and want to split them into a mere \(k = 3\) clusters, that's \(S(300, 3) \approx 2 \times 10^{142}\) different clusterings to try out. For comparison, the number of baryons in the observable universe is estimated to be on the order of \(10^{80}\).
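If you want to watch the explosion yourself, the formula above is a couple of lines of Python (the example sizes here are arbitrary):

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind: ways to split n items into k nonempty subsets."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(stirling2(10, 3))               # 9330 candidate clusterings for a 10-point toy set
print(len(str(stirling2(300, 3))))    # 143 -- the count for 300 points and 3 clusters has 143 digits
```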
So, brute force is out. Any algorithm of practical use is going to need to be more sophisticated than arm wrestling with time.
Is the Mean Meaningful?
Implicit in the description of the k-means problem so far is that our data can
be modeled so that each cluster has a mean, and all of our data points are drawn from a mixture of \(k\) component distributions, one component per cluster.
We can say slightly more about the mixture model tacitly assumed by the k-means problem: for each component, computing the mean is meaningful. What I mean is (I’ll stop now, I mean it), we have assumed both the existence and identifiability of each component distribution’s first moment. If a component happened to follow, say, a Cauchy distribution, there would simply be no mean to estimate.
K-Means Algorithms
Since the space of possible solutions is generally exponential in the cluster number and dataset size, all commonly used algorithms applied to the k-means problem are heuristic in nature.
Lloyd’s Algorithm
The standard algorithm used to solve the k-means problem is often just called ‘the k-means algorithm.’ I find this name misleading for two reasons: it promotes conflation of the problem posed with the method of solving it, and there are actually several algorithms that can solve the problem. In order to avoid both these issues, I’m going to use the original name from the computer science and pattern recognition communities: Lloyd’s algorithm 5. Stuart Lloyd of Bell Laboratories originally proposed the procedure in 1957 for the purpose of finding the minimal distortion signal quantization in the context of pulse code modulation 6.
Unless you’re familiar with signal processing, that last sentence probably
sounds like some nonsense J.J. Abrams used as dialogue in a Star Wars movie.
Basically, Lloyd was working on transmitting digital signals over a noisy
channel. The receiver stored only a small, finite set of reference signal levels, and every incoming sample had to be mapped to one of them. Lloyd wanted to choose those levels so as to minimize the expected squared error, the ‘distortion’, introduced by this quantization.
Lloyd’s algorithm is an alternating, iterative procedure that shares structural similarities with expectation maximization algorithms. The algorithm performs a greedy optimization at each step in order to avoid examining every possible partition of our data set. We can thus find an answer to our k-means problem without the necessity of immortality — the trade-off is that the solution produced is only guaranteed to be ‘locally’ optimal in a sense that we’ll make concrete later.
The simplicity of Lloyd’s algorithm is in large part responsible for the ubiquity of k-means in cluster analysis. In its simplest form, Lloyd’s algorithm is as follows:
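(The listing below is a bare-bones sketch of the procedure, using Forgy initialization; \(P_{j}\) denotes the set of points currently assigned to cluster \(j\) and \(\mu_{j}\) its centroid estimate.)

```
 1: procedure Lloyd(x_1, ..., x_n; k)
 2:     # Forgy initialization: select k distinct data points,
 3:     # drawn uniformly at random, to serve as the initial
 4:     # centroid estimates
 5:     (i_1, ..., i_k) <- k distinct indices sampled
 6:                        uniformly at random from {1, ..., n}
 7:     mu_j <- x_{i_j} for j = 1, ..., k
 8:     repeat
 9:         P_j <- {} for j = 1, ..., k              # empty the clusters
10:         # assignment step:
11:         for i = 1 to n do
12:             c_i <- argmin_{1 <= j <= k} || x_i - mu_j ||^2
13:             P_{c_i} <- P_{c_i} ∪ { x_i }
14:         end for
15:         for j = 1 to k do                        # update step
16:             nu_j <- mu_j                         # remember the previous estimate
17:             mu_j <- (1 / |P_j|) * sum_{x in P_j} x
18:         end for
19:     until mu_j = nu_j for all j                  # stop when no centroid moved
20:     return clusters P_1, ..., P_k and centroids mu_1, ..., mu_k
21: end procedure
```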
The pseudocode above focuses on the main idea of the algorithm, and thus certain
details covering edge cases and efficient implementation are left out. In lines
2 through 7, we select as initial centroids \(k\) distinct points drawn uniformly at random from the data set.
The first for loop in lines 11 through 14 is called the assignment step,
and assigns each data point to the cluster whose current centroid estimate gives the smallest square error for that point.
The second for loop in lines 15 through 18 is called the update or estimation step, and consists of computing new estimates for each cluster’s centroid by averaging all the points assigned to that cluster in the immediately preceding assignment step.
The main loop terminates when all centroid estimates are unchanged from one iteration to the next.
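For the practically minded, the same procedure fits in a dozen lines of NumPy. This is a sketch with the same simplifications as the listing above (Forgy initialization, no iteration cap, no handling of empty clusters):

```python
import numpy as np

def lloyd(X, k, rng=None):
    """Minimal Lloyd's algorithm sketch: X is (n, d); returns (labels, centroids)."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # Forgy initialization
    while True:
        # assignment step: squared Euclidean distance to every centroid (no square roots needed)
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # update step: each centroid becomes the mean of the points just assigned to it
        updated = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(updated, centroids):                       # unchanged -> stop
            return labels, centroids
        centroids = updated
```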
Forgy Initialization
The initialization method in Lloyd’s algorithm deserves more attention. In the above procedure, the initial centroids are selected from the input data uniformly at random. This initialization procedure is often attributed to E.W. Forgy 7, and the overall iterative method is sometimes called the Lloyd-Forgy algorithm: for instance, in the kmeans function of the stats package in R. There is no theoretical justification for such an initialization, and it can in fact produce undesirable results.
Ideally, the initial centroids will each be selected from distinct clusters. Such an ideal initialization would produce a much better preliminary representation of the data’s cluster structure, and almost certainly a final clustering that accurately captures the target patterns. Of course, guaranteeing such a representative initialization would require knowing the cluster structure, and we are again left only with heuristic methods.
For a dataset with \(n\) points, Forgy initialization amounts to drawing \(k\) of them uniformly at random, without replacement, and using those points as the initial centroids.
Alternatively, we may determine the initial centroids by:
- Initially assigning each point to a cluster uniformly at random. In other words, for \(1 \leq i \leq n\), select \(r_{i} \sim \text{UNIFORM}\{1, k\}\) and place \(x_{i}\) in \(P_{r_{i}}\).
- Computing each initial centroid as we would during a typical estimation step, i.e., setting \(\mu_{j}^{(0)} = \frac{1}{\vert P_{j}\vert}\sum_{x \in P_{j}} x\).
The above initialization method is often referred to as Random Partition initialization, although there are conflicting claims as to who originated each of these initialization methods 7, 8, 9.
Some passing thought should make apparent that the initial centroids produced by random partition will tend to be near the collective centroid of the entire dataset. Such an initialization can lead to its own problems, and is as arbitrary a starting point as Forgy initialization.
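Both schemes are one-liners in NumPy, which makes the contrast easy to see for yourself; a sketch, assuming a data matrix `X` of shape `(n, d)` and ignoring the unlikely empty-cluster case:

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy: k distinct data points, drawn uniformly at random, become the initial centroids."""
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    """Random Partition: label every point uniformly at random, then average each cluster."""
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(forgy_init(X, 3, rng))             # three actual data points, scattered
print(random_partition_init(X, 3, rng))  # three points huddled near the data's overall mean
```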
Extensive empirical results on real and synthetic data 9 demonstrate that both Forgy and Random Partition initialization are among the worst methods to start Lloyd’s algorithm. For even reasonably small datasets, Forgy initialization tends to vary wildly from one run to the next, and, as discussed above, may underrepresent the cluster structure by selecting initial centroids that are close together or outlying points. Random Partition initialization varies less between runs, but this is an obvious byproduct of the tendency to position initial centroids towards the overall center of the data set. This also means successive runs of Random Partition tend to produce similar initializations. If such central initializations cause Lloyd’s algorithm to converge to a local minimum with poor within cluster variance, then repeated runs using Random Partition are unlikely to produce any noticeable improvement. The idiosyncrasies of a particular data set may make it appear as if Random Partition tends to produce lower within cluster variance than any one Forgy initialization 7, but such a conclusion fails to account for the similar, centrally located initial centroids of each Random Partition initialization. You might just get lucky and Random Partition initialization converges to a good clustering. The same might occur with a Forgy initialization, but with more variability. Both methods are still generally bad.
Since Lloyd’s algorithm is highly sensitive to initial conditions, as discussed below, careful initialization methods can greatly improve results. More principled approaches that address, albeit heuristically, the issue of finding initial centroids more representative of the cluster structure are discussed below.
Properties
For reference, here are some important properties of Lloyd’s algorithm:
- Hard Assignment — although the k-means problem specifies that the data set should be partitioned into disjoint subsets, the assignment step in each iteration of Lloyd’s algorithm does not relax this requirement. Each iteration’s hard assignment influences subsequent update steps, and thus the possible clustering solutions found.
- Batch Processing — all points are assigned to clusters before any centroid update occurs. As a consequence, the result of the algorithm is independent of the order in which the data are stored; for fixed initial centroids, permuting the index labels \(\{1, \dots, n\}\) has no effect on the solution found.
- Deterministic After Initialization — while the initialization method used may be stochastic, once initial centroids have been generated Lloyd’s algorithm is entirely deterministic. Thus, the initial centroids completely determine the final clustering.
- Complexity — time complexity of \(\Theta\left(nkdt\right)\) and space complexity of \(\Theta\left(d(n + k)\right)\), where \(n\) is the number of data points, \(k\) is the number of clusters, \(d\) is the number of features, and \(t\) is the number of iterations.
On data sets that have high measures of clusterability 10, 11, Lloyd’s algorithm converges in an acceptably small number of iterations.
Worst case analysis can be framed as an adversarial game. In this setting,
you are player 1 and will use Lloyd’s algorithm to perform k-means clustering
on some data set; player 2 is some jerk who knows all this and tries to
generate a data set that will require you to perform as many iterations as
possible. A geometric construction in the plane can be used by such a jerk to
cause you to take at least exponentially many
iterations of Lloyd’s algorithm before convergence 12.
Smoothed analysis is a hybrid of worst-case and average-case analysis. Worst-case complexity only focuses on the worst possible input, regardless of how rare such an input may be. Average-case complexity, on the other hand, assumes some probability distribution over all possible inputs, and then considers the expected outcome. Typically, a uniform distribution over inputs is assumed — I suppose this can be seen from a Bayesian viewpoint as an ‘uninformative’ prior, but that just raises the reference class problem. Smoothed complexity avoids the pessimism of worst-case complexity and the arbitrariness of average-case complexity, while retaining positive aspects of both. If worst-case complexity asks, ‘what is the longest possible running time?’, and average-case complexity asks, ‘what is the expected running time?’, then smoothed complexity asks, ‘what is the longest expected running time, given that your adversary can choose any input which will then be perturbed?’ 14
So for smoothed complexity, the game changes: there’s still some jerk trying to
generate the worst possible data for you to cluster, but now the jerk’s input
gets corrupted by Gaussian noise before it gets to you. More concretely, every
data point the jerk hands you is perturbed by independent Gaussian noise with standard deviation \(\sigma\) before you see it, and the expected number of iterations becomes bounded by a polynomial in \(n\) and \(1/\sigma\).
So, Lloyd’s algorithm has a smoothed running time that is polynomial in the size of the data set, which explains the typically fast convergence in practice. Most implementations in popular packages in languages such as R, Python, and MATLAB also typically set a maximum number of iterations, so just in case a jerk did generate your data, you don’t have to play. Alternatively, the smoothed analysis of Lloyd’s algorithm shows that if we add some Gaussian noise to our data set, we might reduce the number of iterations to convergence 12, 13, 15.
The determinism of Lloyd’s algorithm combined with hard assignment and batch processing cause any clustering results obtained to be highly sensitive to initial conditions. The next section addresses these details.
Solution Characteristics
When describing the clusters found by Lloyd’s algorithm, three traits tend to come up repeatedly: locally optimal, spherical, and equally sized.
Lloyd’s algorithm guarantees only locally, and not globally, optimal solutions. Although a nearly universal description of the clusters returned by Lloyd’s algorithm, the use of the analytic notion of local optimality may seem incongruous for a combinatorial problem. We can make this statement about local optimality rigorous by thinking of the problem in the right context.
Recall that the objective from the k-means problem is to find the lowest total within cluster variance:

$$\underset{C_{1}, \dots, C_{k}}{\operatorname{argmin}} \; \sum_{j=1}^{k} \sum_{x \in C_{j}} \Vert x - \mu_{j} \Vert^{2}. \tag{3}\label{3}$$

Our given data set is fixed, so the value of this objective depends only on which of the finitely many partitions we pick; the objective can take on only finitely many values.
Local optimality describes solutions that are better than any other ‘nearby’ solutions, but in order for this idea to make sense we need a well-defined concept of ‘nearness’; specifically, we need to be able to talk about arbitrarily ‘near’ solutions, which is equivalent to being able to move continuously from one solution to another in arbitrarily small steps. I’m talking calculus, and we’ve entered the realm of analysis. As we just saw above, however, the combinatorial nature of the k-means problem only permits discrete jumps in the value of our objective function \eqref{3}. So what does everyone mean when they use local optimality to describe the output of Lloyd’s algorithm?
Well, in order to make sense we’re going to have to get weird. Lloyd’s algorithm estimates a set of \(k\) centroids \(\mu_{1}, \dots, \mu_{k}\), each living in \(\mathbb{R}^{d}\). If we look inside the solution Lloyd’s algorithm gives us, it consists of a set of k different \(d\)-dimensional vectors, which we can stack into a single point \((\mu_{1}, \dots, \mu_{k}) \in \mathbb{R}^{kd}\). Viewed this way, the total within cluster variance becomes an honest function on \(\mathbb{R}^{kd}\): given any candidate stack of centroids, assign each data point to its closest centroid and add up the square errors. Now all of the possible partitions of our data set correspond to a discrete set of possible points in \(\mathbb{R}^{kd}\): the stacks whose coordinates are the per-cluster means of some partition. So every iteration changes where our current stack of centroid estimates sits in \(\mathbb{R}^{kd}\), and never increases the value of this continuous objective. But why should that process end up anywhere special?
Well, this is where the hard assignment, batch processing, and determinism of Lloyd’s algorithm come in. Initialization picks some starting point in \(\mathbb{R}^{kd}\), and from there each assignment-update pair moves that point deterministically, never increasing the objective. What happens is that eventually, Lloyd’s algorithm reaches a point it cannot improve: reassigning any single data point would not lower the objective, and every centroid already equals the mean of the points assigned to it. In a small enough neighborhood of that point in \(\mathbb{R}^{kd}\), no movement of the centroids produces a better value of the objective, and that is precisely the sense in which the solution is ‘locally’ optimal.
The above argument moves the combinatorial optimization procedure used
by Lloyd’s algorithm into the context of gradient descent 16. This gradient
descent formulation also makes the size of the neighborhoods in each iteration
more concrete. It turns out the learning rate in the update of centroid \(\mu_{j}\) is \(\frac{1}{2 \vert P_{j} \vert}\): a single gradient step of that size moves the current estimate exactly onto the mean of the points assigned to cluster \(j\).
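To see where that step size comes from, differentiate cluster \(j\)’s contribution to the objective with respect to its centroid:

$$\nabla_{\mu_{j}} \sum_{x \in P_{j}} \Vert x - \mu_{j} \Vert^{2} = 2\left( \vert P_{j} \vert \, \mu_{j} - \sum_{x \in P_{j}} x \right),$$

so a single gradient step with learning rate \(\frac{1}{2 \vert P_{j} \vert}\) gives

$$\mu_{j} - \frac{1}{2\vert P_{j}\vert} \cdot 2\left( \vert P_{j} \vert\, \mu_{j} - \sum_{x \in P_{j}} x \right) = \frac{1}{\vert P_{j} \vert} \sum_{x \in P_{j}} x,$$

which is exactly the centroid update Lloyd’s algorithm performs.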
Another aspect of the clusters returned by Lloyd’s algorithm is their tendency toward spherical shape. The shape of the clusters is a direct consequence of the k-means objective and the hard assignment used.
For a fixed point \(x\), the assignment step selects the centroid with the smallest square error to \(x\); the set of locations that would be assigned to a given centroid is therefore an intersection of half-spaces (one per competing centroid), a convex region surrounding that centroid.
Obviously, the specific location of the points in the data set will determine the actual shape of the clusters found by Lloyd’s algorithm. But for each cluster, if we take the point in that cluster farthest from its centroid, we can form a sphere containing all points in the cluster.
The last common attribute of clusters resulting from Lloyd’s algorithm we will
consider is the claim that they tend to be of equal size. This description tends
to cause confusion, as the notion of size intended is often left undefined. The
centroids produced by Lloyd’s algorithm define a centroidal Voronoi tessellation:
we can split up \(\mathbb{R}^{d}\) into \(k\) cells, one per centroid, each cell consisting of the region of space closer to its centroid than to any other. If ‘size’ referred to the geometric extent of these cells, the description would be nonsense, since the cells on the outer edge of the tessellation extend forever and have infinite volume.
We can attempt to correct this error by drawing a box around our data set with lengths defined by the smallest and largest coordinates in each dimension. While this forces each Voronoi cell to be compact, it doesn’t capture the real notion of ‘size’ used by Lloyd’s algorithm; besides, the areas of those formerly infinite Voronoi cells are now governed by the outliers in our data.
Some descriptions instead use the number of points in each cluster to define their sizes, but this too does not completely describe the results of Lloyd’s algorithm: it is easy to construct examples of clusters containing very different numbers of points, but which produce lower total within cluster variance than a more equal division of points into clusters.
Combining the variance and the cardinality of the clusters in a certain way,
however, does give us a strict meaning of ‘size’ which can describe the behavior
of Lloyd’s algorithm. For a given data
set, consider what happens when a single point \(x\) is added to a cluster that currently contains \(n_{j}\) points and has centroid \(\mu_{j}\).
Well, if you work out some mildly tedious algebra, you’ll find that after adding
point \(x\) and recomputing the centroid, the cluster’s sum of square errors increases by exactly

$$\frac{n_{j}}{n_{j} + 1} \, \Vert x - \mu_{j} \Vert^{2}.$$

That scale factor of \(\frac{n_{j}}{n_{j} + 1}\) depends only on how many points the cluster already holds, and it creeps toward 1 as \(n_{j}\) grows.
Using this expression for the change in total sum of square errors, we can see
that the ‘size’ meant when describing the clusters returned by Lloyd’s algorithm
is a product of both the number of points in each
cluster and the spread of those points about their centroid, as measured by the cluster’s sum of square errors.
To make things more concrete, let’s compare two
clusters, \(C_{a}\) and \(C_{b}\), competing for a single unassigned point \(x\), and suppose for the moment that the two clusters have equal sums of square errors. In this case, the cardinalities and the distances from \(x\) to the two centroids are all that matter: assigning \(x\) to cluster \(a\) raises the total by \(\frac{n_{a}}{n_{a}+1} \Vert x - \mu_{a} \Vert^{2}\), and likewise for cluster \(b\). If \(x\) happens to lie equally far from both centroids, the total sum of square errors grows least when \(x\) joins the cluster containing fewer points; the smaller cluster, in this sense, is the cheaper home for it.
Of course, if the sums of square errors of our two clusters are not equal, then the behavior of Lloyd’s algorithm is a more complicated interplay between the square errors and cardinalities of the different clusters. The spherical shape of the clusters can provide some insight into this aspect of a cluster’s size. And of course, assigning only one point to one of two clusters is not a typical iteration of Lloyd’s algorithm. But the above ain’t just shootin’ the breeze — that would be a waste of ammunition. It illustrates that the size of a cluster is determined by both the spread of the points from its centroid, as quantified by the sum of square errors, and the number of points in the cluster. In trying to minimize the total sum of square errors, Lloyd’s algorithm will tend to produce clusters of roughly equal size in precisely this sense.
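A quick numerical check of that \(\frac{n_{j}}{n_{j}+1}\) bookkeeping, using two made-up clusters:

```python
import numpy as np

def cluster_ss(points):
    """Sum of square errors of a set of points about their own mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(0)
x = np.array([1.0, 1.0])                        # the point being assigned
for name, n in [("small cluster", 5), ("big cluster", 50)]:
    cluster = rng.normal(size=(n, 2))
    predicted = n / (n + 1) * ((x - cluster.mean(axis=0)) ** 2).sum()
    actual = cluster_ss(np.vstack([cluster, x])) - cluster_ss(cluster)
    print(f"{name}: predicted increase {predicted:.4f}, actual increase {actual:.4f}")
```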
The Measure of a Mistake
We have so far covered some of the most salient properties of the k-means problem and the method most commonly associated with it, Lloyd’s algorithm. Hopefully, you now have a firm grasp on both, as well as the distinction between them. This is a natural point at which to address a mistake that appears fairly common in both the literature and implementations, and which ultimately originates from an ignorance of the topics we’ve just covered.
If you look back at equation \eqref{1}, you’ll see that the k-means problem is one of variance minimization. The objective is to find a partition of our data into k clusters that produces the lowest possible total within cluster variance. Since the means of each cluster are the unique minimizers of the variance, they naturally show up.
The number of candidate partitions is huge, so Lloyd’s algorithm never examines them all; it settles for a heuristic shortcut.
Examining the assignment step, Lloyd’s algorithm can be classified as a greedy
optimization method: each data point is assigned to whichever centroid currently gives it the smallest square error, with no regard for the effect that assignment has on the total objective.
Since every point behaves greedily during the assignment step, some simple algebra shows that minimizing the square error to a centroid is the same thing as minimizing the Euclidean distance. Just take the square root, bada bing bada boom, you’ve got the distance. Since the square root is monotonic, the minimizer of one is the minimizer of the other:

$$\underset{1 \leq j \leq k}{\operatorname{argmin}} \; \Vert x - \mu_{j} \Vert^{2} = \underset{1 \leq j \leq k}{\operatorname{argmin}} \; \Vert x - \mu_{j} \Vert.$$
And this is where the trouble starts. That the assignment step selects the nearest centroid for every point is incidental, a tautology of algebra. When explaining what Lloyd’s algorithm does, more internet posts than I’d like to remember describe the assignment step only as finding the nearest centroid. Just check out any Medium article on k-means. Sure, this is an equivalent way of describing the assignment step, but it is actually a consequence of the specific heuristics Lloyd’s algorithm uses: splitting the estimation task into separate assignment and update steps, and using greedy optimization in assignment. This heuristic works fairly well in minimizing the true k-means objective, the total within cluster variance, but as we’ve already seen it does not guarantee finding the true minimum. Unfortunately, many people conflate the algorithm with the problem, and relegate themselves to a superficial understanding polluted by epiphenomena.
The first issue that arises from this incomprehension commonly shows up when someone tries to roll their own version of Lloyd’s algorithm — they’ll actually perform a square root operation for every point and centroid pair in the assignment step to find the distance. These are garbage operations. Even if your square root implementation conforms to IEEE 754 or 854 floating point standards with exact rounding 20, you’re still introducing an unnecessary operation and losing precision. Those square roots might be inconsequential on toy data sets, but for massive data sets and distributed processing this is more likely to be an issue.
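The fix costs nothing: taking the argmin over squared distances produces exactly the same labels. A quick check:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                   # some data
C = rng.normal(size=(4, 3))                     # some centroids

sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
labels_from_squares = sq.argmin(axis=1)         # assignment via square errors
labels_from_roots = np.sqrt(sq).argmin(axis=1)  # same labels, 800 wasted square roots

assert (labels_from_squares == labels_from_roots).all()
```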
Now if you take that first mistake, pretend it’s insight, and smugly elaborate on
it, you’ll get the second, and much worse, error stemming from this confusion:
swapping arbitrary non-Euclidean metrics in place of the square error in the
assignment step while leaving the update step unchanged. If the last mistake was
garbage, this one is the dump that garbage took. Both the assignment and update
steps of Lloyd’s algorithm try to minimize variance — they each have the
same objective in mind. If instead of the square error, the assignment step
uses, say, the Manhattan (\(\ell_{1}\)) distance, then the two steps are no longer chasing the same objective: the update step still computes arithmetic means, but the mean does not minimize summed \(\ell_{1}\) distances (the component-wise median does), so the steps pull in different directions and the usual convergence argument collapses.
Unfortunately, this perversion of Lloyd’s algorithm seems rampant — again, just check any Medium article on k-means. Such ridiculous methods also show up in the literature 21, 22, 23, 24, as well as in primary implementations in MATLAB and Mathematica. The MATLAB documentation contains an example of an ‘analysis’ using the Manhattan distance. MATLAB’s kmeans function tries to correct for the potential bunkum of swapping in a non-Euclidean metric by performing a ‘corrective’ pass over every data point at the end. Good luck with that. The amap package in R, which according to rdocumentation.org is in the 95th percentile in terms of overall downloads, has a Kmeans implementation that allows arbitrary distance functions, with an inefficient implementation of Lloyd’s algorithm as a bonus. The Kmeans function in amap should never be used; the standard stats package’s kmeans function is a far better implementation, which also uses the Hartigan-Wong algorithm by default.
Listen, sure, you’ll get some results. Hell, if you cap the iteration number, by definition your algorithm will ‘converge.’ But like my grandmammy said, just cause you put a cat in the oven, don’t make it a biscuit. The convergence of Lloyd’s algorithm for the k-means problem is based on both steps operating on the same objective - minimizing cluster variance. If you use a non-Euclidean metric in the assignment step, then this alteration has to be reflected in an appropriate modification of your update step. Now, there are in fact modified versions of Lloyd’s algorithm that do just that. Unfortunately, the term ‘k-means’ is sometimes used to describe these procedures, even though the cluster centers are no longer given by the arithmetic means of the cluster points. But at least these procedures make sense. The mathematical soundness of some such generalizations can be grounded in the theory of Bregman divergences, which I’ll touch on when discussing the connection of k-means algorithms to Gaussian mixture models.
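For instance, if you really want Manhattan distances in the assignment step, the update step has to change with it, because the component-wise median, not the mean, minimizes summed \(\ell_{1}\) distance. A sketch of one such matched pair (this is k-medians, not k-means, and it ignores empty clusters):

```python
import numpy as np

def k_medians_iteration(X, centers):
    """One assignment + update pass in which both steps target the same L1 objective."""
    # assignment: Manhattan (L1) distance to each center
    l1 = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = l1.argmin(axis=1)
    # update: the component-wise median minimizes the summed L1 distance,
    # so the update matches the metric used in the assignment step
    new_centers = np.array([np.median(X[labels == j], axis=0)
                            for j in range(len(centers))])
    return labels, new_centers
```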
Sequential K-Means
As mentioned above, Lloyd’s algorithm is a batch algorithm — only after every data point has been assigned to a cluster do we update our estimates of the centroids. This can lead to large changes in the centroid estimates, and can cause large memory overheads. An alternative approach is to interleave the assignment and update steps: after assigning a point to a cluster, update the centroid estimate immediately.
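A minimal sketch of that interleaving, in the spirit of MacQueen’s variant (the per-cluster counts treat each initial centroid as one observation already absorbed):

```python
import numpy as np

def sequential_kmeans_pass(X, centroids):
    """One pass of sequential (online) k-means: update a centroid as soon as it gains a point."""
    centroids = centroids.astype(float).copy()
    counts = np.ones(len(centroids))                      # each initial centroid counts as one observation
    for x in X:
        j = ((centroids - x) ** 2).sum(axis=1).argmin()   # nearest centroid by square error
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]    # running-mean update
    return centroids
```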
Properties
Solution Characteristics
Hartigan-Wong Algorithm
Neither the classic Lloyd-Forgy algorithm nor the online MacQueen algorithm takes into account the potential degradation of the within cluster variance when a single point is added to a cluster. Specifically, whenever a point is assigned to a cluster, the estimate of the mean will change — and this will cause the cluster variance to change not just because of the added point, but because the square error for all the other cluster points will also change. The Hartigan-Wong algorithm takes this into account and provides a better approximation method to minimizing the total within cluster variance.
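The quantity driving that decision is the change in the total sum of square errors if a point \(x\) were moved from one cluster to another; a sketch, using the same style of \(\frac{n}{n \pm 1}\) bookkeeping as above:

```python
import numpy as np

def transfer_cost(x, mu_a, n_a, mu_b, n_b):
    """Change in total within cluster SSE if x leaves cluster a and joins cluster b.

    Cluster a currently counts x among its n_a points and mu_a is its mean with x included;
    cluster b has n_b points and mean mu_b. A negative value means the move lowers the
    objective, whereas Lloyd's assignment step only compares the two square errors and
    ignores the n/(n-1) and n/(n+1) factors.
    """
    freed = n_a / (n_a - 1) * ((x - mu_a) ** 2).sum()    # SSE recovered by leaving a
    added = n_b / (n_b + 1) * ((x - mu_b) ** 2).sum()    # SSE incurred by joining b
    return added - freed
```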
Properties
Solution Characteristics
Initialization Methods
Data Normalization
Clustering Results
Benchmarks
Cluster Validity
Relationship to Other Methods
Gaussian Mixture Models
Principal Component Analysis
Expectation Maximization
Considerations in Application
Estimating k
Alternative Methods
Summary
- [1] The Dark Side Of the Truffle Trade
- [2] The Information Bottleneck Method
- [3] Stirling Number of the Second Kind
- [4] On Stirling Numbers of the Second Kind
- [5] Clustering Methods - A History of k-Means Algorithms
- [6] Least squares quantization in PCM
- [7] An empirical comparison of four initialization methods for the k-means algorithm
- [8] Cluster analysis for applications
- [9] A comparative study of efficient initialization methods for the k-means clustering algorithm
- [10] The effectiveness of Lloyd-type methods for the k-means problem
- [11] Clusterability: A theoretical study
- [12] k-means requires exponentially many iterations even in the plane
- [13] Smoothed Analysis With Applications in Machine Learning
- [14] How slow is the k-means method?
- [15] Optimization methods of cluster analysis
- [16] Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion
- [17] Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering
- [20] What every computer scientist should know about floating-point arithmetic
- [21] K-means with Three different Distance Metrics
- [22] Effect of different distance measures on the performance of K-means algorithm: an experimental study in Matlab
- [23] The k-means clustering technique: General considerations and implementation in Mathematica
- [24] Performance evaluation of K-means clustering algorithm with various distance metrics