K-means Clustering

You’ve just loaded a fresh dataset. It’s still so hot from transformation that your laptop fans haven’t stopped whirring. After performing some basic inspection, your eyes begin to glaze over and your face starts to go numb. Is that your dead grandmother, Maureen? Wait, Maureen isn’t dead, and she isn’t your grandma, she works in accounting. If you want to keep exploring the data, you’re going to need some patterns to hold on to before you pass out: market segments in consumer data, tissues in medical images, communities in social networks, topics in a text corpus. If only you could steady yourself on some underlying subset structure lurking in your data and group the samples into categories according to the attribute values that related samples share.

It’s cluster hunting season.

Clusters are the truffles in your dataset, waiting to be found and finely grated over your analysis. And like any good fungus hunter knows, the first task is getting the right hog: one with a keen sense of smell, a lust tempered by perspicacity, a snout for rooting but a maw you can muzzle. The second task is not getting shot in the face by a paranoid French farmer 1. Ok, I may have lost the thread of the analogy there, but the hogs we’re talking about are clustering methods. Very often, the first choice in cluster hogs is the granddaddy of ‘em all — k-means clustering.

Not only is k-means one of the oldest forms of cluster analysis, it is also one of the most widely used. And it’s the first approach to unsupervised learning many budding datamancers encounter. K-means’ enduring popularity both in practice and pedagogy stems from its elegance and simplicity. In my experience, this unfortunately also means that k-means is one of the most frequently misunderstood, poorly explained, and outright abused methods in machine learning. Granted, my personal experience, like anyone’s, is statistically biased — no individual is an arbitrarily random sampler — but I have seen enough incorrect implementations and misleading characterizations to warrant address here. K-means may be elegant in its simplicity, but that doesn’t excuse simplicity in the minds of those who use it.

In this article I aim to impart an appreciation for the nuances of k-means cluster analysis: what it is, how it works, and when it breaks.

One of the major sources of confusion about k-means can be traced to conflating the problem with the algorithm used to solve it.

The K-Means Problem

Let’s go back to our hypothetical dataset. Hopefully your imaginary laptop’s fans have spun down by now. To make things more concrete, we’ll say that we have \(n\) data points, each of which has \(d\) features, and each feature is a real number; more succinctly, each data point \(x_{i}\) is a \(d\)-dimensional real vector for \(1 \leq i \leq n\).

Now suppose there is some sort of meaningful, useful subset structure lurking in the data, just waiting to be discovered. Before we can find it, we need to decide how we will operationalize the distinguishing features of the subsets; in other words, what measurable characteristics will we use to differentiate one cluster from another?

Means to an End

From a statistical point of view, an obvious choice to characterize subpopulations is the average, or mean, of each subpopulation. Sample means are easy to compute - just add and divide.

Furthermore, depending on the true underlying cluster structure and what we plan to do with it, using only the means to characterize each cluster could be sufficient in a statistical sense - the information bottleneck 2 imposed by using these single parameter cluster summaries could be immaterial to our purposes.

An elegant weapon for a more civilized age. But now we have new issues to address:

1. How many clusters are there in the first place?
2. How do we decide which data points belong with which mean, so that each cluster’s mean can actually be computed?

The ‘k’ in ‘k-means’ answers the first question. We assume we know how many clusters there are a priori. Of course, this assumption is artificial, as in practice if we knew how many clusters there were beforehand we’d already know a lot about the underlying structure of our data.

Estimating k when it is unknown is a separate issue that we’ll come back to later.

In order to deal with the second issue, we’ll use a standard property of the mean: it is the unique minimizer of the square error to all the sample points. Formally, for a random sample of \(n\) independent and identically distributed observations \( \{X_{i} \}_{i = 1}^{n}\), the constant (vector) that uniquely minimizes the expected sum of the squared errors to each random observation is the mean \(\mu\):

$$ \DeclareMathOperator*{\argmin}{arg\,min} \begin{align} \mu = \argmin_{c \in \mathbb{R}^{d}} E\bigg[\sum_{i = 1}^{n}\vert\vert X_{i} - c \vert\vert_{2}^{2}\bigg] \end{align} $$

We want to apply this property of means not to the whole data set, but to each of our \(k\) clusters individually. So for the \(j^{th}\) cluster \(P_{j}\), we want to use just the mean \(\mu_{j}\) of that cluster.

$$ \mu_{j} = \argmin_{c \in \mathbb{R}^{d}} E\bigg[\sum_{X \in P_{j}}\vert\vert X - c \vert\vert_{2}^{2}\bigg] $$

The main thing to notice here is that now our sum only runs over points in the \(j^{th}\) cluster \(P_{j}\) rather than all \(n\) points.

We now have an objective function for each cluster. Since we don’t know the true mean \(\mu_{j}\) for each cluster \(P_{j}\), we have to estimate it using our data. If your statistics skills can pay the bills, you know that the sample average \(\bar{x}\) is an unbiased estimator of the population mean and that, mutatis mutandis, \(\bar{x}_{j}\) minimizes the sum of square errors to all realized sample points in the \(j^{th}\) cluster. So we just need to take the average of all the points in cluster \(P_{j}\), and we’ll have our estimate for \(\mu_{j}\):

$$ \bar{x}_{j} = \argmin_{c \in \mathbb{R}^{d}} \sum_{x \in P_{j}}\vert\vert x - c \vert\vert_{2}^{2} $$ $$ \text{where}\ \ \bar{x}_{j} = \frac{1}{|P_{j}|}\sum_{x \in P_{j}}x. $$
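If you want a quick numerical sanity check of this property, here is a small NumPy sketch; the toy cluster and the helper name sse are invented for illustration. The sum of square errors evaluated at the sample average never exceeds the value at any other candidate center.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(50, 2))  # toy stand-in for one cluster P_j

def sse(points, c):
    """Sum of squared Euclidean errors from each point to a candidate center c."""
    return np.sum((points - c) ** 2)

x_bar = cluster.mean(axis=0)          # the sample average of the cluster
best = sse(cluster, x_bar)            # SSE at the sample mean
for _ in range(5):
    c = x_bar + rng.normal(scale=0.1, size=2)   # some other nearby candidate center
    print(sse(cluster, c) >= best)              # prints True every time
```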

But this isn’t good enough - in order to estimate a cluster’s mean using the average of all data points belonging only to that cluster, we need to first know which data points belong to that cluster. Of course, if we knew which points belonged to each cluster, we wouldn’t need to do any damn cluster analysis in the first place! In order to address this dilemma, we incorporate the cluster assignment information into our objective function: assign points to clusters so that we minimize the total within cluster sum of square errors over all possible cluster assignments. That is, out of the set \(\mathcal{P}\) of all partitions of our data into \(k\) non-empty subsets \(\mathbf{P} = \{P_{1}, P_{2}, \dots, P_{k}\}\), we want to find a particular partition \(\mathbf{P}_{0}\) that minimizes the overall sum of squared errors,

$$ \mathbf{P}_{0} = \argmin_{\mathbf{P}\in\mathcal{P}}\sum_{j = 1}^{k}\sum_{x \in P_{j}}\vert\vert x - \bar{x}_{j}\vert\vert_{2}^{2}. \tag{1}\label{1} $$

Sounds simple enough, right? All we have to do is try out all the different ways to split our data into \(k\) clusters, and find one that gives us the smallest total when we add up the sum of square errors within each cluster.
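To make the objective in \eqref{1} concrete, here is a hedged NumPy sketch of exactly that brute-force search on a toy dataset. The helper names are invented, and it only finishes because \(n\) is tiny.

```python
import numpy as np
from itertools import product

def within_cluster_sse(X, labels, k):
    """Total within cluster sum of square errors for a given assignment (the objective in (1))."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) == 0:
            return np.inf          # only partitions into k non-empty subsets count
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

def brute_force_kmeans(X, k):
    """Score every assignment of n points to k clusters -- feasible only for toy-sized n."""
    best_labels, best_cost = None, np.inf
    for assignment in product(range(k), repeat=len(X)):   # k ** n candidate assignments
        labels = np.array(assignment)
        cost = within_cluster_sse(X, labels, k)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (4, 2)), rng.normal(3, 0.3, (4, 2))])  # n = 8 toy points
print(brute_force_kmeans(X, k=2))   # 2 ** 8 = 256 assignments is fine; realistic n is not
```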

The Size of the Problem

Well, I hope you’ve got a lot of free time, because the number of possible partitions is huge. How huge? Mathematically speaking, it makes Dwayne Johnson look like Kevin Hart - when he was five. The Stirling numbers of the second kind count the number of ways to partition a set of \(n\) objects into \(k\) non-empty subsets. For a given \(n\) and \(k\), the corresponding Stirling number of the second kind, denoted by \({n \brace k}\), is given by the formula 3

$$ {n \brace k} = \frac{1}{k!}\sum_{i=0}^{k}(-1)^{i}\binom{k}{i}(k - i)^{n}. \tag{2}\label{2} $$

See those factorials and binomial coefficients? Turns out we’re dealing with exponential growth. For \( n \geq 2\) and \( 1 \leq k \leq n - 1 \), we have the following lower and upper bounds on these Stirling numbers 4:

$$ \frac{1}{2}(k^2 + k + 2)k^{n-k-1} - 1 \leq {n \brace k} \leq \frac{1}{2}\binom{n}{k}k^{n-k}. $$

For a concrete example, if you have \(n = 100\) data points that you want to split into \(k = 3 \) clusters, that means you have

$$ {100 \brace 3} \approx 8.58 \times 10^{46} $$

different clusterings to try out. It’s estimated that there are on the order of \( 10^{10} \) neurons in the average human cerebral cortex. Double the size of the dataset, and now you’ve got

$$ {200 \brace 3} \approx 4.426 \times 10^{94} $$

cluster configurations to search through. The number of baryons in the observable universe is estimated to be on the order of \(10^{80}\). Forget watching paint dry, you may as well try to observe proton decay.
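You don’t have to take my word for the arithmetic. Formula \eqref{2} is easy to evaluate exactly with Python’s arbitrary precision integers; the helper name stirling2 is just for illustration.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind via the explicit formula (2), in exact integer arithmetic."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

print(f"{stirling2(100, 3):.3e}")   # ~8.58e+46
print(f"{stirling2(200, 3):.3e}")   # ~4.43e+94
```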

So, brute force is out. Any algorithm of practical use is going to need to be more sophisticated than arm wrestling with time.

Is the Mean Meaningful?

Implicit in the description of the k-means problem so far is that our data can be modeled so that each cluster has a mean, and all \(k\) of the means are distinct. In probabilistic terms, we have assumed that a mixture model reasonably describes our data, with a one-to-one correspondence between mixture components and clusters. In reality, our data might have been generated by either a deterministic or a stochastic process. Without delving into the thorny epistemological issues at the bottom of science itself and human experience generally, I’m going to suppose for the rest of this article that either a mixture model actually generated our data, or that a mixture model describes the cluster structure of our data arbitrarily well even though the true generating process is of some other type. If this assumption were violated by our data set, then the k-means model would have no relevance anyway.

We can say slightly more about the mixture model tacitly assumed by the k-means problem: for each component, computing the mean is meaningful. What I mean is (I’ll stop now, I mean it), we have assumed both the existence and identifiability of each component distribution’s first moment.

K-Means Algorithms

Since the space of possible solutions is generally exponential in the cluster number and dataset size, all commonly used algorithms applied to the k-means problem are heuristic in nature.

Lloyd’s Algorithm

The standard algorithm used to solve the k-means problem is often just called ‘the k-means algorithm.’ I find this name misleading for two reasons: it promotes conflation of the problem posed with the method of solving it, and there are actually several algorithms that can solve the problem. In order to avoid both these issues, I’m going to use the original name from the computer science and pattern recognition communities: Lloyd’s algorithm 5. Stuart Lloyd of Bell Laboratories originally proposed the procedure in 1957 for the purpose of finding the minimal distortion signal quantization in the context of pulse code modulation 6.

Unless you’re familiar with signal processing, that last sentence probably sounds like some nonsense J.J. Abrams used as dialogue in a Star Wars movie. Basically, Lloyd was working on transmitting digital signals over a noisy channel. The receiver stored \(k\) representatives, and each transmitted sample, corrupted by noise over the channel, had to be assigned to one of the \(k\) representatives. Lloyd offered a solution using the specific \(k\) representatives that minimized the distortion between the original signal and the version reconstructed by the receiver. Distortion in this case was measured by the power in the difference between the signals — and this power is just the average square of the noise. So Lloyd needed to minimize square error, and historically, we can see why knowing the value of \(k\) in advance wasn’t so outlandish in the context of signal recovery.

Lloyd’s algorithm is an alternating, iterative procedure that shares structural similarities with expectation maximization algorithms. The algorithm performs a greedy optimization at each step in order to avoid examining every possible partition of our data set. We can thus find an answer to our k-means problem without the necessity of immortality — the trade-off is that the solution produced is only guaranteed to be ‘locally’ optimal in a sense that we’ll make concrete later.

The simplicity of Lloyd’s algorithm is in large part responsible for the ubiquity of k-means in cluster analysis. In its simplest form, Lloyd’s algorithm is as follows:
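Below is a minimal NumPy sketch of the procedure; it keeps only the essential steps and glosses over edge cases such as a cluster losing all of its points (the function name lloyd is just for illustration).

```python
import numpy as np

def lloyd(X, k, seed=None):
    """Minimal sketch of Lloyd's algorithm. X is an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Initialization: pick k distinct data points uniformly at random as the starting centroids.
    # Sampling without replacement plays the role of rejection sampling for distinctness.
    centroids = X[rng.choice(n, size=k, replace=False)].astype(float)

    while True:
        # Assignment step: give each point the label of the centroid with the smallest square error.
        sq_err = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        labels = sq_err.argmin(axis=1)

        # Update step: re-estimate each centroid as the average of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Terminate when no centroid estimate changes from one iteration to the next.
        if np.allclose(new_centroids, centroids):
            return new_centroids, labels
        centroids = new_centroids
```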

The sketch above focuses on the main idea of the algorithm, and thus certain details covering edge cases and efficient implementation are left out. In the initialization step, we select as initial centroids \(k\) data points uniformly at random. The use of the uniform distribution should be augmented by a procedure that ensures each data point selected as an initial centroid is distinct. Distinct initial centroids can be guaranteed by a standard rejection sampling scheme. The rest of the algorithm is a main loop that alternates between two main steps.

The first of these is called the assignment step, and assigns each data point \(x_{i}\) to the cluster with the centroid that produces the smallest square error. The assignments are indicated by the values of the \(r_{i}\) (the labels in the sketch), where we have \(\forall\ i, r_{i} \in \{1, \dots, k\}\).

The second is called the update or estimation step, and consists of computing new estimates for each cluster’s centroid by averaging all the points assigned to that cluster in the immediately preceding assignment step.

The main loop terminates when all centroid estimates are unchanged from one iteration to the next.

Forgy Initialization

The initialization method in Lloyd’s algorithm deserves more attention. In the above procedure, the initial centroids are selected from the input data uniformly at random. This initialization procedure is often attributed to E.W. Forgy 7, and the overall iterative method is sometimes called the Lloyd-Forgy algorithm: for instance, in the kmeans function of the stats package in R. There is no theoretical justification for such an initialization, and it can in fact produce undesirable results.

Ideally, the initial centroids will each be selected from distinct clusters. Such an ideal initialization would produce a much better preliminary representation of the data’s cluster structure, and a much better chance of a final clustering that accurately captures the target patterns. Of course, guaranteeing such a representative initialization would require knowing the cluster structure, and we are again left only with heuristic methods.

For a dataset with \(n \gg k\), selecting the initial centroids independently may cause some clusters to be unrepresented. Because Lloyd’s algorithm is only guaranteed to find a local optimum, such a poor starting representation produced by Forgy initialization can significantly degrade performance.

Alternatively, we may determine the initial centroids by:

1. assigning every data point to one of the \(k\) clusters uniformly at random, and then
2. taking the initial centroids to be the means of these randomly assembled clusters.

The above initialization method is often referred to as Random Partition initialization, although there are conflicting claims as to who originated each of these initialization methods 789.

Some passing thought should make apparent that the initial centroids produced by random partition will tend to be near the collective centroid of the entire dataset. Such an initialization can lead to its own problems, and is as arbitrary a starting point as Forgy initialization.
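Here is a hedged sketch of both initializations side by side, with invented helper names and toy data. Run it a few times and you should see the Forgy centroids scatter from run to run while the Random Partition centroids huddle around the grand mean of the data.

```python
import numpy as np

def forgy_init(X, k, rng):
    """Forgy initialization: k distinct data points chosen uniformly at random."""
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    """Random Partition initialization: assign every point to a random cluster, then average."""
    labels = rng.integers(k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng()
X = np.vstack([rng.normal(m, 0.4, (200, 2)) for m in ([0, 0], [5, 0], [0, 5])])  # three toy clusters
print(forgy_init(X, 3, rng))             # can land anywhere, including three points from one cluster
print(random_partition_init(X, 3, rng))  # all three centroids sit near the grand mean of X
```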

Extensive empirical results on real and synthetic data 9 demonstrate that both Forgy and Random Partition initialization are among the worst methods to start Lloyd’s algorithm. For even reasonably small datasets, Forgy initialization tends to vary wildly from one run to the next, and, as discussed above, may underrepresent the cluster structure by selecting initial centroids that are close together or outlying points. Random Partition initialization varies less between runs, but this is an obvious byproduct of the tendency to position initial centroids towards the overall center of the data set. This also means successive runs of Random Partition tend to produce similar initializations. If such central initializations cause Lloyd’s algorithm to converge to a local minimum with poor within cluster variance, then repeated runs using Random Partition are unlikely to produce any noticeable improvement. The idiosyncrasies of a particular data set may make it appear as if Random Partition tends to produce lower within cluster variance than any one Forgy initialization 7, but such a conclusion fails to account for the similar, centrally located initial centroids of each Random Partition initialization. You might just get lucky, and Random Partition initialization might converge to a good clustering. The same might occur with a Forgy initialization, but with more variability. Both methods are still generally bad.

Since Lloyd’s algorithm is highly sensitive to initial conditions, careful initialization can greatly improve results. More principled approaches that address, albeit heuristically, the issue of finding initial centroids more representative of the cluster structure are discussed below.

Properties

For reference, here are some important properties of Lloyd’s algorithm:

1. It is deterministic: once the initial centroids are fixed, every subsequent assignment and update is completely determined.
2. It uses hard assignment: at every iteration, each data point belongs to exactly one cluster.
3. It is a batch algorithm: all points are reassigned before any centroid estimate is updated.
4. Each iteration requires \(O(nkd)\) operations, for a total cost of \(O(nkdt)\) over \(t\) iterations.
5. It terminates in a finite number of iterations, but only at a locally optimal solution.

On data sets that have high measures of clusterability 10 11, Lloyd’s algorithm converges in an acceptably small number of iterations \(t\). Empirical results suggest that convergence typically does occur in relatively few iterations. Even for dimensions as low as \(d = 2\), however, data sets can be constructed that demonstrate in the worst case \(t = 2^{\Omega(n)}\) 12. The apparent discrepancy between these empirical and theoretical results is reconciled by a smoothed analysis of Lloyd’s algorithm 13.

Worst case analysis can be framed as an adversarial game. In this setting, you are player 1 and will use Lloyd’s algorithm to perform k-means clustering on some data set; player 2 is some jerk who knows all this and tries to generate a data set that will require you to perform as many iterations as possible. A geometric construction in the plane can be used by such a jerk to cause you to take at least exponentially many iterations \(t = 2^{\Omega(n)}\) 12. This planar construction can be extended to higher dimensions, so you’re not winning this game by using bigger pieces.

Smoothed analysis is a hybrid of worst-case and average-case analysis. Worst-case complexity only focuses on the worst possible input, regardless of how rare such an input may be. Average-case complexity, on the other hand, assumes some probability distribution over all possible inputs, and then considers the expected outcome. Typically, a uniform distribution over inputs is assumed — I suppose this can be seen from a Bayesian viewpoint as an ‘uninformative’ prior, but that just raises the reference class problem. Smoothed complexity avoids the pessimism of worst-case complexity and the arbitrariness of average-case complexity, while retaining positive aspects of both. If worst-case complexity asks, ‘what is the longest possible running time?’, and average-case complexity asks, ‘what is the expected running time?’, then smoothed complexity asks, ‘what is the longest expected running time, given that your adversary can choose any input which will then be perturbed?’14

So for smoothed complexity, the game changes: there’s still some jerk trying to generate the worst possible data for you to cluster, but now the jerk’s input gets corrupted by Gaussian noise before it gets to you. More concretely, every data point \(x_{i} \in \mathbb{R}^{d}\) gets independently perturbed by a \(d\)-dimensional Gaussian random vector \(\mathcal{N}(\mathbf{0}, \sigma^{2}\textbf{I}_{d})\) before you see it. Using smoothed analysis on Lloyd’s algorithm, it turns out that the expected number of iterations is13:

$$ E[t] = O(\frac{n^{34}\log^{4}(n)k^{34}d^{8}}{\sigma^{6}}). $$

So, Lloyd’s algorithm has a smoothed running time that is polynomial in the size of the data set, which explains the typically fast convergence in practice. Most implementations in popular packages in languages such as R, Python, and MATLAB also typically set a maximum number of iterations, so just in case a jerk did generate your data, you don’t have to play. Alternatively, the smoothed analysis of Lloyd’s algorithm shows that if we add some Gaussian noise to our data set, we might reduce the number of iterations to convergence 12 13 15.
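If you want to see what that perturbation looks like in code, here is a rough sketch. It leans on scikit-learn’s KMeans with init='random' and n_init=1 as a stand-in for a single run of Lloyd’s algorithm; on benign data like this the iteration counts will usually be unremarkable either way, since the adversarial constructions are what the noise is meant to break.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 2))          # stand-in for whatever data the jerk handed you
sigma = 0.01
X_perturbed = X + rng.normal(scale=sigma, size=X.shape)   # the Gaussian perturbation assumed by smoothed analysis

for name, data in [("original", X), ("perturbed", X_perturbed)]:
    km = KMeans(n_clusters=3, init="random", n_init=1, max_iter=1000, random_state=0).fit(data)
    print(name, km.n_iter_)              # iterations to convergence for this single run
```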

The determinism of Lloyd’s algorithm combined with hard assignment and batch processing cause any clustering results obtained to be highly sensitive to initial conditions. The next section addresses these details.

Solution Characteristics

When describing the clusters found by Lloyd’s algorithm, three traits tend to come up repeatedly: locally optimal, spherical, and equally sized.

Lloyd’s algorithm guarantees only locally, and not globally, optimal solutions. Although a nearly universal description of the clusters returned by Lloyd’s algorithm, the use of the analytic notion of local optimality may seem incongruous for a combinatorial problem. We can make this statement about local optimality rigorous by thinking of the problem in the right context.

Recall that the objective from the k-means problem is to find the lowest total within cluster variance:

$$ \sum_{j = 1}^{k}\sum_{x \in P_{j}}\vert\vert x - \bar{x}_{j}\vert\vert_{2}^{2}. \tag{3}\label{3} $$

Our given data set \(\{x_{i}\}_{\scriptsize{i=1}}^{\scriptsize{n}}\) has a finite number of points \(n\). From our earlier discussion about Stirling numbers \eqref{2}, although the number of partitions may be huge, it’s still finite. Now as Lloyd’s algorithm only uses points in the data set to estimate each cluster mean, the set of possible centroids is also obviously finite. Since Lloyd’s algorithm can only change any estimate of a centroid mean \(\bar{x}_{j}\) by changing which points are assigned to its cluster \(P_{j}\), it can only make discrete, not continuous, changes to those estimates — a defining property of combinatorial optimization problems.

Local optimality describes solutions that are better than any other ‘nearby’ solutions, but in order for this idea to make sense we need a well-defined concept of ‘nearness’; specifically, we need to be able to talk about arbitrarily ‘near’ solutions, which is equivalent to being able to move continuously from one solution to another in arbitrarily small steps. I’m talking calculus, and we’ve entered the realm of analysis. As we just saw above, however, the combinatorial nature of the k-means problem only permits discrete jumps in the value of our objective function \eqref{3}. So what does everyone mean when they use local optimality to describe the output of Lloyd’s algorithm?

Well, in order to make sense we’re going to have to get weird. Lloyd’s algorithm estimates a set of centroids \(\{\bar{x}_{j}\}_{\scriptsize{j=1}}^{\scriptsize{k}}\). To discuss the local optimality of this solution, we’re going to embed it in a vector space.

If we look inside the solution Lloyd’s algorithm gives us, it consists of a set of k different \(d\)-dimensional vectors \(\bar{x}_{j}\). To perform the desired embedding, we’re going to vectorize this solution set: concatenate all k centroid vectors into one long \(dk\)-dimensional vector

$$ \bar{\xi} = (\bar{x}_{11}, \dots,\, \bar{x}_{1d},\, \bar{x}_{21}, \dots,\, \bar{x}_{2d}, \dots,\, \bar{x}_{k1}, \dots,\, \bar{x}_{kd})'. $$

Now all of the possible partitions of our data set correspond to a discrete set of possible points \(\bar{\xi}\in\mathbb{R}^{dk}\). Starting from any initialization, it should be easy to see that after each assignment step, the cost \eqref{3} is either reduced or constant. Similarly, the variance minimizing property of means implies that after each update step, the cost is either reduced or constant. So overall, each iteration through the main loop of Lloyd’s algorithm cannot increase the value of the objective; since we only have a finite number of possible objective values, we will eventually reach an objective value no lower than the previous value, which will cause Lloyd’s algorithm to terminate.

So every iteration changes \(\bar{\xi}\) and lowers the cost until it can’t be lowered anymore — but the final \(\bar{\xi}\) returned may not yield the lowest possible cost \eqref{1} for the given data set. What’s going on?

Well, this is where the hard assignment, batch processing, and determinism of Lloyd’s algorithm come in. Initialization picks some \(\bar{\xi}^{(0)}\) from the discrete set of all possible \(\bar{\xi}\). Now at each successive iteration \(t\), the current cost defines a neighborhood of accessible \(\bar{\xi}\) from which Lloyd’s algorithm can choose; if one of these \(\bar{\xi}\) reduces the objective, the assignment step essentially computes a new value for \eqref{3} and the update step moves our long centroid vector from \(\bar{\xi}^{(t)}\) to \(\bar{\xi}^{(t+1)}\). Because of the structure of Lloyd’s algorithm, every \(\bar{\xi}^{(t+1)}\) is completely determined by \(\bar{\xi}^{(t)}\).

What happens is that eventually, Lloyd’s algorithm reaches a point \(\bar{\xi}^{(t')}\) whose neighborhood contains no other \(\bar{\xi}\). In the language of analysis, we have reached an open neighborhood of \(\bar{\xi}^{(t')}\) where the cost is constant. In this sense, the centroids returned by Lloyd’s algorithm corresponding to \(\bar{\xi}^{(t')}\) yield a local minimum of the cost function.

The above argument moves the combinatorial optimization procedure used by Lloyd’s algorithm into the context of gradient descent 16. This gradient descent formulation also makes the size of the neighborhoods in each iteration more concrete. It turns out the learning rate in the \(d\)-dimensional subspace of \(\mathbb{R}^{dk}\) corresponding to \(\bar{x}_{j}\) is equal to \(\frac{1}{\vert P_{j} \vert}\). In other words, the step size in the part of \(\bar{\xi}\) that represents \(\bar{x}_{j}\) is defined by the number of points currently assigned to cluster \(P_{j}\), and all k step sizes define the neighborhood of \(\bar{\xi}\) described above16.
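A cheap way to convince yourself that these are only local minima is to run the algorithm from several different initializations and compare the final costs. The sketch below uses scikit-learn’s KMeans, again with init='random' and n_init=1 as a stand-in for a single run of Lloyd’s algorithm, on deliberately lopsided toy blobs; different seeds will typically settle on different values of the objective \eqref{3}.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Three well-separated blobs of very unequal size make poor local minima easy to hit.
X = np.vstack([rng.normal([0, 0], 0.3, (300, 2)),
               rng.normal([6, 0], 0.3, (20, 2)),
               rng.normal([0, 6], 0.3, (20, 2))])

costs = set()
for seed in range(20):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    costs.add(round(km.inertia_, 3))     # the achieved value of objective (3)

print(sorted(costs))   # usually more than one distinct value: distinct local minima
```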

Another aspect of the clusters returned by Lloyd’s algorithm is their tendency toward spherical shape. The shape of the clusters is a direct consequence of the k-means objective and the hard assignment used.

For a fixed point \(\hat{\mu}_{j}\), an elementary property of the Euclidean metric is that all points a fixed distance \(r\) from \(\hat{\mu}_{j}\) form a sphere centered at \(\hat{\mu}_{j}\) with radius \(r\). So in the assignment step, for each point \(x_{i}\), Lloyd’s algorithm essentially forms a sphere centered at each \(\hat{\mu}_{j}\) with a radius of \(\vert\vert x_{i} - \hat{\mu}_{j}\vert\vert_{2}\) and chooses the sphere with the smallest radius to determine assignment.

Obviously, the specific location of the points in the data set will determine the actual shape of the clusters found by Lloyd’s algorithm. But for each cluster, if we take the point in that cluster farthest from its centroid, we can form a sphere containing all points in the cluster.

The last common attribute of clusters resulting from Lloyd’s algorithm we will consider is the claim that they tend to be of equal size. This description tends to cause confusion, as the notion of size intended is often left undefined. The centroids produced by Lloyd’s algorithm define a centroidal Voronoi tessellation: we can split up \(\mathbb{R}^{d}\) into k cells, where the \(j^{th}\) cell consists of all the points closer to \(\hat{\mu}_{j}\) than any other centroid. When describing the ‘size’ of clusters, visualizations where \(d=2\) are often used, with each cell distinctly colored. When plotted just right, the area covered by each color can be made roughly equal, and the equal size claim is considered justified. Of course, such a justification is absurd: some of the Voronoi cells are unbounded and so have infinite area. All you have to do is move around the plot, and the proportion of each color will change.

We can attempt to correct this error by drawing a box around our data set with lengths defined by the smallest and largest coordinates in each dimension. While this forces each Voronoi cell to be compact, it doesn’t capture the real notion of ‘size’ used by Lloyd’s algorithm; besides, the areas of those formerly infinite Voronoi cells are now governed by the outliers in our data.

Some descriptions instead use the number of points in each cluster to define their sizes, but this too does not completely describe the results of Lloyd’s algorithm: it is easy to construct examples of clusters containing very different numbers of points, but which produce lower total within cluster variance than a more equal division of points into clusters.

Combining the variance and the cardinality of the clusters in a certain way, however, does give us a strict meaning of ‘size’ which can describe the behavior of Lloyd’s algorithm. For a given data set \(\{x_{i}\}_{\scriptsize{i=1}}^{\scriptsize{n}}\), consider the \(j^{th}\) cluster \(P_{j}\). As before, the unnormalized variance within this cluster is given by its total sum of square errors, \(\scriptsize{\displaystyle\sum_{x \in P_{j}}}\normalsize\vert\vert x - \bar{x}_{j}\vert\vert_{2}^{2}\). What happens if we add some point \(x^{\ast}\) to cluster \(P_{j}\)?

Well, if you work out some mildly tedious algebra, you’ll find that after adding point \(x^{\ast}\) to cluster \(P_{j}\), its total sum of square errors becomes 17

$$ \scriptsize{\displaystyle\sum_{x \in P_{j}}}\normalsize\vert\vert x - \bar{x}_{j}\vert\vert_{2}^{2} + \scriptsize{\frac{\vert P_{j}\vert}{\vert P_{j}\vert + 1}}\normalsize{\vert\vert x^{\ast} - \bar{x}_{j}\vert\vert_{2}^{2}}. \tag{4}\label{4} $$

That scale factor of \(\hspace{0.2em}^{\vert P_{j}\vert}/_{\vert P_{j}\vert + 1}\) actually accounts for the change in the cluster mean caused by the presence of \(x^{\ast}\); that tedious algebra I mentioned shows that replacing the new mean \(\bar{x}_{j}'\) with the old mean \(\bar{x}_{j}\) is equivalent to multiplying the square error of the new point \(x^{\ast}\) by that scale factor, i.e. \eqref{4} is equal to

$$ \scriptsize{\displaystyle\sum_{x \in P_{j}}}\normalsize\vert\vert x - \bar{x}_{j}'\vert\vert_{2}^{2} + \vert\vert x^{\ast} - \bar{x}_{j}'\vert\vert_{2}^{2}. \tag{5}\label{5} $$

Using this expression for the change in total sum of square errors, we can see that the ‘size’ meant when describing the clusters returned by Lloyd’s algorithm is a product of both the number of points in each cluster \(\vert P_{j}\vert\) and its unnormalized variance. The batch processing of Lloyd’s algorithm means that two iterations are required for this information to be used by the algorithm. When a new point is assigned to some cluster \(P_{j}\), the update step has to re-estimate the mean before the change to the cluster’s total sum of squares is reflected, and only then is this information usable by the next iteration’s assignment step.
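If the mildly tedious algebra isn’t your idea of a good time, a quick numerical check of \eqref{4} and \eqref{5} is easy to write; the toy data and names below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
P_j = rng.normal(size=(30, 2))     # points currently in cluster P_j
x_star = rng.normal(size=2)        # the point being added

def sse(points, c):
    return np.sum((points - c) ** 2)

old_mean = P_j.mean(axis=0)
new_points = np.vstack([P_j, x_star])
new_mean = new_points.mean(axis=0)

m = len(P_j)
direct = sse(new_points, new_mean)                                                   # equation (5)
incremental = sse(P_j, old_mean) + (m / (m + 1)) * np.sum((x_star - old_mean) ** 2)  # equation (4)
print(np.isclose(direct, incremental))   # True
```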

To make things more concrete, let’s compare two clusters \(P_{j_{1}}\) and \(P_{j_{2}}\) in the special case where we only need to assign one point \(x^{\ast}\) to one of these clusters 18. Assume both clusters have the same total sum of square errors, and the distance from \(x^{\ast}\) to each cluster’s centroid is equal:

$$ \begin{align} \scriptsize{\displaystyle\sum_{x \in P_{j_{1}}}}\normalsize\vert\vert x - \bar{x}_{j_{1}}\vert\vert_{2}^{2} &= \scriptsize{\displaystyle\sum_{x \in P_{j_{2}}}}\normalsize\vert\vert x - \bar{x}_{j_{2}}\vert\vert_{2}^{2},\\[0.2em] \vert\vert x^{\ast} - \bar{x}_{j_{1}}\vert\vert_{2}^{2} &= \vert\vert x^{\ast} - \bar{x}_{j_{2}}\vert\vert_{2}^{2}. \end{align} $$

In this case, the cardinalities \(\vert P_{j_{1}}\vert\) and \(\vert P_{j_{2}}\vert\) of each cluster become the determining factor. Since we’re only assigning \(x^{\ast}\) and the distances to each centroid are equal, Lloyd’s algorithm will break the tie and assign \(x^{\ast}\) to one of the clusters arbitrarily. Let’s say \(x^{\ast}\) was assigned to \(P_{j_{1}}\). Following the update step, the sum of square errors for cluster \(P_{j_{1}}\) has increased by \(\scriptsize{\frac{\vert P_{j_{1}}\vert}{\vert P_{j_{1}}\vert + 1}}\normalsize{\vert\vert x^{\ast} - \bar{x}_{j_{1}}\vert\vert_{2}^{2}}\).

If \(\vert P_{j_{1}}\vert > \vert P_{j_{2}}\vert\), then \(\frac{\vert P_{j_{1}}\vert}{\vert P_{j_{1}}\vert + 1}\vert\vert x^{\ast} - \bar{x}_{j_{1}}\vert\vert_{2}^{2} > \frac{\vert P_{j_{2}}\vert}{\vert P_{j_{2}}\vert + 1}\vert\vert x^{\ast} - \bar{x}_{j_{2}}\vert\vert_{2}^{2}\), and the total sum of square errors can be reduced by moving \(x^{\ast}\) to \(P_{j_{2}}\). So if the sums of squares of our two clusters are equal, \(x^{\ast}\) is assigned to the cluster with fewer points in at most two iterations.

Of course, if the sums of square errors of our two clusters are not equal, then the behavior of Lloyd’s algorithm is a more complicated interplay between the square errors and cardinalities of the different clusters. The spherical shape of the clusters can provide some insight into this aspect of a cluster’s size. And of course, assigning only one point to one of two clusters is not a typical iteration of Lloyd’s algorithm. But the above ain’t just shootin’ the breeze — that would be a waste of ammunition. It illustrates that the size of a cluster is determined by both the spread of the points from its centroid, as quantified by the sum of square errors, and the number of points in the cluster. In trying to minimize the total sum of square errors, Lloyd’s algorithm will tend to produce clusters of roughly equal size in precisely this sense.

The Measure of a Mistake

We have so far covered some of the most salient properties of the k-means problem and the method most commonly associated with it, Lloyd’s algorithm. Hopefully, you now have a firm grasp on both, as well as the distinction between them. This is a natural point at which to address a mistake that appears fairly common in both the literature and implementations, and which ultimately originates from an ignorance of the topics we’ve just covered.

If you look back at equation \eqref{1}, you’ll see that the k-means problem is one of variance minimization. The objective is to find a partition of our data into k clusters that produces the lowest possible total within cluster variance. Since the means of each cluster are the unique minimizers of the variance, they naturally show up.

The number of candidate partitions is huge, \(O(k^n)\) from bounds on Stirling numbers of the second kind. Using those equivalence class vectors of means \(\bar{\xi} \in \mathbb{R}^{dk}\) from our discussion of local optimality, a combinatorial argument can be applied to the algebraic surfaces defining the boundaries of the Voronoi tessellations corresponding to each partition, and our bound can be improved to \(O(n^{dk + 1})\) 19. That’s still huge. So we resorted to heuristic methods — sacrificing optimality for speed. The heuristic we’ve studied is Lloyd’s algorithm. This procedure makes some further assumptions not included in the k-means problem, some pertinent consequences of which we’ve examined above.

Examining the assignment step, Lloyd’s algorithm can be classified as a greedy optimization method: each \(x_{i}\) minimizes its own square error with gluttonous disregard for the square errors of any other point \(x_{\ell}\), where \(\small\ell \neq i\).

Every point behaves greedily during the assignment step, and some simple algebra shows that minimizing the square error to a centroid is the same thing as minimizing the Euclidean distance. Just take the square root, bada bing bada boom, you’ve got the distance. Since the square root is monotonic, the minimizer of one is the minimizer of the other:

$$ \argmin_{\scriptsize\mu \in \{\mu_{j}\}_{j=1}^{k}}\vert\vert x_{i} - \mu \vert\vert_{2}^{2} = \argmin_{\scriptsize\mu \in \{\mu_{j}\}_{j=1}^{k}}\vert\vert x_{i} - \mu \vert\vert_{2} $$

And this is where the trouble starts. That the assignment step selects the nearest centroid for every point is incidental, a tautology of algebra. When explaining what Lloyd’s algorithm does, more internet posts than I’d like to remember describe the assignment step only as finding the nearest centroid. Just check out any Medium article on k-means. Sure, this is an equivalent way of describing the assignment step, but it is actually a consequence of the specific heuristics Lloyd’s algorithm uses: splitting the estimation task into separate assignment and update steps, and using greedy optimization in assignment. This heuristic works fairly well in minimizing the true k-means objective, the total within cluster variance, but as we’ve already seen it does not guarantee finding the true minimum. Unfortunately, many people conflate the algorithm with the problem, and relegate themselves to a superficial understanding polluted by epiphenomena.

The first issue that arises from this incomprehension commonly shows up when someone tries to roll their own version of Lloyd’s algorithm — they’ll actually perform a square root operation for every point and centroid pair in the assignment step to find the distance. These are garbage operations. Even if your square root implementation conforms to IEEE 754 or 854 floating point standards with exact rounding 20, you’re still introducing an unnecessary operation and losing precision. Those square roots might be inconsequential on toy data sets, but for massive data sets and distributed processing this is more likely to be an issue.
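For the record, the assignment step needs nothing beyond squared errors. A minimal NumPy sketch:

```python
import numpy as np

def assign(X, centroids):
    """Assignment step using squared errors directly -- no square roots required."""
    sq_err = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    return sq_err.argmin(axis=1)   # argmin of ||x - mu||^2 is the argmin of ||x - mu||

# Taking np.sqrt(sq_err) first yields exactly the same labels while burning one
# square root per point-centroid pair and throwing the results away.
```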

Now if you take that first mistake, pretend it’s insight, and smugly elaborate on it, you’ll get the second, and much worse, error stemming from this confusion: swapping arbitrary non-Euclidean metrics in place of the square error in the assignment step while leaving the update step unchanged. If the last mistake was garbage, this one is the dump that garbage took. Both the assignment and update steps of Lloyd’s algorithm try to minimize variance — they each have the same objective in mind. If instead of the square error, the assignment step uses, say, the \(L_{1}\) metric to determine the ‘nearest’ centroid for every point, the assignment and update steps of Lloyd’s algorithm now have two different objectives. You know what minimizes the \(L_{1}\) error? The median. Since the result of one step provides the parameters of the other step, this just creates a cycle of compounding bad estimates. That guarantee that we’d at least find a locally optimal solution using Lloyd’s algorithm? Gone. You’re bada bing bada screwed.

Unfortunately, this perversion of Lloyd’s algorithm seems rampant — again, just check any Medium article on k-means. Such ridiculous methods also show up in the literature 21 22 23 24, as well as in primary implementations in MATLAB and Mathematica. The MATLAB documentation contains an example of an ‘analysis’ using the Manhattan distance. MATLAB’s kmeans function tries to correct for the potential bunkum of swapping in a non-Euclidean metric by performing a ‘corrective’ pass over every data point at the end. Good luck with that. The amap package in R, which according to rdocumentation.org is in the 95th percentile in terms of overall downloads, has a kmeans implementation that allows arbitrary distance functions, with an inefficient implementation of Lloyd’s algorithm as a bonus. The kmeans function in amap should never be used; the standard stats package’s kmeans function is a far better implementation, which also uses the Hartigan-Wong algorithm by default.

Listen, sure, you’ll get some results. Hell, if you cap the iteration number, by definition your algorithm will ‘converge.’ But like my grandmammy said, just cause you put a cat in the oven, don’t make it a biscuit. The convergence of Lloyd’s algorithm for the k-means problem is based on both steps operating on the same objective - minimizing cluster variance. If you use a non-Euclidean metric in the assignment step, then this alteration has to be reflected in an appropriate modification of your update step. Now, there are in fact modified versions of Lloyd’s algorithm that do just that. Unfortunately, the term ‘k-means’ is sometimes used to describe these procedures, even though the cluster centers are no longer given by the arithmetic means of the cluster points. But at least these procedures make sense. The mathematical soundness of some such generalizations can be grounded in the theory of Bregman divergences, which I’ll touch on when discussing the connection of k-means algorithms to Gaussian mixture models.
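For concreteness, here is a hedged sketch of that kind of consistent pairing, often called k-medians: the \(L_{1}\) metric in the assignment step matched with the coordinate-wise median, its minimizer, in the update step. The helper name is invented, and edge cases such as empty clusters are ignored.

```python
import numpy as np

def k_medians(X, k, seed=None, max_iter=100):
    """Sketch of a Lloyd-style loop in which both steps target the L1 objective (k-medians)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center under the L1 (Manhattan) metric.
        l1 = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = l1.argmin(axis=1)
        # Update step: the coordinate-wise median minimizes the within cluster L1 error.
        new_centers = np.array([np.median(X[labels == j], axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```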

Sequential K-Means

As mentioned above, Lloyd’s algorithm is a batch algorithm — only after every data point has been assigned to a cluster do we update our estimates of the centroids. This can lead to large changes in the centroid estimates, and can cause large memory overheads. An alternative approach is to interleave the assignment and update steps: after assigning a point to a cluster, update the centroid estimate immediately.
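Here is a minimal sketch of one such online pass, in the spirit of MacQueen’s update; the function name and the seeding assumption in the comments are mine, not a reference implementation.

```python
import numpy as np

def sequential_kmeans_pass(X, centroids):
    """One online pass over the data: assign a point, then immediately nudge its centroid.

    Assumes each centroid was seeded from a single data point, so counts start at one."""
    centroids = np.asarray(centroids, dtype=float).copy()
    counts = np.ones(len(centroids))
    for x in X:
        j = np.argmin(np.sum((centroids - x) ** 2, axis=1))  # nearest centroid by square error
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]        # running-average (incremental mean) update
    return centroids
```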

Properties

Solution Characteristics

Hartigan-Wong Algorithm

Neither the classic Lloyd-Forgy algorithm nor the online MacQueen algorithm takes into account the potential degradation of the within cluster variance when a single point is added to a cluster. Specifically, whenever a point is assigned to a cluster, the estimate of the mean will change — and this will cause the cluster variance to change not just because of the added point, but because the square error for all the other cluster points will also change. The Hartigan-Wong algorithm takes this into account and provides a better approximation method to minimizing the total within cluster variance.
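As a sketch of the quantity such a method weighs, here is the exact change in total within cluster variance from moving a single point between clusters, mean shifts included. It combines \eqref{4} with its removal counterpart (the same algebra run in reverse); the function name is invented, and it assumes the source cluster keeps at least one member after the move.

```python
import numpy as np

def sse_change_if_moved(x, source_pts, target_pts):
    """Exact change in total within cluster SSE from moving x out of its current cluster
    (source_pts, which still contains x) and into another cluster (target_pts).

    A Hartigan-Wong style step would perform the move only when this change is negative."""
    ns, nt = len(source_pts), len(target_pts)
    mu_s = source_pts.mean(axis=0)    # mean of the source cluster, x included
    mu_t = target_pts.mean(axis=0)    # mean of the candidate target cluster
    decrease = (ns / (ns - 1)) * np.sum((x - mu_s) ** 2)   # SSE freed by removing x
    increase = (nt / (nt + 1)) * np.sum((x - mu_t) ** 2)   # SSE added by inserting x, as in equation (4)
    return increase - decrease
```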

Properties

Solution Characteristics

Initialization Methods

Data Normalization

Clustering Results

Benchmarks

Cluster Validity

Relationship to Other Methods

Gaussian Mixture Models

Principal Component Analysis

Expectation Maximization

Considerations in Application

Estimating k

Alternative Methods

Summary

  1. Ryan Jacobs, The Dark Side Of the Truffle Trade

  2. Naftali Tishby, Fernando C. Pereira, William Bialek, The Information Bottleneck Method

  3. Eric W. Weisstein, Stirling Number of the Second Kind

  4. Basil Cameron Rennie, Annette Jane Dobson, On Stirling Numbers of the Second Kind

  5. Hans-Hermann Bock, Clustering Methods - A History of k-Means Algorithms

  6. Stuart Lloyd, Least squares quantization in PCM

  7. José M. Pena, Jose Antonio Lozano, Pedro Larranaga, An empirical comparison of four initialization methods for the k-means algorithm

  8. Michael R. Anderberg, Cluster analysis for applications

  9. M. Emre Celebi, Hassan A. Kingravi, Patricio A. Vela, A comparative study of efficient initialization methods for the k-means clustering algorithm

  10. Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, Chaitanya Swamy, The effectiveness of Lloyd-type methods for the k-means problem

  11. Margareta Ackerman, Shai Ben-David, Clusterability: A theoretical study

  12. Andrea Vattani, k-means requires exponentially many iterations even in the plane

  13. David Arthur, Bodo Manthey, Heiko Röglin, k-Means has Polynomial Smoothed Complexity

  14. Abigail See, Smoothed Analysis With Applications in Machine Learning

  15. David Arthur, Sergei Vassilvitskii, How slow is the k-means method?

  16. Leon Bottou, Yoshua Bengio, Convergence properties of the k-means algorithms

  17. F. H. C. Marriott, Optimization methods of cluster analysis

  18. Mingjin Yan, Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion

  19. Mary Inaba, Naoki Katoh, Hiroshi Imai, Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering

  20. David Goldberg, What every computer scientist should know about floating-point arithmetic

  21. Archana Singh, Avantika Yadav, Ajay Rana, K-means with Three different Distance Metrics

  22. Dibya Jyoti Bora, Anil Kumar Gupta, Effect of different distance measures on the performance of K-means algorithm: an experimental study in Matlab

  23. Laurence Morissette, Sylvain Chartier, The k-means clustering technique: General considerations and implementation in Mathematica

  24. Y. S. Thakare, S. B. Bagal, Performance evaluation of K-means clustering algorithm with various distance metrics