UML XIII: Online Learning and Clustering

(This is the thirteenth post in a sequence on Machine Learning based on this book. Click here for part I.)
In the previous post, I mentioned the distinction between supervised and unsupervised learning. Here is a more comprehensive picture:
Supervised learning: learning a fixed predictor based on a fixed training sequence generated by a fixed distribution
Unsupervised learning: learning without training data
Online learning: updating a predictor over time based on data that arrives over time, generated by a process that might change and/or depend on past predictions
In particular, the performance of a predictor in supervised learning is measured after all learning has taken place, whereas, in online learning, the accuracy of our predictor matters at every step along the way.
As an example: while it is possible to model spam detection as a supervised learning problem, it is more accurate to model it as online learning, since (a), we may wish to update our spam filter continually, and (b), people composing spam emails may change their behavior based on existing spam filters. Online learning is also the model that comes closest to describing how real humans learn (especially children).
In this post, we’ll look at Online Learning as a whole and at clustering, which is a particular problem of unsupervised learning.
Online Learning
For this post, we restrict ourselves to binary classification problems, i.e., Y={0,1}.
Recall that, in supervised learning, we assume there is a fixed probability distribution D over X×Y according to which the environment generates labeled points. Here, each x∈X is a representation of the real-world thing we wish to label rather than the thing itself. For example, if the task is spam detection of emails, then x would be a vector of some of the email’s features (length, address of the sender, number of times the word “money” appears, number of images, …) rather than an encoding that includes the entire body of text. This means that two different emails may have the same feature representation, and, consequently, the same domain point x∈X might sometimes have label 1 and sometimes label 0.
In some cases, we may wish to assume that labels are deterministic regardless, either because it’s a realistic assumption for our particular problem, or for reasons of simplicity. If we do, we can alternatively model the environment as having a probability distribution D over just X, as well as a true labeling function henvironment:X→Y. The environment then generates labeled points by sampling x∈X according to D and outputting (x,henvironment(x)).
These last two paragraphs are pure repetition that I’ve included to precisely highlight the differences between supervised learning and online learning.
In the most general formulation of online learning, we do not assume that the labeled domain points are generated by a probability distribution D. This implies two differences from supervised learning:
(1): the process by which labels are generated (be it henvironment or the conditional probability distribution D(y|x)) is allowed to change over time; and
(2): the process by which domain points are generated is not fixed, and might even depend on our predictions for previous labels
Furthermore, I’ve stated in the introduction that
(3) accuracy of a learning algorithm is measured based on its performance throughout the process
More precisely, we model the learning process to proceed in rounds. In every round, first, the learner A is presented with some instance x∈X; then, it predicts a label (either 0 or 1); finally, it is given the true label for x. If the prediction was wrong, we say that A made a mistake, and the goal is to design algorithms that minimize the number of mistakes. Thus, we don’t have a training phase followed by a prediction phase as in supervised learning, but rather a single phase of both training and prediction. We still model this in terms of a sequence S=((x1,y1),...,(xm,ym)), but we will call it a data sequence (a term I made up) rather than a training sequence, and we measure how well an algorithm predicts each yi based on the (xj,yj) with j<i.
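To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The learner and environment interfaces are my own invention, not the book’s – any algorithm and any adversary implementing these methods can be plugged in:

```python
def run_online_learning(learner, environment, num_rounds):
    """Run the online protocol and count the learner's mistakes."""
    mistakes = 0
    for _ in range(num_rounds):
        x = environment.next_point()           # environment picks a domain point
        y_hat = learner.predict(x)             # learner commits to a label in {0, 1}
        y = environment.true_label(x, y_hat)   # environment reveals the "true" label;
                                               # note it may depend on the prediction
        if y_hat != y:
            mistakes += 1
        learner.update(x, y)                   # learner may adapt for future rounds
    return mistakes
```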
It’s not difficult to see that, in this general setting, establishing learning guarantees is impossible. For example, if there is no correlation between the current label and past labels, an algorithm cannot do better than to guess. Worse, if the environment always chooses the label which A didn’t guess, A will inevitably make a mistake every round.
This means we have to make further assumptions to simplify the problem. One such assumption is that labels are consistent with (but not generated by) a predictor in our hypothesis class. That is, after all rounds are complete, there needs to be a predictor h:X→Y out of the class H such that all labels reported by the environment are consistent with h. However, unlike with supervised learning, we do not require that the environment “chooses” this function ahead of time. Instead, think of the environment as a bullshitter – it may throw you the least useful domain point at every step, and then declare your prediction wrong (whatever it is). As long as, in the end, it can show you a predictor in H that would have behaved in precisely this way, this is legal behavior.
It is not obvious, at least to me, that this is a particularly useful way to model things – which real-life problem behaves in this way? – but it’s what the book does, and I’m too short on time to do independent research, so we’ll just roll with it.
To recap the setting in full:
The learner A has access to the set X of domain points, the label set Y={0,1}, and a hypothesis class H⊆{f:X→Y}. Given any predictor he∈H and any sequence of points x1,...,xm∈X, the data sequence S=((x1,he(x1)),...,(xm,he(xm))) determines a process of m rounds. At round j, the learner A will be given a domain point xj and be asked to predict a label yj∈{0,1}. For every j∈[m] such that yj≠he(xj), A has made a mistake.
And our goal is to design A such that it will make as few mistakes as possible. Formally, A is just an algorithm, so we have a lot of freedom in this construction.
Note that we will measure the performance of A in terms of the highest number of mistakes across all possible data sequences. Thus, even though the formalization above makes it sound like the xj are fixed ahead of time, it is more accurate to think of them as being chosen by a malicious environment.
In the context of supervised learning, we’ve written A(S) to refer to the predictor A produces after having trained on the training sequence S. In online learning, this definition might still make sense (any reasonable algorithm will have its choices consistent with some predictor at every step); however, it’s not useful, because the performance of the final predictor isn’t what interests us. Instead, given a learner A and a data sequence S, we define MA(S) as the number of mistakes which A made throughout while predicting the labels of S, one by one. Note that MA(S)∈{0,...,|S|}.
Given a hypothesis class H, we then define MA(H):=sup{MA(S)|S∈S}, where S is the set of all data sequences that the environment is allowed to come up with (i.e., all data sequences that are consistent with some predictor in H). Note that S is allowed to contain sequences of arbitrary length. Therefore, depending on what A does, it might be that the set {MA(S)|S∈S} is unbounded, in which case MA(H)=∞. With that, we can define:
A hypothesis class H is learnable iff ∃ an algorithm A such that MA(H)≠∞.
This is a good time to think about how to construct A. Assume that the hypothesis class H is finite. How should A behave to make sure the environment can only screw it over for so long?
Here’s one possible algorithm, which we call Ahalving. This learner keeps track of a set V(t)⊆H of possible hypotheses at every time step t. When presented with a domain point xt∈X, it divides the set in two, depending on which label they give xt. I.e.:
V(t)−:={h∈V(t):h(xt)=0}
V(t)+:={h∈V(t):h(xt)=1}
We have V(t)=V(t)−⊔V(t)+ (disjoint union), which means that at least one of them contains half or more of the predictors in V(t). Our learner Ahalving chooses the label yt according to that set. I.e., if |V(t)+|≥(1/2)|V(t)| – that is, if half or more of the predictors in V(t) say that xt has label 1 – Ahalving outputs label 1.
If the environment declares the prediction wrong, Ahalving at least halves the remaining hypotheses by setting V(t+1):=V(t)− (in the case above where it predicted 1). That way, it can make at most log2(|H|) mistakes total.
Informally speaking, think of the hypothesis class H as the bullshitter’s ammunition. The only way we avoid making mistakes is to reduce this ammunition as fast as possible. Ahalving reduces it by at least 50% at each step.
Note that any A only gets to make a binary decision at each step. The decomposition V(t)=V(t)−⊔V(t)+ is a natural one, i.e., not specific to Ahalving. After the true label has been announced, the set of remaining candidates will always be one of the two – and each algorithm gets to decide which one.
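Here is a sketch of Ahalving in Python, for a finite hypothesis class given as a list of predictor functions (the class name and representation are mine, not the book’s):

```python
class HalvingLearner:
    """Predicts with the majority of the surviving hypotheses. A mistake means
    the majority was wrong, so the version space at least halves on each one."""

    def __init__(self, hypotheses):
        self.version_space = list(hypotheses)   # V(0) = H

    def predict(self, x):
        votes_for_1 = sum(1 for h in self.version_space if h(x) == 1)
        # Majority vote; ties are broken in favor of label 1.
        return 1 if 2 * votes_for_1 >= len(self.version_space) else 0

    def update(self, x, y):
        # Keep only the hypotheses consistent with the revealed label.
        self.version_space = [h for h in self.version_space if h(x) == y]
```

Plugged into the interaction loop from earlier, this learner makes at most log2(|H|) mistakes against any environment that stays consistent with H.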
L-Dimension
Unlike what one might naively suspect, the halving algorithm above is not the optimal way to minimize the number of mistakes. This is because it chooses the next subset (i.e., V(t)− or V(t)+) based on the number of hypotheses in each. However, number-of-hypotheses is a flawed measure of complexity, as you may recall from chapters I-III. In the context of supervised learning, a much better measure is the VC-dimension.
(If you’re not familiar with the VC-dimension, see post II. The one-sentence definition is that the VC-dimension of a hypothesis class H is the largest integer k such that there exists a set of k domain points in X that is shattered by H – where a set P⊆X is shattered by H iff, for every possible combination of labels of points in P, there exists a predictor h∈H assigning them these labels.)
Our goal now is to work out the analogous concept for this particular brand of online learning. This will be called the L-dimension, also named after a person (in this case, “Littlestone”). The L-dimension of a hypothesis class (written L-dim(H)) will be the largest integer k such that, for every learner A, there exists a data sequence S such that MA(S)=k. This means that MA(H)≥L-dim(H) for every algorithm A.
The L-dimension is lower-bounded by the VC-dimension. Informally, given a set P of k many points that is shattered by H, the environment can simply present A with all those points in any order and declare that all of A’s predictions were wrong. Since every combination of labels of points in P is realized by some predictor in H, this will be legit regardless of what A does. (Formally, one would have to construct a [data sequence based on the shattered set] on which A fails.)
The reverse is not true – a set of points that is not shattered could still be presented in an inconvenient order such that every learner fails.
Let’s take a close look at how both concepts are different. We can characterize the VC-dimension like so:
The VC-dimension of H is at least k⟺∃ a set of k domain points that is shattered by H
By vacuously quantifying over possible orders and labels, we can state the same in a more complicated way thus:
The VC-dimension of H is at least k⟺∃ a set of k domain points such that ∀ order on the set, ∀ combination of labels: no point's label follows from the previous ones
Similarly (although less cleanly since the formalism is trickier), we can characterize the L-dimension like so:
The L-dimension of H is at least k⟺∃f:(X×Y)∗→X such that, for k steps, ∀ combination of labels: no point's label follows from the previous ones
where f is the environment’s function that chooses the next domain point each round.
This highlights how exactly both concepts differ. The VC-dimension is about a set of points for which knowing the labels of any subset of them does not imply the label of the remaining points. The L-dimension is about the ability to pull out new points on the spot (depending on the labels of the previous ones) such that the labels of previous points don’t imply the labels of future points.
Sometimes, both concepts coincide. Consider the hypothesis class of all predictors that assign label 1 to three domain points and label 0 to all others, i.e.,
H3-positive:={f:R→Y : |{x∈R:f(x)=1}|=3}
If A starts by repeatedly guessing 0, it can make at most three mistakes – it’s not possible to present domain points in an order such that A doesn’t obtain the relevant information. This is because A’s knowledge is evenly distributed across all domain points it hasn’t seen yet; first, it doesn’t know the label of any, then it suddenly learns the label of them all. In such cases, nothing is “gained” (or “lost”, from A’s perspective) from the ability to choose points dynamically.
On the other hand, take the class of threshold predictors Hthreshold:={θx|x∈R} where θx(y):=1⟺y≥x, and consider what happens if A learns that the label of some domain point x is 1. In that case, it knows the threshold is to the left of x, which means that all points to the right of x also have label 1...
… but it doesn’t know the labels of the points to the left of x. In this case, A’s knowledge about domain points is unevenly distributed. In such a case, the ability to choose new points dynamically matters a lot – if we were trying to construct a shattered set, we would already have hit a wall.
Now, suppose A next guesses 1 on a point to the left, which the environment then declares to have label 0. Now the picture looks like so:
The area that A can classify safely has increased, but there is still the middle area that is up in the air. This allows the environment to present A with another point it doesn’t know yet – say the point right in the middle. Then, if A guesses label 1, the environment will declare that the label is 0. In that case, A knows that all points to the left of this new point are labeled 0, and the area it doesn’t know has been cut in half. Conversely, if A guesses label 0, the environment will declare that the label is 1. In that case, A knows that all points to the right of the new point are labeled 1, and the area it doesn’t know has been cut in half.
In both cases, the [area A cannot safely classify] is cut in half. Since the area is a segment of the real line, it can be cut in half arbitrarily often. This means that, even though VC-dim(Hthreshold)=1, we have L-dim(Hthreshold)=∞.
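This adversary is easy to simulate. Here’s a sketch, using exact rational arithmetic so the halving never runs out of precision (with floats, it would stall after roughly 50 rounds); the function name and interface are mine:

```python
from fractions import Fraction

def force_mistakes_on_thresholds(learner, num_rounds):
    """Adversary for H_threshold: every label it reveals stays consistent with
    some threshold in (lo, hi], yet the learner errs on every single round."""
    lo, hi = Fraction(0), Fraction(1)
    mistakes = 0
    for _ in range(num_rounds):
        x = (lo + hi) / 2                 # probe the middle of the unknown area
        y_hat = learner.predict(x)
        y = 1 - y_hat                     # always declare the prediction wrong
        if y == 0:
            lo = x                        # label 0: the threshold lies right of x
        else:
            hi = x                        # label 1: the threshold lies at or left of x
        learner.update(x, y)
        mistakes += 1
    return mistakes                       # equals num_rounds, for any learner
```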
With this bit of theory established, we can design another algorithm Alittle-L. Whereas Ahalving, confronted with the choice between V(t)+ and V(t)−, predicts according to the set with more predictors in it, Alittle-L predicts according to the class with the higher L-dimension – so that, when it is wrong, the surviving class is the one with the lower L-dimension.
If L-dim(V(t)−)=L-dim(V(t)+)=k, then L-dim(V(t)) is at least k+1. This is so because the environment can let A make a mistake at round t, followed by at least k more in subsequent rounds (recall what the L-dimension represents). Therefore, whenever Alittle-L makes a mistake, the L-dimension of V(t) decreases by at least 1. This implies that MAlittle-L(H)≤L-dim(H). Since MA(H)≥L-dim(H) holds for every algorithm A, this implies that Alittle-L is the optimal algorithm in this model.
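For a finite class over a finite domain, the L-dimension can be computed by brute-force recursion, and Alittle-L falls out directly. A sketch (exponential time, purely illustrative; all names are mine):

```python
def l_dim(hypotheses, domain):
    """L-dimension of a finite class: the deepest 'mistake tree' the
    environment can build. Exponential time; fine for toy examples."""
    if len(hypotheses) <= 1:
        return 0
    best = 0
    for x in domain:
        v0 = [h for h in hypotheses if h(x) == 0]
        v1 = [h for h in hypotheses if h(x) == 1]
        if v0 and v1:   # on x, the environment can force a mistake either way
            best = max(best, 1 + min(l_dim(v0, domain), l_dim(v1, domain)))
    return best

class LittleLLearner:
    """Predicts with the half of the version space of larger L-dimension, so
    every mistake strictly decreases the L-dimension of the version space."""

    def __init__(self, hypotheses, domain):
        self.version_space = list(hypotheses)
        self.domain = domain

    def predict(self, x):
        v0 = [h for h in self.version_space if h(x) == 0]
        v1 = [h for h in self.version_space if h(x) == 1]
        return 1 if l_dim(v1, self.domain) >= l_dim(v0, self.domain) else 0

    def update(self, x, y):
        self.version_space = [h for h in self.version_space if h(x) == y]
```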
Note that, if H has 2k elements, then MAhalving(H)≤k. Furthermore, L-dim(H)≤k and thus MAlittle-L(H)≤k. If the L-dimension is exactly k (this happens if H is simply the set of all possible predictors on a domain set with k elements), then both learners have identical mistake bounds. However, if L-dim(H)=d<k, then Alittle-L is guaranteed to make at most d mistakes, while Ahalving might make up to k.
Clustering
Clustering is about dividing a set of points into disjoint subsets, called clusters, which are meant to group similar elements together. For example, consider the following instance:
Here is a possible clustering:
Here’s another:
According to the book, there is no “ground truth” when it comes to clustering, in part because there could be multiple meaningful ways to cluster any given data set. I would dispute that claim – given a finite set of points, there trivially has to be an optimum clustering for any precise criterion. However, there is usually no way to evaluate how well this criterion has been met. Suppose you are given some data set, and run a clustering algorithm just to understand it a little better, without even knowing what you are looking for. There will be some clustering that provides you with an optimal amount of insight (otherwise, there wouldn’t be any point in having nontrivial algorithms), but it is impossible to tell whether any given clustering is optimal. Furthermore, upon seeing one clustering, you will learn some things about the data, which changes the metric of which clustering is most informative next.
Changing metrics aside, learning clustering from training data seems possible in principle. One could input several training sets and their respective optimal clusterings according to human judgment. However, this would require a significantly more complex formalism than what we utilized for supervised learning. For example, the quality of the clustering could no longer be evaluated locally – the same point might “belong” in a different cluster if the training set is extended.
The practical consequence is that one doesn’t have a “learner” in the same sense (at least, the book doesn’t discuss any such approaches). Instead, all one can do is define various algorithms that seem to make intuitive sense. This makes our job simpler; all the clustering algorithms we’ll look at are quite easy to understand. Consequently, and because there are pretty good resources out there, I’ll mostly skip describing the algorithms in detail and link to external resources instead.
Formalizing the setting
We assume our input is a finite metric space. That is, a pair (X,d) where X is any finite set and d:X×X→R is a metric on the set (it measures distances between points). Saying that d is a metric is equivalent to saying that it has the following four properties:
d(x,x)=0∀x∈X (#1)
d(x,y)=d(y,x)∀x,y∈X (symmetry)
d(x,y)>0∀x,y∈X s.t. x≠y (#3)
d(x,z)≤d(x,y)+d(y,z) (triangle inequality)
Something we do not require is the ability to draw a line between points, compute the midpoint of that line, or anything like that. Clustering algorithms work just based on distances. Given (X,d), the output of a clustering algorithm is a set C1,...,Ck such that X=C1⊔⋯⊔Ck. Some clustering algorithms also take k as an input parameter, i.e., they’re being told how many clusters they ought to put out.
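As a quick illustration, a finite metric space can be represented as an n×n distance matrix, and the four properties can be checked mechanically. A sketch (my own, not from the book):

```python
def is_metric(d, tol=1e-12):
    """Check the four metric properties for an n-by-n distance matrix d."""
    n = len(d)
    for i in range(n):
        if d[i][i] != 0:                                  # (#1) d(x,x) = 0
            return False
        for j in range(n):
            if d[i][j] != d[j][i]:                        # symmetry
                return False
            if i != j and d[i][j] <= 0:                   # (#3) positivity
                return False
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:     # triangle inequality
                    return False
    return True
```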
Impossibility Results
Here’s something interesting, before we look at/link to explanations of the actual algorithms. You might have heard the claim that “there is no optimal voting system.” This is based on something called Arrow’s impossibility theorem, and I’m going to explain exactly what it means (this will be relevant for clustering in just a bit). Consider a situation where there are a bunch of candidates and a bunch of parties doing ranked-choice voting. Each party’s vote is an ordered list of how much they like all candidates. Now some voting system evaluates these votes and outputs an ordered list of candidates as the result.
Consider the following properties such a system might have:
Non-dictatorship: there is no one party such that the system always outputs exactly the order this party submitted
Independence of irrelevant alternatives: any party changing their order of C and D cannot change the system’s order of A and B. Only changes that are about either A or B can do that.
Pareto efficiency: if all votes prefer A over B, so must the system’s output
Everyone to whom all three properties seem highly desirable has a problem because Arrow’s Theorem states that no system can meet all three properties unless there are only two candidates. (If there are only two candidates, a majority vote with a deterministic way to break ties satisfies all three properties). The proof consists of taking a system that has properties #2 and #3 and showing that it is dictatorial (i.e., there is a party such that the system always outputs that party’s submission). If one considers #1 and #3 to be essential (and #2 somewhat less so), Arrow’s theorem can be summarized as “in every reasonable voting system with more than two candidates, strategic voting must be a thing.” These kinds of results are sometimes called impossibility theorems: we list a bunch of properties that all seem to make sense and then prove that they’re incompatible.
For clustering algorithms, we have something similar, except that it’s much less impressive because the properties aren’t as obviously desirable. Nonetheless, it’s worth bringing up. The properties are
Scale-invariance: multiplying all distances by a constant factor doesn’t change the clustering
Consistency: moving a point closer towards all points in a cluster cannot cause it to be dropped from that cluster, and moving it farther away from all points in a cluster cannot cause it to become part of that cluster
Richness: any clustering is possible (for some distance function)
If the distance function measures Euclidean distances in 2-dimensional space, then the last property says that, for any possible clustering, you can arrange your points such that the algorithm will return that clustering.
The impossibility theorem states that, while any two of these properties are achievable, no algorithm can achieve all three.
To me, consistency feels like an obvious requirement, but the other two less so. If one shares this view, the impossibility theorem can be summarized as “any reasonable clustering algorithm either depends on scale or is restricted to only a subset of possible clusters.”
In this case, the proof is simple. Suppose X={x1,...,xn} where n>1. We assume a clustering algorithm A that meets all three properties above and derive a contradiction.
One possible clustering is the trivial clustering, Ctrivial:={{x1},...,{xn}}. Due to the richness property, there must be some metric d∗ such that A(X,d∗)=Ctrivial. Furthermore, let C? be an arbitrary clustering other than Ctrivial. Due to richness, there has to, again, be some distance function d? such that A(X,d?)=C?.
Now we show that A(X,d∗)=A(X,d?), which will be our contradiction. We do this by transforming d∗ into d? in such a way that (due to the three properties of A) the clustering cannot change.
First, we change d∗ to dsmall by reducing all distances by the same factor such that the largest distance in dsmall becomes the smallest nonzero distance in d?. (Then, any distance in dsmall is at most as large as any nonzero distance in d?.) Formally, we set dsmall(x,y):=αd∗(x,y)∀x,y∈X where
α:=min{d?(x,y)|x,y∈X s.t. x≠y} / max{d∗(x,y)|x,y∈X s.t. x≠y}
Due to scale invariance, we have A(X,d∗)=A(X,dsmall). Well, and now we increase every distance from dsmall until it is equal to that of d?. Due to consistency, there cannot be any cluster in A(X,d?) that consists of more points than before, because every point either remained at the same distance to that cluster or moved farther away. Thus, since A(X,dsmall) had each point in its own cluster, so does A(X,d?) – but A(X,d?)=C?≠Ctrivial, which is our contradiction.
Finally, let’s take a (brief) look at two families of clustering algorithms.
I
Suppose we have some way of measuring the distance between a point and a cluster. Then, we can do the following:
Begin with the trivial clustering, i.e., C(0):=Ctrivial
Merge the cluster/point pair with minimal distance
(Note that |C(t+1)|=|C(t)|−1, i.e., we have one fewer cluster.)
Repeat the previous step until all points are in a single cluster
Output the entire history; at every time step, we have one possible clustering
Alternatively, we can stop based on some criterion (max distance or max number of clusters).
This is called agglomerative clustering. See here for a decent video explanation.
Note that every definition for the distance between a point p and a cluster C yields its own variant of the algorithm. Possible choices are
min{d(p,q)|q∈C}
max{d(p,q)|q∈C}
(1/|C|) ∑q∈C d(p,q)
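Here is a sketch of the merge loop, written for the common variant that measures distances between whole clusters (the point/cluster version above is the special case where one of them is a singleton). The linkage parameter aggregates the pairwise distances – min gives single linkage, max gives complete linkage; all names are mine:

```python
def agglomerative_clustering(points, d, linkage=min):
    """Repeatedly merge the two closest clusters; return the whole history,
    which contains one candidate clustering per time step."""
    clusters = [[p] for p in points]                  # C(0): the trivial clustering
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best_dist, best_pair = float("inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(d(p, q) for p in clusters[i] for q in clusters[j])
                if dist < best_dist:
                    best_dist, best_pair = dist, (i, j)
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history                                    # history[t] has |X| - t clusters
```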
II
A different approach is called the k-means algorithm, which works as follows:
Choose k points c1,...,ck at random, called the centers
Let the corresponding clusters be C1,...,Ck
Assign each point to the cluster whose center it is closest to. I.e., assign p∈X to Cj where j∈argmini∈{1,...,k}[d(p,ci)].
Move each center to the center of mass of its cluster. I.e., set cj:=(1/|Cj|)∑p∈Cj p. (If we are not in a vector space and cannot do this, instead choose cj∈argminx∈X∑p∈Cj d(x,p).)
Repeat the previous two steps until the assignment doesn’t move any point to a different cluster
The result depends heavily on the initial randomizing step. To remedy this, one can run the algorithm many times, and then choose the best clustering, where “best” can be measured in a couple of ways, such as by comparing the sums ∑j=1..k ∑p∈Cj d(cj,p)² and choosing the clustering with the smallest sum.
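A hedged sketch of the full procedure for points given as tuples of coordinates, with random restarts and the squared-distance sum as the selection criterion (parameter names are mine):

```python
import random

def k_means(points, k, num_restarts=10, max_iters=100):
    """Lloyd's algorithm with random restarts; returns the clustering whose
    sum of squared point-to-center distances is smallest."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_sum, best_clusters = float("inf"), None
    for _ in range(num_restarts):
        centers = random.sample(points, k)        # initial centers: random points
        for _ in range(max_iters):
            # Assignment step: each point joins the cluster of its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
                clusters[j].append(p)
            # Update step: move each center to its cluster's center of mass
            # (an empty cluster keeps its old center).
            new_centers = [
                tuple(sum(cs) / len(c) for cs in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if new_centers == centers:            # assignments have stabilized
                break
            centers = new_centers
        total = sum(sq_dist(p, centers[j]) for j, c in enumerate(clusters) for p in c)
        if total < best_sum:
            best_sum, best_clusters = total, clusters
    return best_clusters
```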
See here for a great video explanation (I recommend 2× speed).