UML XIII: Online Learning and Clustering

(This is the thirteenth post in a sequence on Machine Learning based on this book. Click here for part I.)
In the previous post, I mentioned the distinction between supervised and unsupervised learning. Here is a more comprehensive picture:
Supervised learning: learning a fixed predictor based on a fixed training sequence generated by a fixed distribution
Unsupervised learning: learning without training data
Online learning: updating a predictor over time based on data that arrives over time, generated by a process that might change and/or depend on past predictions
In particular, the performance of a predictor in supervised learning is measured after all learning has taken place, whereas, in online learning, the accuracy of our predictor matters at every step along the way.
As an example: while it is possible to model spam detection as a supervised learning problem, it is more accurate to model it as online learning, since (a), we may wish to update our spam filter continually, and (b), people composing spam emails may change their behavior based on existing spam filters. Online learning is also the model that comes closest to describing how real humans learn (especially children).
In this post, we’ll look at Online Learning as a whole and at clustering, which is a particular problem of unsupervised learning.
Online Learning
For this post, we restrict ourselves to binary classification problems, i.e., Y={0,1}.
Recall that, in supervised learning, we assume there is a fixed probability distribution D over X×Y according to which the environment generates labeled points. Here, each x∈X is a representation of the real-world thing we wish to label rather than the thing itself. For example, if the task is spam detection of emails, then x would be a vector of some of the email’s features (length, address of the sender, number of times the word “money” appears, number of images, …) rather than an encoding that includes the entire body of text. This means that two different emails may have the same feature representation, and, consequently, the same domain point x∈X might sometimes have label 1 and sometimes label 0.
In some cases, we may wish to assume that labels are deterministic regardless, either because it’s a realistic assumption for our particular problem, or for reasons of simplicity. If we do, we can alternatively model the environment as having a probability distribution D over just X, as well as a true labeling function henvironment:X→Y. The environment then generates labeled points by sampling x∈X according to D and outputting (x,henvironment(x)).
These last two paragraphs are pure repetition that I’ve included to precisely highlight the differences between supervised learning and online learning.
In the most general formulation of online learning, we do not assume that the labeled domain points are generated by a probability distribution D. This implies two differences from supervised learning:
(1): the process by which labels are generated (be it henvironment or the conditional probability distribution D(y|x)) is allowed to change over time; and
(2): the process by which domain points are generated is not fixed, and might even depend on our predictions for previous labels
Furthermore, I’ve stated in the introduction that
(3) accuracy of a learning algorithm is measured based on its performance throughout the process
More precisely, we model the learning process to proceed in rounds. In every round, first, the learner A is presented with some instance x∈X; then, it predicts a label (either 0 or 1); finally, it is given the true label for x. If the prediction was wrong, we say that A made a mistake, and the goal is to design algorithms that minimize the number of mistakes. Thus, we don’t have a training phase followed by a prediction phase as in supervised learning, but rather a single phase of both training and prediction. We still model this in terms of a sequence S=((x1,y1),...,(xm,ym)), but we will call it a data sequence (a term I made up) rather than a training sequence, and we measure how well an algorithm predicts each yi based on the (xj,yj) with j<i.
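To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. The learner and environment interfaces are my own invention, not the book’s – any algorithm and any adversary implementing these methods can be plugged in:

```python
def run_online_learning(learner, environment, num_rounds):
    """Run the online protocol and count the learner's mistakes."""
    mistakes = 0
    for _ in range(num_rounds):
        x = environment.next_point()           # environment picks a domain point
        y_hat = learner.predict(x)             # learner commits to a label in {0, 1}
        y = environment.true_label(x, y_hat)   # environment reveals the "true" label;
                                               # note it may depend on the prediction
        if y_hat != y:
            mistakes += 1
        learner.update(x, y)                   # learner may adapt for future rounds
    return mistakes
```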
It’s not difficult to see that, in this general setting, establishing learning guarantees is impossible. For example, if there is no correlation between the current label and past labels, an algorithm cannot do better than to guess. Worse, if the environment always chooses the label which A didn’t guess, A will inevitably make a mistake every round.
This means we have to make further assumptions to simplify the problem. One such assumption is that labels are consistent with (but not generated by) a predictor in our hypothesis class. That is, after all rounds are complete, there needs to be a predictor h:X→Y out of the class H such that all labels reported by the environment are consistent with h. However, unlike with supervised learning, we do not require that the environment “chooses” this function ahead of time. Instead, think of the environment as a bullshitter – it may throw you the least useful domain point at every step, and then declare your prediction wrong (whatever it is). As long as, in the end, it can show you a predictor in H that would have behaved in precisely this way, this is legal behavior.
It is not obvious, at least to me, that this is a particularly useful way to model things – which real-life problem behaves in this way? – but it’s what the book does, and I’m too short on time to do independent research, so we’ll just roll with it.
To recap the setting in full:
The learner A has access to the set X of domain points, the label set Y={0,1}, and a hypothesis class H⊆{f:X→Y}. Given any predictor he∈H and any sequence of points x1,...,xm∈X, the data sequence S=((x1,he(x1)),...,(xm,he(xm))) determines a process of m rounds. At round j, the learner A will be given a domain point xj and be asked to predict a label yj∈{0,1}. For every j∈[m] such that yj≠he(xj), A has made a mistake.
And our goal is to design A such that it will make as few mistakes as possible. Formally, A is just an algorithm, so we have a lot of freedom in this construction.
Note that we will measure the performance of A in terms of the highest number of mistakes across all possible data sequences. Thus, even though the formalization above makes it sound like the xj are fixed ahead of time, it is more accurate to think of them as being chosen by a malicious environment.
In the context of supervised learning, we’ve written A(S) to refer to the predictor A produces after having trained on the training sequence S. In online learning, this definition might still make sense (any reasonable algorithm will have its choices consistent with some predictor at every step); however, it’s not useful, because the performance of the final predictor isn’t what interests us. Instead, given a learner A and a data sequence S, we define MA(S) as the number of mistakes which A made throughout while predicting the labels of S, one by one. Note that MA(S)∈{0,...,|S|}.
Given a hypothesis class H, we then define MA(H):=sup{MA(S)|S∈S}, where S is the set of all data sequences that the environment is allowed to come up with (i.e., all data sequences that are consistent with some predictor in H). Note that S is allowed to contain sequences of arbitrary length. Therefore, depending on what A does, it might be that the set {MA(S)|S∈S} is unbounded, in which case MA(H)=∞. With that, we can define:
A hypothesis class H is learnable iff ∃ an algorithm A such that MA(H)≠∞.
This is a good time to think about how to construct A. Assume that the hypothesis class H is finite. How should A behave to make sure the environment can only screw it over for so long?
Here’s one possible algorithm, which we call Ahalving. This learner keeps track of a set V(t)⊆H of possible hypotheses at every time step t. When presented with a domain point xt∈X, it divides the set in two, depending on which label they give xt. I.e.:
V(t)−:={h∈V(t):h(xt)=0}
V(t)+:={h∈V(t):h(xt)=1}
We have V(t)=V(t)−⊔V(t)+ (disjoint union), which means that at least one of them contains half or more of the predictors in V(t). Our learner Ahalving chooses the label yt according to that set. I.e., if |V(t)+|≥(1/2)|V(t)| – that is, if half or more of the predictors in V(t) say that xt has label 1 – Ahalving outputs label 1.
If the environment declares the prediction wrong, Ahalving at least halves the remaining hypotheses by setting V(t+1):=V(t)− (in the case above where it predicted 1). That way, it can make at most log2(|H|) mistakes total.
Informally speaking, think of the hypothesis class H as the bullshitter’s ammunition. The only way we avoid making mistakes is to reduce this ammunition as fast as possible. Ahalving reduces it by at least 50% at each step.
Note that any A only gets to make a binary decision at each step. The decomposition V(t)=V(t)−⊔V(t)+ is a natural one, i.e., not specific to Ahalving. After the true label has been announced, the set of remaining candidates will always be one of the two – and each algorithm gets to decide which one.
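Here is a sketch of Ahalving in Python, for a finite hypothesis class given as a list of predictor functions (the class name and representation are mine, not the book’s):

```python
class HalvingLearner:
    """Predicts with the majority of the surviving hypotheses. A mistake means
    the majority was wrong, so the version space at least halves on each one."""

    def __init__(self, hypotheses):
        self.version_space = list(hypotheses)   # V(0) = H

    def predict(self, x):
        votes_for_1 = sum(1 for h in self.version_space if h(x) == 1)
        # Majority vote; ties are broken in favor of label 1.
        return 1 if 2 * votes_for_1 >= len(self.version_space) else 0

    def update(self, x, y):
        # Keep only the hypotheses consistent with the revealed label.
        self.version_space = [h for h in self.version_space if h(x) == y]
```

Plugged into the interaction loop from earlier, this learner makes at most log2(|H|) mistakes against any environment that stays consistent with H.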
L-Dimension
Unlike what one might naively suspect, the halving algorithm above is not the optimal way to minimize the number of mistakes. This is because it chooses the next subset (i.e., V(t)− or V(t)+) based on the number of hypotheses in each. However, number-of-hypotheses is a flawed measure of complexity, as you may recall from chapters I-III. In the context of supervised learning, a much better measure is the VC-dimension.
(If you’re not familiar with the VC-dimension, see post II. The one-sentence definition is that the VC-dimension of a hypothesis class H is the largest integer k such that there exists a set of k domain points in X that is shattered by H – where a set P⊆X is shattered by H iff, for every possible combination of labels of points in P, there exists a predictor h∈H assigning them these labels.)
Our goal now is to work out the analogous concept for this particular brand of online learning. This will be called the L-dimension, also named after a person (in this case, “Littlestone”). The L-dimension of a hypothesis class (written L-dim(H)) will be the largest integer k such that, for every learner A, there exists a data sequence S such that MA(S)=k. This means that MA(H)≥L-dim(H) for every algorithm A.
The L-dimension is lower-bounded by the VC-dimension. Informally, given a set P of k many points that is shattered by H, the environment can simply present A with all those points in any order and declare that all of A’s predictions were wrong. Since every combination of labels of points in P is realized by some predictor in H, this will be legit regardless of what A does. (Formally, one would have to construct a [data sequence based on the shattered set] on which A fails.)
The reverse is not true – a set of points that is not shattered could still be presented in an inconvenient order such that every learner fails.
Let’s take a close look at how both concepts are different. We can characterize the VC-dimension like so:
The VC-dimension of H is at least k⟺∃ a set of k domain points that is shattered by H
By vacuously quantifying over possible orders and labels, we can state the same in a more complicated way thus:
The VC-dimension of H is at least k⟺∃ a set of k domain points such that ∀ order on the set, ∀ combination of labels: no point's label follows from the previous ones
Similarly (although less cleanly since the formalism is trickier), we can characterize the L-dimension like so:
The L-dimension of H is at least k⟺∃f:(X×Y)∗→X such that, for k steps, ∀ combination of labels: no point's label follows from the previous ones
where f is the environment’s function that chooses the next domain point each round.
This highlights how exactly both concepts differ. The VC-dimension is about a set of points for which knowing the labels of any subset of them does not imply the label of the remaining points. The L-dimension is about the ability to pull out new points on the spot (depending on the labels of the previous ones) such that the labels of previous points don’t imply the labels of future points.
Sometimes, both concepts coincide. Consider the hypothesis class of all predictors that assign label 1 to three domain points and label 0 to all others, i.e.,
H3-positive:={f:R→Y : |{x∈R:f(x)=1}|=3}
If A starts by repeatedly guessing 0, it can make at most three mistakes – it’s not possible to present domain points in an order such that A doesn’t obtain the relevant information. This is because A’s knowledge is evenly distributed across all domain points it hasn’t seen yet; first, it doesn’t know the label of any, then it suddenly learns the label of them all. In such cases, nothing is “gained” (or “lost”, from A’s perspective) from the ability to choose points dynamically.
On the other hand, take the class of threshold predictors Hthreshold:={θx|x∈R} where θx(y):=1⟺y≥x, and consider what happens if A learns that the label of some domain point x is 1. In that case, it knows the threshold is to the left of x, which means that all points to the right of x also have label 1...
… but it doesn’t know the labels of the points to the left of x. In this case, A’s knowledge about domain points is unevenly distributed. In such a case, the ability to choose new points dynamically matters a lot – if we were trying to construct a shattered set, we would already have hit a wall.
Now, suppose A next guesses 1 on a point to the left, which the environment then declares to have label 0. Now the picture looks like so:
The area that A can classify safely has increased, but there is still the middle area that is up in the air. This allows the environment to present A with another point it doesn’t know yet – say the point right in the middle. Then, if A guesses label 1, the environment will declare that the label is 0. In that case, A knows that all points to the left of this new point are labeled 0, and the area it doesn’t know has been cut in half. Conversely, if A guesses label 0, the environment will declare that the label is 1. In that case, A knows that all points to the right of the new point are labeled 1, and the area it doesn’t know has been cut in half.
In both cases, the [area A cannot safely classify] is cut in half. Since the area is a segment of the real line, it can be cut in half arbitrarily often. This means that, even though VC-dim(Hthreshold)=1, we have L-dim(Hthreshold)=∞.
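This adversary is easy to simulate. Here’s a sketch, using exact rational arithmetic so the halving never runs out of precision (with floats, it would stall after roughly 50 rounds); the function name and interface are mine:

```python
from fractions import Fraction

def force_mistakes_on_thresholds(learner, num_rounds):
    """Adversary for H_threshold: every label it reveals stays consistent with
    some threshold in (lo, hi], yet the learner errs on every single round."""
    lo, hi = Fraction(0), Fraction(1)
    mistakes = 0
    for _ in range(num_rounds):
        x = (lo + hi) / 2                 # probe the middle of the unknown area
        y_hat = learner.predict(x)
        y = 1 - y_hat                     # always declare the prediction wrong
        if y == 0:
            lo = x                        # label 0: the threshold lies right of x
        else:
            hi = x                        # label 1: the threshold lies at or left of x
        learner.update(x, y)
        mistakes += 1
    return mistakes                       # equals num_rounds, for any learner
```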
With this bit of theory established, we can design another algorithm Alittle-L. Whereas Ahalving, confronted with the choice between V(t)+ and V(t)−, predicts according to the set with more predictors in it, Alittle-L predicts according to the class with the higher L-dimension – so that, when it is wrong, the surviving class is the one with the lower L-dimension.
If L-dim(V(t)−)=L-dim(V(t)+)=k, then L-dim(V(t)) is at least k+1. This is so because the environment can let A make a mistake at round t, followed by at least k more in subsequent rounds (recall what the L-dimension represents). Therefore, whenever Alittle-L makes a mistake, the L-dimension of V(t) decreases by at least 1. This implies that MAlittle-L(H)≤L-dim(H). Since MA(H)≥L-dim(H) holds for every algorithm A, this implies that Alittle-L is the optimal algorithm in this model.
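For a finite class over a finite domain, the L-dimension can be computed by brute-force recursion, and Alittle-L falls out directly. A sketch (exponential time, purely illustrative; all names are mine):

```python
def l_dim(hypotheses, domain):
    """L-dimension of a finite class: the deepest 'mistake tree' the
    environment can build. Exponential time; fine for toy examples."""
    if len(hypotheses) <= 1:
        return 0
    best = 0
    for x in domain:
        v0 = [h for h in hypotheses if h(x) == 0]
        v1 = [h for h in hypotheses if h(x) == 1]
        if v0 and v1:   # on x, the environment can force a mistake either way
            best = max(best, 1 + min(l_dim(v0, domain), l_dim(v1, domain)))
    return best

class LittleLLearner:
    """Predicts with the half of the version space of larger L-dimension, so
    every mistake strictly decreases the L-dimension of the version space."""

    def __init__(self, hypotheses, domain):
        self.version_space = list(hypotheses)
        self.domain = domain

    def predict(self, x):
        v0 = [h for h in self.version_space if h(x) == 0]
        v1 = [h for h in self.version_space if h(x) == 1]
        return 1 if l_dim(v1, self.domain) >= l_dim(v0, self.domain) else 0

    def update(self, x, y):
        self.version_space = [h for h in self.version_space if h(x) == y]
```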
Note that, if H has 2k elements, then MAhalving(H)≤k. Furthermore, L-dim(H)≤k and thus MAlittle-L(H)≤k. If the L-dimension is exactly k (this happens if H is simply the set of all possible predictors on a domain set with k elements), then both learners have identical mistake bounds. However, if L-dim(H)=d<k, then Alittle-L is guaranteed to make at most d mistakes, while Ahalving might make up to k.
Clustering
Clustering is about dividing a set of points into disjoint subsets, called clusters, which are meant to group similar elements together. For example, consider the following instance:
Here is a possible clustering:
Here’s another:
According to the book, there is no “ground truth” when it comes to clustering, in part because there could be multiple meaningful ways to cluster any given data set. I would dispute that claim – given a finite set of points, there trivially has to be an optimum clustering for any precise criterion. However, there is usually no way to evaluate how well this criterion has been met. Suppose you are given some data set, and run a clustering algorithm just to understand it a little better, without even knowing what you are looking for. There will be some clustering that provides you with an optimal amount of insight (otherwise, there wouldn’t be any point in having nontrivial algorithms), but it is impossible to tell whether any given clustering is optimal. Furthermore, upon seeing one clustering, you will learn some things about the data, which changes the metric of which clustering is most informative next.
Changing metrics aside, learning clustering from training data seems possible in principle. One could input several training sets and their respective optimal clusterings according to human judgment. However, this would require a significantly more complex formalism than what we utilized for supervised learning. For example, the quality of the clustering could no longer be evaluated locally – the same point might “belong” in a different cluster if the training set is extended.
The practical consequence is that one doesn’t have a “learner” in the same sense (at least, the book doesn’t discuss any such approaches). Instead, all one can do is define various algorithms that seem to make intuitive sense. This makes our job simpler; all the clustering algorithms we’ll look at are quite easy to understand. Consequently, and because there are pretty good resources out there, I’ll mostly skip describing the algorithms in detail and link to external resources instead.
Formalizing the setting
We assume our input is a finite metric space. That is, a pair (X,d) where X is any finite set and d:X×X→R is a metric on the set (it measures distances between points). Saying that d is a metric is equivalent to saying that it has the following four properties:
d(x,x)=0∀x∈X (#1)
d(x,y)=d(y,x)∀x,y∈X (symmetry)
d(x,y)>0∀x,y∈X s.t. x≠y (#3)
d(x,z)≤d(x,y)+d(y,z) (triangle inequality)
Something we do not require is the ability to draw a line between points, compute the midpoint of that line, or anything like that. Clustering algorithms work just based on distances. Given (X,d), the output of a clustering algorithm is a set C1,...,Ck such that X=C1⊔⋯⊔Ck. Some clustering algorithms also take k as an input parameter, i.e., they’re being told how many clusters they ought to put out.
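As a quick illustration, a finite metric space can be represented as an n×n distance matrix, and the four properties can be checked mechanically. A sketch (my own, not from the book):

```python
def is_metric(d, tol=1e-12):
    """Check the four metric properties for an n-by-n distance matrix d."""
    n = len(d)
    for i in range(n):
        if d[i][i] != 0:                                  # (#1) d(x,x) = 0
            return False
        for j in range(n):
            if d[i][j] != d[j][i]:                        # symmetry
                return False
            if i != j and d[i][j] <= 0:                   # (#3) positivity
                return False
            for k in range(n):
                if d[i][k] > d[i][j] + d[j][k] + tol:     # triangle inequality
                    return False
    return True
```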
Impossibility Results
Here’s something interesting, before we look at/link to explanations of the actual algorithms. You might have heard the claim that “there is no optimal voting system.” This is based on something called Arrow’s impossibility theorem, and I’m going to explain exactly what it means (this will be relevant for clustering in just a bit). Consider a situation where there are a bunch of candidates and a bunch of parties doing ranked-choice voting. Each party’s vote is an ordered list of how much they like all candidates. Now some voting system evaluates these votes and outputs an ordered list of candidates as the result.
Consider the following properties such a system might have:
Non-dictatorship: there is no one party such that the system always outputs exactly the order this party submitted
Independence of irrelevant alternatives: any party changing their order of C and D cannot change the system’s order of A and B. Only changes that are about either A or B can do that.
Pareto efficiency: if all votes prefer A over B, so must the system’s output
Everyone to whom all three properties seem highly desirable has a problem because Arrow’s Theorem states that no system can meet all three properties unless there are only two candidates. (If there are only two candidates, a majority vote with a deterministic way to break ties satisfies all three properties). The proof consists of taking a system that has properties #2 and #3 and showing that it is dictatorial (i.e., there is a party such that the system always outputs that party’s submission). If one considers #1 and #3 to be essential (and #2 somewhat less so), Arrow’s theorem can be summarized as “in every reasonable voting system with more than two candidates, strategic voting must be a thing.” These kinds of results are sometimes called impossibility theorems: we list a bunch of properties that all seem to make sense and then prove that they’re incompatible.
For clustering algorithms, we have something similar, except that it’s much less impressive because the properties aren’t as obviously desirable. Nonetheless, it’s worth bringing up. The properties are
Scale-invariance: multiplying all distances by a constant factor doesn’t change the clustering
Consistency: moving a point closer towards all points in a cluster cannot cause it to be dropped from that cluster, and moving it farther away from all points in a cluster cannot cause it to become part of that cluster
Richness: any clustering is possible (for some distance function)
If the distance function measures Euclidean distances in 2-dimensional space, then the last property says that, for any possible clustering, you can arrange your points such that the algorithm will return that clustering.
The impossibility theorem states that, while any two of these properties are achievable, no algorithm can achieve all three.
To me, consistency feels like an obvious requirement, but the other two less so. If one shares this view, the impossibility theorem can be summarized as “any reasonable clustering algorithm either depends on scale or is restricted to only a subset of possible clusters.”
In this case, the proof is simple. Suppose X={x1,...,xn} where n>1. We assume a clustering algorithm A that meets all three properties above and derive a contradiction.
One possible clustering is the trivial clustering, Ctrivial:={{x1},...,{xn}}. Due to the richness property, there must be some metric d∗ such that A(X,d∗)=Ctrivial. Furthermore, let C? be an arbitrary clustering other than Ctrivial. Due to richness, there has to, again, be some distance function d? such that A(X,d?)=C?.
Now we show that A(X,d∗)=A(X,d?), which will be our contradiction. We do this by transforming d∗ into d? in such a way that (due to the three properties of A) the clustering cannot change.
First, we change d∗ to dsmall by reducing all distances by the same factor such that the largest distance in dsmall becomes the smallest nonzero distance in d?. (Then, any distance in dsmall is at most as large as any nonzero distance in d?.) Formally, we set dsmall(x,y):=αd∗(x,y)∀x,y∈X where
α:=min{d?(x,y)|x,y∈X s.t. x≠y} / max{d∗(x,y)|x,y∈X s.t. x≠y}
Due to scale invariance, we have A(X,d∗)=A(X,dsmall). Well, and now we increase every distance from dsmall until it is equal to that of d?. Due to consistency, there cannot be any cluster in A(X,d?) that consists of more points than before, because every point either remained at the same distance to that cluster or moved farther away. Thus, since A(X,dsmall) had each point in its own cluster, so does A(X,d?) – but A(X,d?)=C?≠Ctrivial, which is our contradiction.
Finally, let’s take a (brief) look at two families of clustering algorithms.
I
Suppose we have some way of measuring the distance between a point and a cluster. Then, we can do the following:
Begin with the trivial clustering, i.e., C(0):=Ctrivial
Merge the cluster/point pair with minimal distance
(Note that |C(t+1)|=|C(t)|−1, i.e., we have one fewer cluster.)
Repeat the previous step until all points are in a single cluster
Output the entire history; at every time step, we have one possible clustering
Alternatively, we can stop based on some criterion (max distance or max number of clusters).
This is called agglomerative clustering. See here for a decent video explanation.
Note that every definition for the distance between a point p and a cluster C yields its own variant of the algorithm. Possible choices are
min{d(p,q)|q∈C}
max{d(p,q)|q∈C}
(1/|C|) ∑q∈C d(p,q)
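Here is a sketch of the merge loop, written for the common variant that measures distances between whole clusters (the point/cluster version above is the special case where one of them is a singleton). The linkage parameter aggregates the pairwise distances – min gives single linkage, max gives complete linkage; all names are mine:

```python
def agglomerative_clustering(points, d, linkage=min):
    """Repeatedly merge the two closest clusters; return the whole history,
    which contains one candidate clustering per time step."""
    clusters = [[p] for p in points]                  # C(0): the trivial clustering
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best_dist, best_pair = float("inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(d(p, q) for p in clusters[i] for q in clusters[j])
                if dist < best_dist:
                    best_dist, best_pair = dist, (i, j)
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history                                    # history[t] has |X| - t clusters
```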
II
A different approach is called the k-means algorithm, which works as follows:
Choose k points c1,...,ck at random, called the centers
Let the corresponding clusters be C1,...,Ck
Assign each point to the cluster whose center it is closest to. I.e., assign p∈X to Cj where j∈argmini∈{1,...,k}[d(p,ci)].
Move each center to the center of mass of its cluster. I.e., set cj:=(1/|Cj|)∑p∈Cj p. (If we are not in a vector space and cannot do this, instead choose cj∈argminx∈X∑p∈Cj d(x,p).)
Repeat the previous two steps until the assignment doesn’t move any point to a different cluster
The result depends heavily on the initial randomizing step. To remedy this, one can run the algorithm many times, and then choose the best clustering, where “best” can be measured in a couple of ways, such as by comparing the sums ∑j=1..k ∑p∈Cj d(cj,p)² and choosing the clustering with the smallest sum.
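A hedged sketch of the full procedure for points given as tuples of coordinates, with random restarts and the squared-distance sum as the selection criterion (parameter names are mine):

```python
import random

def k_means(points, k, num_restarts=10, max_iters=100):
    """Lloyd's algorithm with random restarts; returns the clustering whose
    sum of squared point-to-center distances is smallest."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best_sum, best_clusters = float("inf"), None
    for _ in range(num_restarts):
        centers = random.sample(points, k)        # initial centers: random points
        for _ in range(max_iters):
            # Assignment step: each point joins the cluster of its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
                clusters[j].append(p)
            # Update step: move each center to its cluster's center of mass
            # (an empty cluster keeps its old center).
            new_centers = [
                tuple(sum(cs) / len(c) for cs in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if new_centers == centers:            # assignments have stabilized
                break
            centers = new_centers
        total = sum(sq_dist(p, centers[j]) for j, c in enumerate(clusters) for p in c)
        if total < best_sum:
            best_sum, best_clusters = total, clusters
    return best_clusters
```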
See here for a great video explanation (I recommend 2× speed).