Having written a few papers about ERM and its variants, I feel personally attacked! I feel obliged to step in and try to defend ERM’s honour.
First of all, I don’t think I would call ERM a learning framework. ERM is a solution to the mathematical problem of PAC learning. Theoretical computer scientists like to precisely define the problem they want to solve before trying to solve it. When they were faced with the problem of learning, they decided that the mathematical problem of PAC learning was a good representation of the real-world problem of learning. Once you decide that PAC learning is what you want to solve, ERM is the solution. Whether PAC learning truly represents the real-world problem of learning is debatable, and theorists have always acknowledged as much, with the main shortcoming being the insistence on a worst-case probability distribution. But PAC learning is also a pretty solid first attempt at defining what learning means, one that, in fact, used to produce useful predictions about practical learning methods before neural networks took over. Neural networks seem to work even in cases where PAC learning can be proved not to apply, which means we need to come up with a better definition of what learning means. But PAC learning has some very neat ideas built into it that any better definition shouldn’t ignore.
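To make the ERM recipe concrete, here’s a minimal sketch (the threshold classifiers and all names below are my own illustrative choices, not anything from the post): just return whichever hypothesis in the class makes the fewest mistakes on the sample.

```python
# Minimal sketch of empirical risk minimization (ERM) for binary
# classification over a finite hypothesis class. Illustrative only.

def empirical_risk(h, sample):
    """Fraction of labelled examples (x, y) that hypothesis h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypotheses, sample):
    """Return the hypothesis with the smallest empirical risk on the sample."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# Toy usage: threshold classifiers on the real line.
sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
thresholds = [0.0, 0.25, 0.5, 0.75]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]
best = erm(hypotheses, sample)  # the threshold-0.5 classifier fits this sample perfectly
```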
So what’s so great about PAC learning? There’s a famous quote by Vladimir Vapnik: “When solving a problem of interest, do not solve a more general problem as an intermediate step.” PAC learning takes this advice in earnest. You point out that learning is inherently about probabilities, which is true. But inferring from this that we first need to model the probability distribution over (X, Y) is precisely what Vapnik warns against. In the end, we simply want to make predictions, and we should model that directly. You can say that a predictor is a (potentially randomized) algorithm that maps X’s to Y’s. The learner’s task is then to look at a finite sample and output a predictor that makes good predictions in expectation (where the expectation is taken over the true data distribution and the predictor’s randomness). If it turns out that the learner needs to somehow model the distribution over (X, Y) to come up with a good predictor, then so be it, but that shouldn’t be part of the definition of the problem. Having defined the problem this way, one soon realizes that randomness in the predictor f is unnecessary, for the same reason that when trying to guess the outcome of a (biased) coin toss the best guess is deterministic: simply guess the face that has the higher chance of showing up. Similarly, the best predictor for a classification problem is the deterministic predictor that, for each x, outputs the label that has the highest chance of being the true label for x. Outputting a deterministic predictor does not mean we haven’t modelled the inherent probabilities of the learning problem.
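Spelling out the coin-toss argument (the notation here is mine): suppose the coin shows heads with probability p > 1/2 and we guess heads with probability q. Then

$$\Pr[\text{wrong}] = q(1-p) + (1-q)p = p + q(1-2p),$$

which is decreasing in q whenever p > 1/2, so it is minimized by the deterministic choice q = 1, giving error 1 − p. Any probability placed on the less likely face only hurts, and applying the same pointwise argument at each x gives the deterministic predictor described above.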
Empirical risk minimization does not feel fundamentally confused to me. ERM is, in fact, an amazingly elegant solution to the problem of PAC learning. One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!
Looking back at this, I think this post is outdated and was trying a little too hard to be provocative. I agree with everything you say here. Especially: “One could reasonably say that PAC learning is somewhat confused, but learning theorists are working on it!”
Forgive my youthful naïveté. For what it’s worth, I think the generalization post in this sequence has stood the test of time much better.
Ah, I just noticed it’s an old post. I was just clicking through all the SLT links. :)