Just want to draw out and highlight something mentioned in passing in the “You want to solve a problem in as much generality as possible...” section. Not only would it be great if you could solve a problem in the worst case, the worst-case assumption is also often radically easier to think about than realistic cases. In some sense the worst-case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best-case assumption: “this problem never comes up”). My understanding is that proving theorems about average-case phenomena is a huge pain and often comes long after proofs of worst-case bounds.
I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don’t think it’s correct to call it the “second-simplest assumption”, since there are many choices of what facet of the environment is worst-case.
One keyword for this is “partial specification”, e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: https://arxiv.org/abs/1606.05313. (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler—in general it is way harder than just working with the empirical distribution in front of you.
My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it’s trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.
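To make the “so easy it’s trivial” point concrete, here is a toy enumeration of my own (not from the comment; it assumes performance means “best value seen after k black-box evaluations” on a tiny domain): averaged over all possible functions, every fixed evaluation order performs identically, and the worst case is identically bad for all of them, so plain brute force already attains worst-case optimality.

```python
# Toy sketch (illustrative assumption: performance = best value seen after k
# black-box evaluations). Over ALL functions on a tiny domain, every fixed
# non-repeating evaluation order has the same average performance (NFL flavor)
# and the same, maximally bad, worst-case performance, so brute force is
# already "worst-case optimal".
from itertools import product

X = range(4)                                     # tiny search domain
Y = range(3)                                     # possible objective values
all_functions = list(product(Y, repeat=len(X)))  # every f: X -> Y, as value tuples

def best_after_k(f, order, k):
    """Best objective value seen after evaluating the first k points of `order`."""
    return max(f[x] for x in order[:k])

def summarize(order, k):
    scores = [best_after_k(f, order, k) for f in all_functions]
    return min(scores), sum(scores) / len(scores)    # worst case, average case

for order in [(0, 1, 2, 3), (3, 1, 0, 2)]:           # two arbitrary fixed search orders
    worst, avg = summarize(order, k=2)
    print(order, "worst:", worst, "average:", round(avg, 3))
# Both orders print identical numbers: worst-case (and all-functions-average)
# performance cannot distinguish search strategies here.
```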
(However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)
Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.
And in fact, since $\min_x f(\theta, x) \le \mathbb{E}_x[f(\theta, x)]$, any solution that is acceptable in the worst case is also acceptable in the average case.
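Spelling out that step (my phrasing, under the assumption that “acceptable” means the objective stays at or above some threshold $\tau$):

$$\min_x f(\theta, x) \ge \tau \;\Longrightarrow\; \mathbb{E}_x\big[f(\theta, x)\big] \ge \min_x f(\theta, x) \ge \tau,$$

since an expectation of $f$ over $x$ can never fall below its minimum over the support of $x$.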
Agreed: although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily at least as hard as satisficing for the average case (and, in ML, typically impossible).
Even in ML it seems like it depends on how you formulate your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just “solve existing problems and then try the same thing tomorrow,” and generally I’d agree “solve an existing problem for which you can verify success” is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case, though it remains to be seen whether it’s feasible.
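As a concrete instance of the “low regret in the worst case” framing (my own sketch, not from the comment: the standard exponential-weights / Hedge algorithm, with losses assumed to lie in $[0,1]$ and a textbook learning-rate choice), the regret against the best fixed expert is $O(\sqrt{T \log N})$ for every loss sequence, with no distributional assumptions about the environment:

```python
# Minimal sketch of exponential weights / Hedge (assumes full-information
# per-round losses in [0, 1]). Its regret against the best single expert is
# O(sqrt(T log N)) for EVERY loss sequence, i.e. a worst-case guarantee.
import numpy as np

def hedge(losses, eta):
    """losses: (T, N) array of per-round expert losses in [0, 1]."""
    T, N = losses.shape
    log_w = np.zeros(N)                  # log-weights, for numerical stability
    total_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                     # current distribution over experts
        total_loss += p @ losses[t]      # expected loss incurred this round
        log_w -= eta * losses[t]         # exponential-weights update
    return total_loss - losses.sum(axis=0).min()   # regret vs. best fixed expert

T, N = 10_000, 20
rng = np.random.default_rng(0)
losses = rng.integers(0, 2, size=(T, N)).astype(float)  # arbitrary 0/1 losses
eta = np.sqrt(np.log(N) / T)             # standard learning rate (up to constants)
print("regret:", hedge(losses, eta), "sqrt(T log N):", np.sqrt(T * np.log(N)))
```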
To elaborate this formally:
$\max_\theta \max_x f(\theta, x)$ is best-case,
$\max_\theta \min_x f(\theta, x)$ is worst-case,
$\max_\theta \mathbb{E}_x[f(\theta, x)]$ is average-case.
$\max$ and $\min$ are both “easier” monoids than $\mathbb{E}$, essentially because of dominance relations; for any $\theta$, there is going to be a single $x$ that dominates all others, in the sense that all other $x' \neq x$ can be excluded from consideration and have no impact on the outcome. Whereas when calculating $\mathbb{E}$, the only $x'$ that can be excluded are those outside the distribution’s support.
$\max$ is even easier than $\min$ because it commutes with the outer $\max$; not only is there a single $x$ that dominates all others, it doesn’t necessarily even depend on $\theta$ (the problem can be solved as $\max_x \max_\theta f(\theta, x)$ or $\max_{\theta, x} f(\theta, x)$). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.
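A small numerical sketch of my own (an arbitrary $f$ on finite $\theta$- and $x$-grids, chosen purely for illustration) of both points: the three objectives above, and the fact that the inner $\max$ commutes with the outer $\max$ while the inner $\min$ does not.

```python
# Toy sketch: f[theta, x] on arbitrary finite grids (illustrative assumption).
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=(5, 7))       # 5 theta values, 7 x values
p_x = np.full(7, 1 / 7)           # distribution over x for the average case

best_case    = f.max(axis=1).max()   # max_theta max_x f(theta, x)
worst_case   = f.min(axis=1).max()   # max_theta min_x f(theta, x)
average_case = (f @ p_x).max()       # max_theta E_x[f(theta, x)]
print(best_case, worst_case, average_case)   # worst <= average <= best

# max commutes with the outer max: swapping the order changes nothing,
# so the dominating x need not depend on theta.
assert np.isclose(f.max(axis=1).max(), f.max(axis=0).max())

# min does NOT commute: max_theta min_x f <= min_x max_theta f, usually strictly,
# because the worst x depends on theta.
print(f.min(axis=1).max(), f.max(axis=0).min())
```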