Suppose there are two experts, and two horses. Expert 1 always predicts horse A, expert 2 always predicts horse B, the truth is that the winning horse cycles ABABABABABA… The frequentist randomizes choice of expert according to weights; the Bayesian always chooses the expert who currently has more successes, and flips a coin when the experts are tied. (Disclaimer: I am not saying that this is the only possible strategies consistent with these philosophies, I am just saying that that these seem like the simplest possible instantiations of “when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.”)
If the frequentist starts with weights (1,1), then we will have weights (1/2^k, 1/2^k) half the time and (1/2^k, 1/2^{k+1}) half the time. In the former case, we will guess A and B with equal probability and have a success rate of 50%; in the latter case, we will guess A (the most recent winner) with probability 2⁄3 and B with probability 1⁄3, for a success rate of 33%. On average, we achieve 5⁄12 = 41.7% success. Note that 0.583/0.5 = 1.166… < 1.39, as predicted.
On half the other horses, expert 1 has one more correct guess than expert 2, so the Bayesian will lose everyone of these bets. In addition, the Bayesian is guessing at random on the other horses, so his or her total success rate is 25%. Note that the experts are getting 50% success rates. Note that 0.75/0.5 = 1.5 < 2.41, as we expect.
As is usually the case, reducing the penalty from 1⁄2 to (for example) 0.9, would yield to slower convergence but better ultimate behavior. In the limit where the penalty goes to 1, the frequentist is achieving the same 50% that the “experts” are, while the Bayesian is still stuck at 25%.
Now, of course, one would hope that no algorithm would be so Sphexish as to ignore pure alternation like this. But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts’ models.
Well, I was trying to make the simplest possible example. We can of course add the monkey to our pool of experts. But part of the problem of machine learning is figuring out how long we need to watch an expert fail before we go to the monkey.
Consider all the possible outcomes of the races. Any algorithm will be
right half the time (on average for the non-deterministic ones), on any
subset of those races algorithms (other than random guessing) some
algorithms will do better than others. We’re looking for algorithms that do well in the subsets that match up to reality.
The more randomness in an algorithm, the less the algorithm varies across those subsets. By doing better in subsets that don’t match reality the weighted maximum algorithm does worse in the subsets that do, which are the ones we care about. There are algorithms that does better in reality, and they have less randomness. (Now if none can be reduced from giant lookup tables, that’d be interesting...)
But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts’ models.
How often are the models both perfectly contradictory and equal to chance? How often is reality custom tailored to make the algorithm fail?
Those are the cases you’re protecting against, no? I imagine there are more effective ways.
Suppose there are two experts, and two horses. Expert 1 always predicts horse A, expert 2 always predicts horse B, the truth is that the winning horse cycles ABABABABABA… The frequentist randomizes choice of expert according to weights; the Bayesian always chooses the expert who currently has more successes, and flips a coin when the experts are tied. (Disclaimer: I am not saying that this is the only possible strategies consistent with these philosophies, I am just saying that that these seem like the simplest possible instantiations of “when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.”)
If the frequentist starts with weights (1,1), then we will have weights (1/2^k, 1/2^k) half the time and (1/2^k, 1/2^{k+1}) half the time. In the former case, we will guess A and B with equal probability and have a success rate of 50%; in the latter case, we will guess A (the most recent winner) with probability 2⁄3 and B with probability 1⁄3, for a success rate of 33%. On average, we achieve 5⁄12 = 41.7% success. Note that 0.583/0.5 = 1.166… < 1.39, as predicted.
On half the other horses, expert 1 has one more correct guess than expert 2, so the Bayesian will lose everyone of these bets. In addition, the Bayesian is guessing at random on the other horses, so his or her total success rate is 25%. Note that the experts are getting 50% success rates. Note that 0.75/0.5 = 1.5 < 2.41, as we expect.
As is usually the case, reducing the penalty from 1⁄2 to (for example) 0.9, would yield to slower convergence but better ultimate behavior. In the limit where the penalty goes to 1, the frequentist is achieving the same 50% that the “experts” are, while the Bayesian is still stuck at 25%.
Now, of course, one would hope that no algorithm would be so Sphexish as to ignore pure alternation like this. But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts’ models.
A monkey who picked randomly between the experts would do better than both the “frequentist” and the “bayesian”. Maybe that should worry us...
Well, I was trying to make the simplest possible example. We can of course add the monkey to our pool of experts. But part of the problem of machine learning is figuring out how long we need to watch an expert fail before we go to the monkey.
Consider all the possible outcomes of the races. Any algorithm will be right half the time (on average for the non-deterministic ones), on any subset of those races algorithms (other than random guessing) some algorithms will do better than others. We’re looking for algorithms that do well in the subsets that match up to reality.
The more randomness in an algorithm, the less the algorithm varies across those subsets. By doing better in subsets that don’t match reality the weighted maximum algorithm does worse in the subsets that do, which are the ones we care about. There are algorithms that does better in reality, and they have less randomness. (Now if none can be reduced from giant lookup tables, that’d be interesting...)
How often are the models both perfectly contradictory and equal to chance? How often is reality custom tailored to make the algorithm fail?
Those are the cases you’re protecting against, no? I imagine there are more effective ways.