Also, I should note that for the sake of simplicity I’ve labeled everything that is non-Bayesian as a “frequentist” method, even though I think there’s actually a fair amount of variation among these methods, as well as a fair amount of overlap (e.g. I’m lumping statistical learning theory in with minimax estimation, which certainly overlap a lot in their ideas but were in some sense developed by different communities).
I realize that using the label in this way is in some sense very inaccurate and a gross over-simplification, but on the other hand being purely Bayesian means you are likely to reject all of these techniques, so I think it’s fair for the purposes of this post. And I think a large subset of the audience of this post may not even be aware of the distinctions between classical statistics, statistical learning theory, etc. Also I know both statistical learning theorists and classical statisticians who are happy to adopt the label of frequentist.
In what sense is it Bayesian?
I said at the beginning:
I realize that using the label in this way is in some sense very inaccurate and a gross over-simplification, but on the other hand being purely Bayesian means you are likely to reject all of these techniques, so I think it’s fair for the purposes of this post. And I think a large subset of the audience of this post may not even be aware of the distinctions between classical statistics, statistical learning theory, etc. Also I know both statistical learning theorists and classical statisticians who are happy to adopt the label of frequentist.
Without elaboration about what that strategy is, it sounds a lot like Bayesian updating based on others’ beliefs: I see that this person is betting more successfully than me, so I update my method to reflect theirs more closely.
This seems like a description of pretty much all statistical methods ever. Okay, that’s not quite true, but I think “update parameters to get closer to the truth over time” is a pretty generic property for an algorithm to have.
Now, in the case of horse-racing, it is true that what you maintain is a probability distribution over the other people. However, there are a few key differences. The first is that when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.
The other difference is that I update the weight I assign to person i by a multiplicative factor of, say, (1+a)^(-L(i,t)), where a is some constant and L(i,t) is the loss of person i in round t. The multiplicative factor is certainly reminiscent of Bayesian updating, and you could maybe find a way to interpret the updates as Bayesian updates under some well-motivated model, but it’s at least not obvious to me how to do so.
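To make the two differences concrete, here is a minimal Python sketch of the scheme just described, using 0/1 losses for simplicity; the function and variable names are mine and purely illustrative, not from any particular library:

```python
import random

def follow_the_experts(predictions, outcomes, a=0.5):
    """predictions[t][i] is person i's pick in round t; outcomes[t] is the true winner.

    Uses 0/1 loss for simplicity; any loss in [0, 1] works with the same update.
    Returns the list of picks the algorithm actually made.
    """
    n = len(predictions[0])
    weights = [1.0] * n
    picks = []
    for preds, truth in zip(predictions, outcomes):
        # Randomize: follow person i with probability proportional to weights[i].
        i = random.choices(range(n), weights=weights)[0]
        picks.append(preds[i])
        # Multiplicative update: person j's weight shrinks by (1+a)^(-loss).
        for j in range(n):
            loss = 0.0 if preds[j] == truth else 1.0
            weights[j] *= (1.0 + a) ** (-loss)
    return picks

# e.g. two people who always pick "A" and "B" respectively, over four races:
print(follow_the_experts([["A", "B"]] * 4, ["A", "B", "A", "B"]))
```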
As an aside, in the horse-racing case, this algorithm is called the weighted majority algorithm, and Eliezer spent an entire blog post talking about how he doesn’t like it, so at least someone other than me doesn’t think that it’s “just being Bayesian implicitly”.
And how do you expect to do better by using the weighted majority algorithm? Not in the worst case, but on average?
Suppose there are two experts and two horses. Expert 1 always predicts horse A, expert 2 always predicts horse B, and the truth is that the winning horse cycles ABABABABABA… The frequentist randomizes the choice of expert according to the weights; the Bayesian always chooses the expert who currently has more successes, and flips a coin when the experts are tied. (Disclaimer: I am not saying that these are the only possible strategies consistent with these philosophies, just that they seem like the simplest possible instantiations of “when I actually choose which person to follow on a given round, I randomize according to my weights, whereas a Bayesian would always want to deterministically pick the person with the highest expected value.”)
If the frequentist starts with weights (1,1), then we will have weights (1/2^k, 1/2^k) half the time and (1/2^k, 1/2^{k+1}) half the time. In the former case, we guess A and B with equal probability and have a success rate of 50%; in the latter case, we guess A (the most recent winner) with probability 2/3 and B with probability 1/3, for a success rate of 33%. On average, we achieve 5/12 ≈ 41.7% success. Note that our loss rate over the best expert’s, 0.583/0.5 = 1.166…, is below the factor of 1.39 the bound predicts.
On half the horses, expert 1 has one more correct guess than expert 2, so the Bayesian follows expert 1 and loses every one of these bets. On the other half, the experts are tied and the Bayesian is guessing at random, so his or her total success rate is 25%. Note that the experts each achieve a 50% success rate, and that the loss-rate ratio 0.75/0.5 = 1.5 is below the factor of 2.41 we expect.
As is usually the case, weakening the penalty (multiplying a wrong expert’s weight by, say, 0.9 instead of 1/2) would yield slower convergence but better ultimate behavior. In the limit as the penalty factor goes to 1, the frequentist achieves the same 50% that the “experts” do, while the Bayesian is still stuck at 25%.
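For anyone who wants to check the arithmetic, here is a small simulation of this toy scenario. The penalty parameter and the follow-the-leader reading of “Bayesian” are exactly as described above; everything else (names, round counts) is just an illustrative sketch:

```python
import random

def simulate(rounds=100_000, penalty=0.5):
    # Expert 1 always predicts "A", expert 2 always predicts "B";
    # the winning horse alternates A, B, A, B, ...
    preds = ["A", "B"]
    w = [1.0, 1.0]           # weights for the randomized strategy
    correct = [0, 0]         # success counts, used by follow-the-leader
    rand_wins = ftl_wins = 0
    for t in range(rounds):
        truth = "A" if t % 2 == 0 else "B"
        # Randomized strategy: follow an expert with probability proportional to its weight.
        i = random.choices([0, 1], weights=w)[0]
        rand_wins += preds[i] == truth
        # Follow-the-leader ("Bayesian" in the sense above): most successes so far, coin flip on ties.
        if correct[0] != correct[1]:
            leader = 0 if correct[0] > correct[1] else 1
        else:
            leader = random.randint(0, 1)
        ftl_wins += preds[leader] == truth
        # Update success counts and weights.
        for j in (0, 1):
            if preds[j] == truth:
                correct[j] += 1
            else:
                w[j] *= penalty
        # Renormalize so the weights never underflow; only their ratio matters.
        s = w[0] + w[1]
        w = [w[0] / s, w[1] / s]
    return rand_wins / rounds, ftl_wins / rounds

print(simulate(penalty=0.5))   # roughly (0.417, 0.25)
print(simulate(penalty=0.9))   # the randomized success rate climbs toward 0.5
```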
Now, of course, one would hope that no algorithm would be so Sphexish as to ignore pure alternation like this. But the point of the randomized weighted majority guarantee is that it holds (up to the correctness of the random number generator) regardless of how much more complicated reality may be than the experts’ models.
A monkey who picked randomly between the experts would do better than both the “frequentist” and the “Bayesian”. Maybe that should worry us...
Well, I was trying to make the simplest possible example. We can of course add the monkey to our pool of experts. But part of the problem of machine learning is figuring out how long we need to watch an expert fail before we go to the monkey.
Consider all the possible outcomes of the races. Averaged over every possible outcome sequence, any algorithm will be right half the time (in expectation, for the non-deterministic ones); but on any given subset of those sequences, some algorithms (other than random guessing) will do better than others. We’re looking for algorithms that do well on the subsets that match up to reality.
The more randomness in an algorithm, the less the algorithm varies across those subsets. By doing better in subsets that don’t match reality, the weighted majority algorithm does worse in the subsets that do, which are the ones we care about. There are algorithms that do better in reality, and they have less randomness. (Now if none can be reduced from giant lookup tables, that’d be interesting...)
How often are the models both perfectly contradictory and equal to chance? How often is reality custom-tailored to make the algorithm fail?
Those are the cases you’re protecting against, no? I imagine there are more effective ways.
I just presented you with an incredibly awesome algorithm, indeed one of the most awesome algorithms I can think of. I then showed how to use it to obtain a frequentist version of Solomonoff induction that is superior to the Bayesian version. Your response is to repeat the Bayesian party line. Is there really no respect for truth and beauty these days?
But okay, I’ll bite. Better than what? What is the “average” case here?
Well, I’m not familiar enough with Solomonoff induction to check your assertion that the frequentist induction is better, but your second question is easy. The average case would clearly mean calculating an expected regret rather than a bound. The proof is accurate, but it’s measuring a slightly wrong thing.
EDIT: Looking at the Blum paper, Blum even acknowledges the motivation for EY’s objection as a space for future work. (Conclusion 5.2.)
Expectation with respect to what distribution?
The distribution of the ‘expert advisors’ or whatever they actually are, their accuracy, and the actual events being predicted. I recognize this is difficult to compute (maybe Solomonoff-hard), and bounding the error is a good, very computable proxy. But it’s just a proxy; we care about the expected result, not the result assuming that the universe hates us and wants us to suffer.
If we had a bound for the randomized case, but no bound for the deterministic one, that would be different. But we have bounds for both, and they’re within a small constant multiple of each other. We’re not opening ourselves up to a small-probability, large-loss risk, so we just want the best expectation of value. And that’s going to be the deterministic version, not the randomized one.
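If one wanted to take that route, the computation would look something like the sketch below: posit some distribution over expert accuracies and race outcomes (the i.i.d. model here is entirely made up for illustration) and estimate each strategy’s expected loss by Monte Carlo, instead of quoting the worst-case guarantee.

```python
import random

def expected_loss(strategy, n_experts=5, rounds=200, trials=2000):
    """Monte Carlo estimate of a strategy's expected loss rate under one
    (entirely made-up) generative model: each expert i is independently
    correct with a fixed probability p_i drawn uniformly from [0.3, 0.8]."""
    total = 0.0
    for _ in range(trials):
        accuracies = [random.uniform(0.3, 0.8) for _ in range(n_experts)]
        # right[t][i] records whether expert i is correct in round t.
        right = [[random.random() < p for p in accuracies] for _ in range(rounds)]
        total += strategy(right) / rounds
    return total / trials

def randomized_wm(right, a=0.5):
    """Randomized weighted majority: follow expert i with prob ~ weight_i."""
    n = len(right[0])
    w = [1.0] * n
    mistakes = 0
    for row in right:
        i = random.choices(range(n), weights=w)[0]
        mistakes += not row[i]
        # Multiply a wrong expert's weight by (1+a)^(-1), then renormalize.
        w = [wi * ((1 + a) ** (-1) if not ok else 1.0) for wi, ok in zip(w, row)]
        s = sum(w)
        w = [wi / s for wi in w]
    return mistakes

def follow_the_leader(right):
    """Deterministically follow the expert with the most successes so far."""
    n = len(right[0])
    correct = [0] * n
    mistakes = 0
    for row in right:
        best = max(correct)
        leaders = [i for i, c in enumerate(correct) if c == best]
        i = random.choice(leaders)   # break exact ties at random
        mistakes += not row[i]
        for j, ok in enumerate(row):
            correct[j] += ok
    return mistakes

print("randomized weighted majority:", expected_loss(randomized_wm))
print("follow the leader:           ", expected_loss(follow_the_leader))
```

Which strategy comes out ahead depends entirely on the distribution you plug in, which is exactly the point of the question above.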