I think it would be silly to resist the idea that “X with probability P(X)” is equivalent to “~X with probability 1-P(X)”. This statement is simply true.
However, it does not imply that prediction lists like this should include X and ~X as possible claims. To see this, let’s consider person A who only lists “X, probability P”, and person B who lists “X, probability P, and ~X, probability 1-P”. Clearly these two are making the exact same claim about the future of the world. If we use an entropy rule to grade both of these people, we will find that, no matter the outcome, person B will have exactly twice the entropy (penalty) that person A has. So if we afterwards want to compare the results of two people, only one of whom doubled up on the predictions, there is an easy way to do it (just double the penalty for the one who didn’t). So far so good: everything is logically consistent, and making the same claim about the world still lets you easily compare results afterwards. Nevertheless, there are two (related) things that need to be pointed out, which is what I think all the controversy is over:
1) If, instead of the correct log weight rule, we use something stupid like a least-squares rule (or just eyeballing it per bracket), there is a significant difference between our people A and B above, precisely in their 50% predictions. For any probability assignment other than 50%, the error rates at probability P and at 1-P are related and opposite, since getting a probability P prediction right (say, X) means getting a probability 1-P prediction wrong (~X). But for 50% these two get added up (with our stupid scoring rules) before being used to deduce calibration results. As a result we find that our doubler, player B, will always have exactly half of his 50% predictions right, which will score really well on stupid scoring rules (as an extreme example, to a naive scoring rule somebody who predicts 50% on every claim, regardless of logical consistency, will seem to be perfectly calibrated).
2) Once we use a good scoring rule, i.e. the log rule, we can easily jump back and forth between people who double up on the claims and those who do not, as claimed/shown above.
In view of these two points I think that all of the magic is hidden in the scoring rule, not in the procedure used when recording the predictions. In other words, this doubling up does nothing useful. And since on calibration graphs people tend to think that getting half of your 50% predictions right is really good, I say that the doubling version is actually slightly more misleading. The solution is clearly to use a proper scoring rule, and then you can do whatever you wish. But in reality it is best not to confuse your audience by artificially creating more dependencies between your predictions.
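The two points above can be checked with a small sketch. Everything in it is an assumption made for illustration: the prediction list is invented, the function names are hypothetical, and the “entropy rule” is taken to be the summed negative log of the probability assigned to what actually happened.

```python
import math

def log_penalty(prob, outcome):
    """Negative log of the probability assigned to what actually happened."""
    return -math.log(prob if outcome else 1.0 - prob)

def double_up(predictions):
    """For each (prob, outcome) pair, also list the complementary claim."""
    doubled = []
    for prob, outcome in predictions:
        doubled.append((prob, outcome))
        doubled.append((1.0 - prob, not outcome))
    return doubled

def total_penalty(predictions):
    """Summed log penalty over the whole list (no 1/N normalisation)."""
    return sum(log_penalty(p, o) for p, o in predictions)

def bucket_hit_rate(predictions, prob):
    """Fraction of the predictions made at `prob` that came true."""
    hits = [o for p, o in predictions if math.isclose(p, prob)]
    return sum(hits) / len(hits)

# Invented data: person A got all three 50% claims right.
person_a = [(0.5, True), (0.5, True), (0.5, True), (0.8, True), (0.8, False)]
person_b = double_up(person_a)

print(total_penalty(person_a), total_penalty(person_b))
print(bucket_hit_rate(person_a, 0.5), bucket_hit_rate(person_b, 0.5))
```

On this made-up list, B’s summed penalty comes out as twice A’s, while B’s 50% bracket shows exactly half right even though A’s shows all of them right.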
X and ~X will always receive the same score by both the logarithmic and least-squares scoring rules that I described in my post, although I certainly agree that the logarithm is a better measure. If you dispute that point, please provide a numerical example.
Because of the 1/N factor outside the sum, doubling predictions does not affect your calibration score (as it shouldn’t!). This factor is necessary or your score would only ever get successively worse the more predictions you make, regardless of how good they are. Thus, including X and ~X in the enumeration neither hurts nor helps your calibration score (regardless of whether using the log or the least-squares rule).
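As a rough numerical check of the 1/N argument, here is a minimal sketch. The data is invented and the function names are hypothetical. The log rule is taken as the mean log probability assigned to the actual outcome, and the least-squares rule as the mean squared difference between the stated probability and the 0/1 outcome. Both are assumed forms and may differ in detail from the formulas in the main post.

```python
import math

def mean_log_score(predictions):
    """(1/N) * sum of log(probability assigned to what actually happened)."""
    total = sum(math.log(p if o else 1.0 - p) for p, o in predictions)
    return total / len(predictions)

def mean_squared_error(predictions):
    """(1/N) * sum of (stated probability - outcome as 0 or 1) squared."""
    total = sum((p - (1.0 if o else 0.0)) ** 2 for p, o in predictions)
    return total / len(predictions)

def double_up(predictions):
    """List every claim together with its complement."""
    return predictions + [(1.0 - p, not o) for p, o in predictions]

# Invented data for illustration only.
original = [(0.5, True), (0.6, False), (0.7, True), (0.9, True), (0.95, False)]
doubled = double_up(original)

print(mean_log_score(original), mean_log_score(doubled))
print(mean_squared_error(original), mean_squared_error(doubled))
```

Each complementary claim contributes the same term as the claim it mirrors, so the sum doubles, N doubles, and the average stays put under either rule.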
I agree that eyeballing a calibration graph is no good either. That was precisely the point I made with the lottery ticket example in the main post, where the prediction score is lousy but the graph looks perfect.
I agree that there’s no magic in the scoring rule. Doubling predictions is unnecessary for practical purposes; the reason I detail it here is to make a very important point about how calibration works in principle. This point needed to be made, in order to address the severe confusion that was apparent in the Slate Star Codex comment threads, because there was widespread disagreement about what exactly happens at 50%.
I think we both agree that there should be no controversy about this. However, go ahead and read through the SSC thread to see how many absurd solutions were being proposed! That’s what this post is responding to! Enumerating both X and ~X in the bookkeeping of predictions is a move to which there is no possible objection, because it is no different from the original prediction, nor does it affect a proper score in any way. What it makes clear is that there is no reason to treat 50% as though it has special properties that 50.01% lacks, and there’s certainly no reason to think that there is any significance to the choice between writing “X, with probability P” and “~X, with probability 1-P”, even when P=50%.
If you still object to doubling the predictions, you can instead choose to take Scott’s predictions and replace all X with ~X, and all P with 1-P. Do you agree that this new set should be just as representative of Scott’s calibration as his original prediction set?