X and ~X will always receive the same score by both the logarithmic and least-squares scoring rules that I described in my post, although I certainly agree that the logarithm is a better measure. If you dispute that point, please provide a numerical example.
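Here is a minimal sketch of that symmetry. It assumes the per-prediction log score is the log of the probability assigned to the outcome that actually occurred, and that the least-squares rule is (P - outcome)^2 with outcome 1 or 0. The function names are hypothetical illustrations, not code from the post.

```python
import math

def log_score(p: float, happened: bool) -> float:
    """Log of the probability assigned to the outcome that occurred (higher is better)."""
    return math.log(p if happened else 1.0 - p)

def squared_error(p: float, happened: bool) -> float:
    """Least-squares penalty (p - outcome)^2 with outcome 1 or 0 (lower is better)."""
    outcome = 1.0 if happened else 0.0
    return (p - outcome) ** 2

# "X with probability p" versus "~X with probability 1 - p":
# ~X happened exactly when X did not, so both statements get the same score.
for p in (0.1, 0.5, 0.7, 0.99):
    for x_happened in (True, False):
        assert math.isclose(log_score(p, x_happened), log_score(1 - p, not x_happened))
        assert math.isclose(squared_error(p, x_happened), squared_error(1 - p, not x_happened))
```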
Because of the 1/N factor outside the sum, doubling predictions does not affect your calibration score (as it shouldn’t!). This factor is necessary; without it, your score would only ever get worse the more predictions you make, regardless of how good they are. Thus, including both X and ~X in the enumeration neither hurts nor helps your calibration score (regardless of whether you use the log or the least-squares rule).
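A short sketch of the averaging argument, under the same assumed scoring rules as above. The prediction record is made-up illustrative data, and `with_negations` is a hypothetical helper that lists both X (probability p) and ~X (probability 1 - p) for every entry. Since each negated entry scores the same as its original, the doubled sum is exactly twice the original sum, and dividing by 2N instead of N cancels that.

```python
import math

def mean_log_score(predictions):
    """Average log score; the 1/N factor is the division by len(predictions)."""
    return sum(math.log(p if happened else 1.0 - p) for p, happened in predictions) / len(predictions)

def mean_squared_error(predictions):
    """Average least-squares penalty (p - outcome)^2."""
    return sum((p - (1.0 if happened else 0.0)) ** 2 for p, happened in predictions) / len(predictions)

def with_negations(predictions):
    """Enumerate both X (probability p) and ~X (probability 1 - p) for each prediction."""
    return predictions + [(1.0 - p, not happened) for p, happened in predictions]

# Hypothetical record of (stated probability of X, whether X happened).
record = [(0.9, True), (0.6, False), (0.5, True), (0.8, True), (0.3, False)]
doubled = with_negations(record)

# Twice as many entries, but the averaged scores match the original record.
print(len(record), len(doubled))
print(mean_log_score(record), mean_log_score(doubled))
print(mean_squared_error(record), mean_squared_error(doubled))
```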
I agree that eyeballing a calibration graph is no good either. That was precisely the point I made with the lottery ticket example in the main post, where the prediction score is lousy but the graph looks perfect.
I agree that there’s no magic in the scoring rule. Doubling predictions is unnecessary for practical purposes; the reason I detail it here is to make a very important point about how calibration works in principle. This point needed to be made to address the severe confusion in the Slate Star Codex comment threads, where there was widespread disagreement about what exactly happens at 50%.
I think we both agree that there should be no controversy about this. However, go ahead and read through the SSC thread to see how many absurd solutions were being proposed! That’s what this post is responding to! Enumerating both X and ~X in the bookkeeping of predictions is a move to which there can be no objection, because it says nothing different from the original prediction and does not affect a proper score in any way. What it makes clear is that there is no reason to treat 50% as though it had special properties different from those of 50.01%, and there’s certainly no reason to think that there is any significance to the choice between writing “X, with probability P” and “~X, with probability 1-P”, even when P=50%.
If you still object to doubling the predictions, you can instead take Scott’s predictions and replace every X with ~X and every P with 1-P. Do you agree that this new set should be just as representative of Scott’s calibration as his original prediction set?
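A minimal sketch of that relabeling, again assuming the log and least-squares rules described above and using a made-up record rather than Scott’s actual predictions; `flip` is a hypothetical helper. At exactly 50%, flipping turns a prediction that “came true” into one that “failed” (and vice versa), yet the score of the whole set does not move.

```python
import math

def mean_log_score(predictions):
    """Average log score over (p, happened) pairs, where p is the stated probability of X."""
    return sum(math.log(p if happened else 1.0 - p) for p, happened in predictions) / len(predictions)

def mean_squared_error(predictions):
    """Average least-squares penalty (p - outcome)^2 over the same pairs."""
    return sum((p - (1.0 if happened else 0.0)) ** 2 for p, happened in predictions) / len(predictions)

def flip(predictions):
    """Rewrite every 'X with probability p' as '~X with probability 1 - p'."""
    return [(1.0 - p, not happened) for p, happened in predictions]

# Hypothetical record of (stated probability of X, whether X happened).
record = [(0.9, True), (0.6, False), (0.5, True), (0.5, False), (0.8, True)]
flipped = flip(record)

# The two 50% entries swap between "hit" and "miss" after flipping.
print(flipped[2], flipped[3])  # (0.5, False) (0.5, True)

# The averaged scores are the same for both ways of writing the set.
print(mean_log_score(record), mean_log_score(flipped))
print(mean_squared_error(record), mean_squared_error(flipped))
```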