I’m not sure I’m convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as half “right” and half “wrong” (or to give them any other special treatment of this sort), then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say, 0.50001.
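To make the jump concrete, here is a minimal sketch (the 70% frequency is an invented number, purely for illustration):

```python
# Expected y-value of a calibration-plot bucket under the "count 50% predictions
# as half right, half wrong" policy: the 0.5 bucket is pinned at exactly 0.5,
# while any other bucket shows the empirical frequency of the predicted events.

def expected_plotted_value(confidence, true_frequency):
    if confidence == 0.5:
        return 0.5            # special-cased: always exactly half "right"
    return true_frequency     # otherwise: how often those events actually happen

# Suppose the events behind these predictions actually happen 70% of the time.
print(expected_plotted_value(0.5, 0.7))      # 0.5
print(expected_plotted_value(0.50001, 0.7))  # 0.7  <- discontinuous jump across 0.5
```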
This is true, but it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use (you’ll see it at some point when changing from 0.49999 to 0.50001).
To really get rid of this problem, you’d need a cleverer trick: e.g. you could draw a single unified curve on the full interval from 0 to 1, and then draw a second version of it that is its point reflection through the point (0.5, 0.5).
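A rough sketch with made-up data; the only real content here is the reflection rule, which sends each point (p, f) to (1 − p, 1 − f), since predicting X with probability p is the same claim as predicting not-X with probability 1 − p:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented calibration data over the full interval from 0 to 1.
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])       # stated probabilities
f = np.array([0.15, 0.35, 0.55, 0.65, 0.95])  # observed frequencies

plt.plot(p, f, marker="o", label="calibration curve")
plt.plot(1 - p, 1 - f, marker="o", label="point reflection through (0.5, 0.5)")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("stated probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```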
is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not?
Yes, this appears to be the crux here. My intuitive prior is against this “single blind spot” theory, but I don’t have any evidence beyond Occam’s razor and what I tend to observe in the statistics of my own prediction results.
Relatedly: your plot doesn’t (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there.
I’m not sure what exactly you think it doesn’t distinguish. The same proportional difference, but with more predictions, is in fact more evidence for miscalibration (which is what my graph shows).
Yes, but it’s not evidence for more miscalibration, and I think “how miscalibrated?” is usually at least as important a question as “how sure are we of how miscalibrated?”.
Sure. So “how miscalibrated” is simply the proportional difference between the values of the two curves. I.e. if you adjust the scales of the graphs to make them the same size, it’s simply how far apart they appear visually.
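To illustrate the two questions with made-up numbers (I’m using the gap between observed and stated frequency as a stand-in for the gap between the two curves, and a simple binomial standard error as a stand-in for however you’d actually quantify the evidence):

```python
from math import sqrt

def miscalibration_and_evidence(stated_p, n_correct, n_total):
    observed = n_correct / n_total
    gap = observed - stated_p                       # "how miscalibrated"
    se = sqrt(stated_p * (1 - stated_p) / n_total)  # binomial standard error
    z = gap / se                                    # "how sure are we of it"
    return gap, z

# Same gap, ten times as many predictions: the miscalibration is identical,
# but the evidence for it is much stronger.
print(miscalibration_and_evidence(0.7, 12, 20))    # gap -0.10, z ~ -1.0
print(miscalibration_and_evidence(0.7, 120, 200))  # gap -0.10, z ~ -3.1
```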
adjust the scales of graphs to make them the same size
Note that if you have substantially different numbers of predictions at different confidence levels, you will need to do this adjustment within a single graph. That was the point of my remark about maybe using a logarithmic scale on the y-axis. But I still think that would be confusing.
it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use
No, you don’t always have a discontinuity. You have to throw out predictions at 0.5, but this could be a consequence of a treatment that is continuous as a function of p. You could simply weight predictions so that those close to 0.5 count less. I don’t know if that is reasonable for your approach, but something like it is forced on us anyway. For example, if you want to know whether you are overconfident at 0.5 + ε, you need on the order of 1/ε² predictions. It is not just that calibration is impossible to discern at 0.5; it is also difficult to discern near 0.5.
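For example, the simplest continuous choice would be something like this (just an illustration; I’m not claiming it’s the right weighting for your graphs):

```python
# Weight each prediction by how far its probability is from 0.5, so the weight
# goes to zero continuously at p = 0.5 instead of 0.5 being a special case.

def weight(p):
    return 2 * abs(p - 0.5)   # 0 at p = 0.5, rising to 1 at p = 0 or p = 1

for p in (0.5, 0.50001, 0.6, 0.9):
    print(p, round(weight(p), 5))   # 0.0, 2e-05, 0.2, 0.8
```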
Yes, thank you, I was speaking about a narrower set of options (the ones we were considering).
I don’t currently have an elegant idea about how to do the weighting (but I suspect that to fit in nicely, it would most likely be done by subtraction rather than multiplication).