I’m not a fan of the traditional method—I am particularly unenthusiastic about the way it depends on allowing only a limited number of specific probability estimates—but I could do with a little more information and/or persuasion before being convinced that this proposal is Doing It Right.
If I have one of your graphs, how do I (1) quantify (if I want to) how well/badly I’m doing and (2) figure out what I need to change by how much?
Consider the graph you took from Slate Star Codex (incidentally, you have a typo—it says “Start” in your post at present). If I’m Scott looking at that graph, I infer that maybe I should trust myself a little more when I feel 70% confident of something, and that maybe I’m not distinguishing clearly between 70% and 80%; and that when I feel like I have just barely enough evidence for something to mention it as a 50% “prediction”, I probably actually have a little bit more. And, overall, I see that across the board I’m getting my probabilities reasonably close, and should probably feel fairly good about that.
(Note just in case there should be the slightest doubt: I am not in fact Scott.)
On the other hand, if I’m Scott looking at this
which is, if I’ve done it right, the result of applying your approach to his calibration data … well, I’m not sure what to make of it. By eye and without thinking much, it looks as if it gets steadily worse at higher probabilities (which I really don’t think is a good description of the facts); since it’s a cumulative plot, perhaps I should be looking at changes in the gap sizes, in which case it correctly suggests that 0.5 is bad, 0.6 isn’t, 0.7 is, 0.8 isn’t … but it gives the impression that what happens at 0.9-0.99 is much worse than what happens at lower probabilities, and I really don’t buy that. And (to me) it doesn’t give much indication of how good or bad things are overall.
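(In case it helps to check whether I have in fact done it right, here is roughly the construction I used. This is only a sketch: the per-level counts are made up rather than Scott’s actual numbers, and I’m assuming the two curves are meant to be the cumulative actual and expected numbers of failed predictions.)

```python
# Hypothetical per-confidence-level counts, i.e. NOT Scott's real numbers:
# stated confidence -> (number right, number wrong).
counts = {
    0.5: (10, 12),
    0.6: (20, 14),
    0.7: (41, 12),
    0.8: (30, 7),
    0.9: (15, 2),
    0.95: (8, 1),
    0.99: (5, 0),
}

levels = sorted(counts)
actual_failures, expected_failures = [], []
running_actual = running_expected = 0.0
for p in levels:
    right, wrong = counts[p]
    running_actual += wrong                        # failures actually observed
    running_expected += (1 - p) * (right + wrong)  # failures a perfectly calibrated
                                                   # predictor would expect at level p
    actual_failures.append(running_actual)
    expected_failures.append(running_expected)

# The plotted gap at levels[i] is expected_failures[i] - actual_failures[i]:
# the number of failures "missing" from all predictions up to that confidence.
```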
Do you have some advice on how to read one of your graphs? What should Scott make of the graph shown above? Do you think the available evidence really does indicate a serious problem around 0.9-0.99?
I also wonder if there’s mileage in trying to include some sort of error bars, though I’m not sure how principled a way there is to do that. For instance, we might say “well, for all we know the next question of each type might have gone either way” and plot corresponding curves with 1 added to all the counts:
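(In code terms, what I have in mind is roughly the following. Again this is only a sketch, building on the construction above, and “add 1 to every count” is a crude stand-in for anything more principled.)

```python
def cumulative_curves(counts, extra=0):
    """Cumulative (actual, expected) failure curves, as in the earlier sketch.

    `counts` maps stated confidence -> (number right, number wrong).
    With extra=1, every level gets one more "right" and one more "wrong",
    i.e. the next question of each type is allowed to have gone either way.
    Plotting the extra=0 and extra=1 curves together gives a rough range.
    """
    levels = sorted(counts)
    actual, expected = [], []
    running_actual = running_expected = 0.0
    for p in levels:
        right, wrong = counts[p]
        right, wrong = right + extra, wrong + extra
        running_actual += wrong
        running_expected += (1 - p) * (right + wrong)
        actual.append(running_actual)
        expected.append(running_expected)
    return levels, actual, expected
```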
The way the ranges overlap at the right-hand side seems to me to back up what I was saying above about the data not really indicating a serious problem for probabilities near 1.
First of all, thanks for your insightful reply. Let me know if you feel that I haven’t done it justice with my counter-reply below.
I’ll start by pointing out that you got the graph slightly wrong by including some 50% predictions in one curve and some in the other. Instead, include all of them in both or in none (actually, it would be even better to force-split them into perfect halves). I neglected to point this out in my description, but obviously 50% predictions aren’t any more “failed” than “successful” [edit: fixed now].
Here’s my version:
It’s still pretty similar to what you made, and your other concerns remain valid.
As you correctly noted, the version of the graph I’m proposing is cumulative. So at any point on the horizontal axis, the divergence between the two lines tells you how badly you are doing with your predictions up to that level of confidence.
Looking at changes in gap sizes has a bigger noise problem. With a lot of data, sure, you can look at the graph and say “the gap seems to be growing fastest around 70%, so I’m underconfident around that range”. But this is pretty weak.
If instead I look at the graph above and say “the gap between the lines grows to around 20 expected predictions by 90%, so I can be pretty certain that I’m underconfident, and it seems to be happening somewhere in the 65-90% bracket”… then this is based on many more data points, and I also have a quantifiable piece of information: I’m missing around 20 additional hypothetical failed predictions to be perfectly calibrated.
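To make the arithmetic explicit (the bracket size below is hypothetical, since I don’t have the exact per-level counts in front of me):

```python
# Illustrative numbers only, not Scott's actual counts.
missing_failures = 20          # gap between the curves, read off at the 0.9 mark
predictions_in_bracket = 200   # hypothetical number of predictions in the 65-90% bracket

# Expected minus actual failures equals excess successes, so on average the
# observed success rate in that bracket exceeds the stated confidence by:
average_underconfidence = missing_failures / predictions_in_bracket
print(average_underconfidence)  # 0.1, i.e. roughly 10 percentage points
```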
Also, from my graph, the 70% level does not appear very special. It looks more like the whole 70%-90% range has a slight problem, and I would conclude that I don’t have enough information to say whether this is happening specifically at the 70% confidence level. So the overall result is somewhat different from that of the “traditional” method.
As for the values above 90%, I’d simply ignore them—there are too few predictions there for the result to be significant. Your idea of brackets or error bars might help to visualize that in the high ranges you’d need much more data to get significant results. I’m of course happy for people to add them whenever they think they’re helpful—or to simply cut the graph at some reasonable value like 0.9 if they don’t have much data.
[Note: maybe someone could alert Scott to this discussion?]
I’m not sure I’m convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half “right” and half “wrong” (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say, 0.50001.
the 70% level does not appear very special. It looks more like the whole 70%-90% range has a slight problem
Interesting observation. I’m torn between saying “no, 70% really is special and your graph misleads by drawing attention to the separation between the lines rather than how fast it’s increasing” and saying “yes, you’re right, 70% isn’t special, and the traditional plot misleads by focusing attention on single probabilities”. I think adjudicating between the two comes down to how “smoothly” it’s reasonable to expect calibration errors to vary: is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not? If we had an agreed answer to that question, actually quantifying our expectations of smoothness, then we could use it to draw a smooth estimated-calibration curve, and it seems to me that that would actually be a better solution to the problem.
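(For what it’s worth, here is one way such a smooth curve could be estimated. This is purely my own sketch, not anything from your post, and the “smoothness” assumption is reduced to an arbitrary kernel bandwidth.)

```python
import math

def smooth_calibration(preds, bandwidth=0.05, grid_step=0.01):
    """Kernel-smoothed estimate of P(correct | stated probability).

    `preds` is a list of (stated_probability, outcome) pairs, outcome 0 or 1.
    A Gaussian kernel of width `bandwidth` encodes the assumption that
    calibration varies smoothly with the stated probability.
    """
    grid = [i * grid_step for i in range(int(round(1 / grid_step)) + 1)]
    curve = []
    for x in grid:
        weights = [math.exp(-0.5 * ((p - x) / bandwidth) ** 2) for p, _ in preds]
        total = sum(weights)
        if total > 0:
            estimate = sum(w * o for w, (_, o) in zip(weights, preds)) / total
            curve.append((x, estimate))
    return curve
```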
Relatedly: your plot doesn’t (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there. Hmm, perhaps if we used a logarithmic scale for the vertical axis or something it would help with that, but I think that would make it more confusing.
I’m not sure I’m convinced by what you say about 50% predictions. Rather than making an explicit counterargument, I will appeal to one of the desirable features you mentioned in your blog post, namely continuity. If your policy is to treat all 50% predictions as being half “right” and half “wrong” (or any other special treatment of this sort) then you introduce a discontinuity as the probability attached to those predictions changes from 0.5 to, say, 0.50001.
This is true, but it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use (you’ll see it at some point when changing from 0.49999 to 0.50001).
To really get rid of this problem, you’d need a more clever trick: e.g. you could draw a single unified curve on the full interval from 0 to 1, and then draw another version of it that is its point reflection (using the (0.5, 0.5) point as center).
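Concretely, the reflection itself just maps (x, y) to (1 − x, 1 − y). A minimal sketch, assuming both axes of the unified curve have been normalized to [0, 1]:

```python
def reflect_through_center(points):
    """Point reflection through (0.5, 0.5): maps (x, y) to (1 - x, 1 - y).

    `points` is a list of (x, y) pairs describing the unified curve,
    with both coordinates assumed to lie in [0, 1].
    """
    return sorted((1 - x, 1 - y) for x, y in points)
```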
is it really plausible that Scott has some sort of weird miscalibration blind spot at 70%, or not?
Yes, this appears to be the crux here. My intuitive prior is against this “single blind spot” theory, but I don’t have any evidence beyond Occam’s razor and what I tend to observe in statistics of my personal prediction results.
Relatedly: your plot doesn’t (I think) distinguish very clearly between more predictions at a given level of miscalibration, and more miscalibration there.
I’m not sure what exactly you think it doesn’t distinguish. The same proportional difference, but with more predictions, is in fact more evidence for miscalibration (which is what my graph shows).
Yes, but it’s not evidence for more miscalibration, and I think “how miscalibrated?” is usually at least as important a question as “how sure are we of how miscalibrated?”.
Sure. So “how miscalibrated” is simply the proportional difference between the values of the two curves. I.e., if you adjust the scales of graphs to make them the same size, it’s simply how far apart they appear to be visually.
adjust the scales of graphs to make them the same size
Note that if you have substantially different numbers of predictions at different confidence levels, you will need to do this adjustment within a single graph. That was the point of my remark about maybe using a logarithmic scale on the y-axis. But I still think that would be confusing.
it is also true that the discontinuity is always present somewhere around 0.5 regardless of which method you use
No, you don’t always have a discontinuity. You have to throw out predictions at 0.5, but this could be a consequence of a treatment that is continuous as a function of p. You could simply weight predictions and say that those close to 0.5 count less. I don’t know if that is reasonable for your approach, but similar things are forced upon us. For example, if you want to know whether you are overconfident at 0.5+ε you need 1/ε predictions. It is not just that calibration is impossible to discern at 0.5, but it is also difficult to discern near 0.5.
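For instance (an illustrative choice only, nothing canonical): give each prediction a weight that grows with its distance from 0.5, which is continuous in p and vanishes at exactly 0.5.

```python
def weight(p):
    """Illustrative continuous weight: predictions near 0.5 count for less.

    w(0.5) = 0, w(0) = w(1) = 1, and w varies continuously with p, so there
    is no discontinuity anywhere as a prediction's probability changes.
    """
    return 2 * abs(p - 0.5)
```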
Yes, thank you. I was speaking about a narrower set of options (the ones we were considering).
I don’t currently have an elegant idea about how to do the weighting (but I suspect that to fit in nicely it would most likely be done by subtraction rather than multiplication).