We could test our calibration by simply answering a lot of these pairs of questions, then applying a proper scoring rule. But that seems like throwing out information. Surely we could calibrate faster if we’re allowed to use our accuracy as evidence?
Huh? No, that’s getting more information, not throwing out information. You can’t calibrate if you don’t know your accuracy, so it’s not clear to me what you mean by “allowed.”
Essentially, the only information you can get from a single datapoint is whether or not you think 0 or 100 are acceptable as probabilities (hint: they aren’t). If you want a useful calibration, you need lots of datapoints.
Not what I meant. For “simply,” read “merely.” I mean that when you get lots of datapoints, and calibrate from them, you can use your accuracy on the first parts in a richer way than just the binary check of whether they fell within your confidence intervals or not.
The way this is typically done is by eliciting more than two numbers to build the distribution out of. For example, I might ask you for a date so early that you think there’s only a 5% chance it happened before that date, then a date so late that you think there’s only a 5% chance it happened after that date, then try to figure out the tertiles or quartiles.
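To sketch what you can already do with just those two elicited numbers, here’s a toy example; the dates and the normal-shape assumption are invented for illustration, not part of any standard procedure:

```python
from statistics import NormalDist

# Invented elicitation: say I think there's only a 5% chance the event
# happened before 1850 and only a 5% chance it happened after 1910.
q05, q95 = 1850, 1910

# Assumed model: treat the belief as a normal distribution. The 5th and
# 95th percentiles sit about 1.645 standard deviations from the mean.
mu = (q05 + q95) / 2
sigma = (q95 - q05) / (2 * 1.645)
belief = NormalDist(mu, sigma)

# Tertile boundaries implied by those two elicited numbers.
print(belief.inv_cdf(1 / 3), belief.inv_cdf(2 / 3))
```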
Notice that I worked from the outside in: when people try to come up with a central estimate and then imagine variance around that central estimate, like in Yvain’s elicitation, they do significantly worse than if guided by a well-designed process. (You can see an example of an expert elicitation process here.)
Once you’ve done this, you’ve got more detailed bins, and you can evaluate the bin populations. (“Hm, I only have 10% in my lower tertile; I ought to adjust my estimates downwards.”)
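A toy version of that bin check, with made-up boundaries and answers (a real version would use your actual elicitation history):

```python
# Hypothetical history: the tertile boundaries I elicited for each
# question, plus the true answer, all invented for illustration.
history = [
    {"tertiles": (1863, 1891), "truth": 1859},
    {"tertiles": (1910, 1925), "truth": 1931},
    {"tertiles": (1492, 1510), "truth": 1503},
]

counts = [0, 0, 0]
for item in history:
    low, high = item["tertiles"]
    if item["truth"] < low:
        counts[0] += 1      # landed in my lower tertile
    elif item["truth"] <= high:
        counts[1] += 1      # middle tertile
    else:
        counts[2] += 1      # upper tertile

# If I'm calibrated, each bin should hold roughly a third of the answers.
print([count / len(history) for count in counts])
```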
People often fit distributions based on elicited values, but they’ll talk a lot with the experts about shape, to make sure it fits the expert’s beliefs. (They tend to use things a lot more sophisticated than uniforms, generally chosen so that Bayesian updates are convenient.) I don’t think I’ve seen much of that in the domain of calibration, though.
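For instance, here’s a sketch of what such a fit might look like for a probability-valued quantity, using a Beta family because it’s conjugate to binomial data; the target quantiles and the least-squares fitting choice are mine, not any particular elicitation package’s:

```python
from scipy import optimize, stats

# Invented elicitation for a probability-valued quantity: a 5% chance
# it's below 0.2 and a 5% chance it's above 0.6.
targets = {0.05: 0.2, 0.95: 0.6}

# Fit a Beta distribution (conjugate to binomial data, so later Bayesian
# updates stay simple) by matching those elicited quantiles.
def quantile_gap(params):
    a, b = params
    return sum((stats.beta.ppf(p, a, b) - x) ** 2 for p, x in targets.items())

fit = optimize.minimize(quantile_gap, x0=[2.0, 2.0],
                        bounds=[(0.01, None), (0.01, None)])
a, b = fit.x
print(a, b, [stats.beta.ppf(p, a, b) for p in (0.05, 0.5, 0.95)])
```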
[edit] You could use that fitting procedure to produce a more precise estimate of your p, and then use that in your proper scoring rule to determine your score in negentropy, so this could be useful for calibration. While I think this could increase the precision of your calibration measurement, I don’t know if it would actually improve its accuracy. When doing statistics, it’s hard to make up for a lack of data with clever techniques.
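Continuing the earlier sketch, scoring the fitted density at the true value would look something like this (numbers again invented):

```python
import math
from statistics import NormalDist

# Continuing the earlier sketch: a belief fitted from elicited quantiles
# and a made-up true date.
belief = NormalDist(mu=1880, sigma=18.2)
truth = 1859

# Log of the fitted density at the truth: a proper score for the whole
# distribution, rather than a binary in/out-of-interval check.
print(math.log(belief.pdf(truth)))
```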
Thanks for that link, and for pointing out the technique which seems like a good hack. (In the nice sense of the word.)