Yes, I see—it seems like there are two ways to do this exercise.
1) Everybody writes their own predictions and arranges them into probability bins (either artificially after coming up with them, or just writing 5 at 60%, 5 at 70%, etc.) You then check your calibration with a graph like Scott Alexander’s.
2) Everybody writes their estimates for the same set of predictions (maybe you generate 50 as a group), and everyone writes down their most likely outcome and how confident they are in it. You then check your Brier score.
Both of these seem useful for different things. In 2), you get a raw measure of how good you are at making accurate guesses. Lower confidence levels make your score worse. In 1), you're looking at calibration across probabilities: there are always going to be things you're only 50% or 70% sure about, and making those intervals reflect reality is as important as the things you're 95% certain of.
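A minimal sketch of the two checks, assuming made-up predictions and binary outcomes (none of the numbers below come from the original post):

```python
from collections import defaultdict

# Hypothetical predictions for illustration: (stated probability, did it happen?)
predictions = [
    (0.6, True), (0.6, False), (0.6, True), (0.6, True), (0.6, False),
    (0.7, True), (0.7, True), (0.7, False), (0.7, True), (0.7, True),
    (0.95, True), (0.95, True), (0.95, True), (0.95, False),
]

# Exercise 2: Brier score, i.e. the mean squared error between the stated
# probability and the 0/1 outcome. Lower is better; 0 is perfect.
brier = sum((p - (1.0 if hit else 0.0)) ** 2 for p, hit in predictions) / len(predictions)
print(f"Brier score: {brier:.3f}")

# Exercise 1: calibration, i.e. within each probability bin, compare the
# stated confidence to how often the event actually happened.
bins = defaultdict(list)
for p, hit in predictions:
    bins[p].append(hit)

for p in sorted(bins):
    hits = bins[p]
    print(f"said {p:.0%}: happened {sum(hits)}/{len(hits)} = {sum(hits) / len(hits):.0%}")
```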
I will edit the original post (in a bit) to reflect this.
Right, the two measures are calibration and accuracy. But calibration is part of accuracy.
"Lower confidence levels make your score worse"
Only if you guessed right. If you guessed wrong, lower confidence makes your score better. Under a “proper” scoring rule like Brier, you get the best possible score by honestly describing your uncertainty. Thus calibration — whether your 70% really happens 70% of the time — is a component of Brier score. If you improve your calibration, your Brier score will improve.
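To make the "proper scoring rule" point concrete, here is a small sketch with an invented true frequency of 70%: the expected Brier penalty is smallest when you report 0.7, not when you round up to certainty or hedge down to 50%.

```python
# Expected Brier penalty for reporting probability r when the event's true
# frequency is q. Expected loss = q*(r-1)^2 + (1-q)*r^2, minimized at r = q.
q = 0.7  # true frequency, assumed for illustration

for r in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    expected_loss = q * (r - 1) ** 2 + (1 - q) * r ** 2
    print(f"report {r:.1f}: expected Brier loss = {expected_loss:.3f}")
# Reporting 0.7 gives the minimum (0.210), so honest uncertainty scores best.
```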
I think one should work on calibration before working on accuracy. It's mainly about knowing what 70% really means. Also, you can judge calibration on any set of questions, so you can tell that you are improving, while it is hard to compare Brier scores across question sets; all you can really do is compete with other people (or algorithms) on the same questions. Some questions are harder than others, which means you will get worse Brier scores on them. But that doesn't mean you can't be calibrated on hard questions; it just means you should be less confident.
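As a rough illustration of that last point (the 95% and 60% figures are invented): a perfectly calibrated forecaster still gets a worse Brier score on a hard question set, simply because the honest probabilities sit closer to 50%.

```python
# Expected Brier score for a perfectly calibrated forecaster, by question difficulty.
def expected_brier_when_calibrated(p):
    # If the forecaster says p and the event really happens with frequency p,
    # the expected squared error is p*(p-1)^2 + (1-p)*p^2 = p*(1-p).
    return p * (1 - p)

print("easy set (95% questions):", expected_brier_when_calibrated(0.95))  # 0.0475
print("hard set (60% questions):", expected_brier_when_calibrated(0.60))  # 0.24
```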