50% predictions can be useful if you are systematic about which option you count as “yes”. e.g., “I estimate a 50% chance that I will finish writing my book this year” is a meaningful prediction. If I am subject to standard biases, then we would expect this to have less than a 50% chance of happening, so the outcomes of predictions like this provide a meaningful test of my prediction ability.
2 conventions you could use for 50% predictions: 1) pose the question such that “yes” means an event happened and “no” is the default, or 2) pose the question such that “yes” is your preferred outcome and “no” is the less desirable outcome.
Actually, it is probably better to pick one of these conventions and use it for all predictions (so you’d use the whole range from 0–100, rather than just the top half of 50–100). “70% chance I will finish my book” is meaningfully different from “70% chance I will not finish my book”; we are throwing away information about possible miscalibration by treating them both merely as 70% predictions.
Even better, you could pose the question however you like and also note when you make your prediction 1) which outcome (if either) is an event rather than the default and 2) which outcome (if either) you prefer. Then at the end of the year you could look at 3 graphs, one which looks at whether the outcome that you considered more likely occurred, one that looks at whether the (non-default) event occurred, and one which looks at whether your preferred outcome occurred.
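If it helps to see what this bookkeeping might look like, here is a minimal sketch, assuming each prediction is recorded with its stated probability, which side (if either) is the non-default event, which side (if either) is preferred, and what eventually happened; the record layout and names are mine, purely for illustration:

```python
# A hypothetical record format for annotated predictions (illustration only).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Prediction:
    p: float                          # stated probability that "yes" happens
    yes_is_event: Optional[bool]      # True if "yes" is the non-default event, False if "no" is, None if neither
    yes_is_preferred: Optional[bool]  # True if "yes" is preferred, False if "no" is, None if indifferent
    came_true: bool                   # filled in at the end of the year

def fraction(hits):
    return sum(hits) / len(hits) if hits else float("nan")

def year_end_report(preds):
    # 1) Did the side you called more likely occur? (Only meaningful away from 50%.)
    more_likely = [pr.came_true if pr.p > 0.5 else not pr.came_true
                   for pr in preds if pr.p != 0.5]
    # 2) Did the non-default event occur, among questions where one side is an event?
    event = [pr.came_true if pr.yes_is_event else not pr.came_true
             for pr in preds if pr.yes_is_event is not None]
    # 3) Did the preferred outcome occur, among questions where you had a preference?
    preferred = [pr.came_true if pr.yes_is_preferred else not pr.came_true
                 for pr in preds if pr.yes_is_preferred is not None]
    return {"more-likely side happened": fraction(more_likely),
            "event happened": fraction(event),
            "preferred outcome happened": fraction(preferred)}
```

Each fraction corresponds to one of the three graphs: the first bears on calibration, the second on bias about non-default events, and the third on optimism bias.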
Sorry, I misread your comment originally. You were careful to say that you were talking about 3 different biases, while most people say that there is a right way to orient each question.
But you weren’t careful to say that calibration — the measure of over- and under-confidence — is different from bias. There are four questions here. Introducing new questions that make sense at 50% is irrelevant to the fact that calibration doesn’t make sense at 50%. If we are just doing calibration, some of our tests are wasted. If we add a test of a bias, that part of the calibration test is still wasted. If we force the bin away from 50%, then that improves the calibration test. Moreover, I don’t think that it harms the test of bias.
Ideally, we would look at everything, but is it worth the effort? If we start with one thing, what is most important? I think that overconfidence is the biggest problem and one should start there. In some sense the annotations you suggest are not much more work, but small increments of effort can make the difference between doing the exercise and not doing it, so I think they matter.
(While most people are overconfident and calibration exercises are mainly about reducing overconfidence, the problem of 50% is actually a problem of underconfidence.)
Most questions don’t have a preferred direction. Look at Scott’s predictions. Which direction should you point each one?
Most people don’t make enough predictions to get a statistically significant difference between the two sides of the scale. And even if they do, their bias to the extremes (“overconfidence”) swamps the effect.
Just looking at the 50% questions, here is how I would score 1) whether either direction is an event rather than the default and 2) whether either direction is probably preferred by Scott:
US unemployment to be lower at end of year than beginning: 50%
Neither direction is an event, Yes is preferred.
SpaceX successfully launches a reused rocket: 50%
Yes is an event, Yes is preferred.
California’s drought not officially declared over: 50%
No is an event, No is preferred.
At least one SSC post > 100,000 hits: 50%
Yes is an event, Yes is preferred.
UNSONG will get > 1,000,000 hits: 50%
Yes is an event, Yes is preferred.
UNSONG will not miss any updates: 50%
No is an event, Yes is preferred.
I will be involved in at least one published/accepted-to-publish research paper by the end of 2016: 50%
Yes is an event, Yes is preferred.
[Over] 10,000 Twitter followers by end of this year: 50%
Yes is an event, Yes is preferred.
I will not get any new girlfriends: 50%
No is an event, perhaps No is preferred.
I will score 95th percentile or above in next year’s PRITE: 50%
Yes is an event, Yes is preferred.
I will not have any inpatient rotations: 50%
No is an event, perhaps Yes is preferred.
I get at least one article published on a major site like Huffington Post or Vox or New Statesman or something: 50%
Yes is an event, Yes is preferred.
I don’t attend any weddings this year: 50%
No is an event, perhaps No is preferred.
Scott would know better than I do, and he also could have marked a subset that he actually cared about.
Including the “perhaps”es, I count that 7/12 happened in the preferred direction, and 5/11 of the events happened. With this small sample there’s no sign of optimism bias, and he’s also well-calibrated on whether a non-default event will happen. Obviously you’d want to do this with the full set of questions and not just the 50% ones to get a more meaningful sample size.
US unemployment to be lower at end of year than beginning: 50%
Neither direction is an event
Well, to be pedantic, if US unemployment were exactly the same at the end of the year as at the beginning, the prediction as Scott worded it would be false, so it could be argued that Yes is an event. (But the same would apply if he had written “higher” instead of “lower”.)
I would imagine that at the 50% level, you can put down a prediction in either the positive or the negative phrasing, and since the phrasing is fixed at the beginning of the year (i.e., you won’t be rephrasing it six months in), you should expect 50% of them to come true either way. Right?
(50% predictions are meaningless for calculating Brier scores, but seem valuable for general calibration levels. I suppose forcing them to 45/55% so that you can incorporate them in Brier scores / etc isn’t a bad idea. I’m not much of a statistician. Is that what you were saying, Douglas_Knight?)
The 99%/97% thing is true in that the implied error rate jumps from 1% to 3%, three times as high, but it seems practically less necessary in that A) if you’re making fewer than 30 predictions at that interval, you shouldn’t expect any of them to turn out wrong, and B) I have a hard time mentally distinguishing 97% and 99% chances, and would expect other people to be similarly bad at it (unless they practiced or did some rigorous evaluation of the evidence). I’m not sure how much credence I should lend to this.
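A rough back-of-the-envelope check on point A), taking 30 predictions as the batch size (my own illustration, not a figure from the thread):

```python
# How different are 97% and 99% in practice, over a batch of 30 predictions?
n = 30
for c in (0.97, 0.99):
    print(f"at {c:.0%}: error rate {1 - c:.0%}, expected misses out of {n}: {n * (1 - c):.1f}")
# at 97%: error rate 3%, expected misses out of 30: 0.9
# at 99%: error rate 1%, expected misses out of 30: 0.3
```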
You seem to mix up calibration and Brier scores. Your first paragraph is correct. That is calibration. That is why 50/50 items are not useful for calibration. If you get less than 90% of your 90% items correct, you are a normal overconfident person. If your 50/50 items are not 50% correct, something odd is going on, like you are abnormally biased by the way questions are phrased.
Brier scores allow any input. 50% is a useful prediction for Brier scores. If you say that the French incumbent has a 50% chance of winning the election, that doesn’t affect your calibration, but it is bad for your Brier score.
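To make the distinction concrete, here is a minimal sketch of the two measures, assuming binary outcomes, the standard Brier score mean((p - outcome)^2), and 10%-wide calibration bins; the names and bin width are my own choices:

```python
from collections import defaultdict

def brier_score(preds):
    """Mean squared error between stated probability p and outcome o (0 or 1).
    Lower is better; a 50% prediction contributes 0.25 whatever happens."""
    return sum((p - o) ** 2 for p, o in preds) / len(preds)

def calibration_table(preds, bin_width=0.1):
    """Observed frequency of the event within each probability bin.
    Well-calibrated means each bin's frequency tracks its probability."""
    bins = defaultdict(list)
    for p, o in preds:
        bins[round(p / bin_width) * bin_width].append(o)
    return {round(b, 2): sum(os) / len(os) for b, os in sorted(bins.items())}
```

On this accounting, a 50% prediction always adds 0.25 to the Brier average, but its calibration row says nothing about over- or under-confidence, since flipping the question’s phrasing maps the prediction onto itself.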
Yes, I see—it seems like there are two ways to do this exercise.
1) Everybody writes their own predictions and arranges them into probability bins (either artificially after coming up with them, or just writing 5 at 60%, 5 at 70%, etc.). You then check your calibration with a graph like Scott Alexander’s.
2) Everybody writes their estimations for the same set of predictions—maybe you generate 50 as a group, and everyone writes down their most likely outcome and how confident they are in it. You then check your Brier score.
Both of these seem useful for different things—in 2), it’s a sort of raw measure of how good at making accurate guesses you are. Lower confidence levels make your score worse. In 1), you’re looking at calibration across probabilities—there are always going to be things you’re only 50% or 70% sure about, and making those intervals reflect reality is as important as things you’re 95% certain on.
I will edit the original post (in a bit) to reflect this.
Right, the two measures are calibration and accuracy. But calibration is part of accuracy.
Lower confidence levels make your score worse
Only if you guessed right. If you guessed wrong, lower confidence makes your score better. Under a “proper” scoring rule like Brier, you get the best expected score by honestly reporting your uncertainty. Thus calibration — whether your 70% really happens 70% of the time — is a component of the Brier score. If you improve your calibration, your Brier score will improve.
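A small worked check of that “proper” property, using the Brier loss (p - outcome)^2 and 0.7 as an example true probability (the numbers are only illustrative):

```python
# If the true chance of the event is 0.7, which reported probability minimizes
# your expected Brier loss? Under a proper scoring rule, honesty should win.
def expected_brier_loss(reported, true_p):
    # The outcome is 1 with probability true_p, otherwise 0.
    return true_p * (reported - 1) ** 2 + (1 - true_p) * reported ** 2

loss, best = min((expected_brier_loss(r / 100, 0.7), r / 100) for r in range(101))
print(best, round(loss, 3))   # 0.7 0.21 -- reporting the true probability does best in expectation

# And if the event does not happen, the less confident guess loses less:
print(round(0.9 ** 2, 2), round(0.6 ** 2, 2))   # 0.81 0.36
```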
I think one should work on calibration before working on accuracy. It’s mainly about knowing what 70% really means. Also, you can judge calibration on any set of questions, so you can tell that you are improving, whereas it is hard to compare Brier scores across different sets of questions. All you can do there is compete with other people (or algorithms). Some questions are harder than others, and that means that you must get worse Brier scores on them. But that doesn’t mean that you will not be calibrated on hard questions; it just means that you should be less confident.