Point taken about forecast updating—information changing that drastically may be merely worthless noise.
However, on the coin toss/blackjack thing...
In your blackjack example, the answer you give is binary—Bob will either say “hit me” or “[whatever the opposite is, I’ve never played].” The meteorologists are giving answers in terms of probabilities: “there is a 70% chance that it will rain.”
Suppose you did the same in the blackjack example: you said “I rate it as 65% likely that Bob will take another card,” and then he DIDN’T take another card. That would not mean you were bad at predicting; we would have to watch you for longer.
My complaint is that the author interpreted forecasters’ probabilities as certainties, rounding them up to 1 or down to 0. That is unfair, because it ignores their self-stated levels of confidence.
Sorry, I didn’t communicate clearly.
Correct. However, suppose we repeat this experiment 100 times, each time reducing my probability estimate to a binary hit-or-stay prediction. Suppose that Bob hits 60 times, 50 of which were on occasions when I assigned greater than 50% probability to Bob hitting, and Bob stays 40 times, 13 of which were on occasions when I assigned less than 50% probability to Bob hitting. Thus, my overall accuracy, when reduced to a hit-or-stay prediction, is 63% (50 + 13 correct calls out of 100). This is worse than my claimed certainty level of 65%, but better than the naive predictor “Bob always hits,” which only got 60% of the episodes right. Thus, the pass-fail test is one way of distinguishing my predictive abilities from the predictive abilities of a broad generalization.
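For anyone who wants to check the arithmetic, here is a minimal Python sketch of the pass-fail test using the counts from this example. The function names are just ones I made up for illustration.

```python
def binary_accuracy(hits_called, stays_called, hits_total, stays_total):
    """Fraction of episodes where the thresholded hit/stay call matched Bob."""
    return (hits_called + stays_called) / (hits_total + stays_total)


def always_hit_accuracy(hits_total, stays_total):
    """Accuracy of the naive predictor that says "Bob always hits"."""
    return hits_total / (hits_total + stays_total)


# Counts from the example: Bob hits 60 times (50 called correctly),
# and stays 40 times (13 called correctly).
mine = binary_accuracy(hits_called=50, stays_called=13, hits_total=60, stays_total=40)
naive = always_hit_accuracy(hits_total=60, stays_total=40)

print(f"mine={mine:.2f} naive={naive:.2f}")  # mine=0.63 naive=0.60
```

The only comparison that matters here is `mine` against `naive`: the thresholded predictions clear the constant guess by three percentage points.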
To see this, suppose instead that I always predict, with 65% certainty, either that Bob will hit or that Bob will stay. That is, I rate the chance of Bob hitting at either 65% or 35%. In this experiment, Bob hits 75 times, 50 of which were on occasions when I assigned a 65% probability that Bob would hit. Bob stays 25 times, 18 of which were on occasions when I assigned a 65% probability that Bob would stay. I correctly predicted Bob’s action 68% of the time (50 + 18 correct calls out of 100), which is better than my stated certainty of 65%. However, my accuracy is worse than the accuracy of the naive predictor “Bob always hits,” which would have scored 75%. Thus, my predictions are not very good, by one relatively objective benchmark, despite the fact that they are, in a narrow Bayesian sense, fairly well-calibrated.
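Here is a rough Python sketch that puts the two comparisons for this experiment side by side: accuracy against stated confidence (the narrow calibration check) and accuracy against the "Bob always hits" baseline. The `Experiment` class and its field names are hypothetical, invented only to hold the counts above.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    hits: int                 # episodes where Bob hit
    hits_called: int          # of those, episodes where I favored "hit"
    stays: int                # episodes where Bob stayed
    stays_called: int         # of those, episodes where I favored "stay"
    stated_confidence: float  # the certainty I attached to every call

    @property
    def episodes(self) -> int:
        return self.hits + self.stays

    @property
    def accuracy(self) -> float:
        return (self.hits_called + self.stays_called) / self.episodes

    @property
    def always_hit_baseline(self) -> float:
        # Accuracy of the naive predictor "Bob always hits".
        return self.hits / self.episodes

    @property
    def calibration_gap(self) -> float:
        # Positive means I was right more often than I claimed.
        return self.accuracy - self.stated_confidence

    @property
    def skill_over_baseline(self) -> float:
        # Negative means the naive predictor beat me.
        return self.accuracy - self.always_hit_baseline


second = Experiment(hits=75, hits_called=50, stays=25, stays_called=18,
                    stated_confidence=0.65)

print(f"accuracy={second.accuracy:.2f}")                        # accuracy=0.68
print(f"calibration gap={second.calibration_gap:+.2f}")         # calibration gap=+0.03
print(f"skill over baseline={second.skill_over_baseline:+.2f}") # skill over baseline=-0.07
```

A small positive gap next to a negative skill number is the situation I am describing: the stated confidence roughly matches the hit rate, yet the constant guess still wins.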
Again, sorry for the confusion. I gave an incomplete example before.
So if I understand correctly, the issue is not that the meteorologists are poorly calibrated (maybe they are, maybe they aren’t), but rather that their predictions are less useful than a simple rule like “it never rains” for actually predicting whether it will rain or not.
I think I am beginning to see the light here. Basically, in this scenario you are too ignorant of the phenomenon itself, even though you are very good at quantifying your epistemic state with respect to the phenomenon? If this is more or less right, is there terminology that might help me get a better handle on this?
Bingo! That’s exactly what I was trying to say. Thanks for listening. :-)
My jargon mostly comes from political science. We’d say the meteorologists are using an overly complicated model, or seizing on spurious correlations, or that they have a low pseudo-R-squared. I’m not sure any of those are helpful. Personally, I think your words—the meteorologists are too ignorant for us to applaud their calibration—are more elegant.
The only other thing I would add is that it doesn’t make sense to applaud the meteorologists’ guess-level calibration when they have such poor model-level calibration. In other words, while their confidence about any given guess seems accurate, their implicit confidence about the accuracy of their model as a whole is too high. If your (complex) model does not beat a naive predictor, social science (and, frankly, Occam’s Razor) says you ought to abandon it in favor of a simpler model. By sticking to their complex models in the face of weak predictive power, the meteorologists suggest that either (1) they don’t know or care about Occam’s Razor, or (2) they actually think their model has strong predictive power.
Here’s a really crude indicator of improvement in weather forecasting: I can remember when jokes about forecasts being wrong were a cliche. I haven’t heard a joke about weather forecasts for years, probably decades, which suggests that forecasts have actually gotten fairly good, even if they’re not as accurate as the probabilities in the forecasts suggest.
Does anyone remember when weatherman jokes went away?
Can we conclude that the cliche’s declining prevalence is related to the quality of weather forecasting? All else being equal, I expect a culture to develop a resistance to any given cliche over time. For example, the cliche “It’s not you, it’s me” has dropped in use and been somewhat relegated to a ‘second-order cliche’. But it is true now at least as often as it was in the past.
A fair point, though if a cliche has lasted for a very long time, I think it’s more plausible that its end is about changed conditions rather than boredom.
Gotcha. Thanks for the explanation, it’s been very clarifying. =)