In measuring precipitation accuracy, the study assumed that if a forecaster predicted a 50 percent or higher chance of precipitation, they were saying it was more likely to rain than not. Less than 50 percent meant it was more likely to not rain.
That prediction was then compared to whether or not it actually did rain...
Isn’t something wrong here? If you say “60% chance of rain,” and it doesn’t rain, you are not necessarily a bad forecaster. Not unless it actually rained on less (or more!) than 60% of those occasions. It should rain on ~60% of occasions on which you say “60% chance of rain.”
Am I just confused about this fellow’s methodology?
If I’m reading this correctly they are doing exactly what you want but only breaking into two categories “more likely to rain than not” and “less likely to rain than not.” But I’m confused by the fact that 50 percent gets into the expecting rain category.
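For concreteness, here is a quick sketch of the two scoring approaches side by side: the article's binarize-at-50% hit/miss rule versus a bucket-by-bucket calibration check. The forecasts and outcomes below are invented purely for illustration.

```python
from collections import defaultdict

# Invented probability-of-precipitation (POP) forecasts and outcomes,
# purely to illustrate the two scoring rules.
forecasts = [0.1, 0.6, 0.6, 0.6, 0.6, 0.6, 0.2, 0.9, 0.3, 0.7]
rained    = [False, True, False, True, True, False, False, True, False, True]

# 1) The article's rule: POP >= 50% counts as "predicted rain"; score hit or miss.
hits = sum((p >= 0.5) == r for p, r in zip(forecasts, rained))
print(f"binarized hit rate: {hits / len(forecasts):.0%}")

# 2) A calibration check: among days with a given stated POP, how often did it rain?
by_pop = defaultdict(list)
for p, r in zip(forecasts, rained):
    by_pop[p].append(r)
for p in sorted(by_pop):
    obs = sum(by_pop[p]) / len(by_pop[p])
    print(f"stated {p:.0%}: rained {obs:.0%} of the time (n={len(by_pop[p])})")
```

On real data you would want many days per bucket, but the point stands: rule (1) throws away the stated probability, while rule (2) is what actually tests it.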
Okay, this is like a sore tooth. Somebody’s wrong, and I don’t know if it’s me. A queasy feeling.
Listen to this though:
The prize for the single most inconsistent forecast goes to Channel 5’s Devon Lucie who on Sunday, September 30th predicted a high temperature of 53 degrees for October 7th, and seven days later changed it to 84 degrees — a difference of 31 degrees! It turned out to be 81 that day.
A close second was Channel 4’s Mike Thompson’s initial prediction of 83 for October 15th, which he changed to 53 just two days later. It turned out to be 64 on the 15th.
Uhhh… it’s remarkable that a forecast changed significantly in SEVEN DAYS? What?!
The weather is the canonical example of mathematical chaos in an (in principle) deterministic system. Of course the forecasts will change, because Tuesday’s weather sets the initial conditions for Wednesday, and chaotic systems are ultra-sensitive to initial conditions! The forecasters would be idiots if they didn’t update their forecasts as much as possible.
The “close second,” moreover, should be first! That change occurred over a two-day period versus seven! ARGGHHH.
To me it almost seems as though a scenario like this is happening:
You: “What is the chance that a 6-sided die will turn up one of the numbers 1-4?”
Me: “2/3, of course.”
You: “I will take that to mean that it is more likely than not to be 1-4 - thus I will count 1-4 as a hit and 5 or 6 as a miss.”
Me: “Um… okay(?)”
You: rolls a die 100 times. “Oh, it seems you were correct only about 66% of the time. You’re not very good at this, are you?”
In other words, isn’t the author misrepresenting the forecasters in throwing away their POPs, which could be interpreted as subjective beliefs about likelihoods?
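To check the die scenario numerically, here is a quick simulation. A forecaster who always says “2/3 chance of 1-4” is perfectly calibrated, yet the hit/miss rule caps their score at about 66%.

```python
import random

random.seed(0)
N = 100_000

# The forecaster always assigns probability 2/3 to "the die shows 1-4" and is
# perfectly calibrated, since a fair die lands on 1-4 two-thirds of the time.
# The hit/miss rule rounds that 2/3 up to a flat prediction of "1-4".
hits = sum(random.randint(1, 6) <= 4 for _ in range(N))
print(f"hit/miss score: {hits / N:.1%}   (stated probability: 66.7%)")
```

The ~66% score reflects the die, not any flaw in the forecaster; the same goes for a well-calibrated 60% rain forecast that gets marked a “miss” whenever it stays dry.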
I was also sort of confused by:
Have you ever noticed that the prediction for a particular day keeps changing from day to day, sometimes by quite a bit? The graph above shows how much the different stations change their minds about their own forecasts over a seven-day period.
On average, N.O.A.A. is the most consistent, but even they change their mind by more than six degrees and 23 percent likelihood of precipitation over a seven-day span.
The Kansas City television meteorologists will change their mind from 6.8 to nearly nine degrees in temperature and 30 percent to 57 percent in precipitation, showing a distinct lack of confidence in their initial predictions as time goes on.
Is changing the forecast as new information comes in a bad thing?? Or is it merely that they are changing the forecast too much?
Nota bene: I am also very tired and may just be being thickheaded—I rate that possibility at about 50%, and you’re welcome to check my calibration. =)
Related thought: Maybe see if they will give you their data? That would save you some time, and I’m now very interested in whether a more careful analysis will substantially disagree with their results.
Oh. I see. Yes, they aren’t taking the forecasters’ stated probabilities into account at all. Your criticism seems correct. Your complaints about the other aspects seem accurate also.
Huh. This is disturbing; most of the Freakonomics blog entries I’ve read have good analysis of data. It looks like this one really screwed the pooch. I have to wonder if others they’ve done have similar problems that I haven’t noticed.
Yeah, I am a fan of Freakonomics generally too. I will write to them, I think. Will let you know how it goes. I want to confirm I am right about the probability stuff though, I still have a niggling doubt that I’ve just misunderstood something. But I think they are definitely wrong about the forecast updating.
Is changing the forecast as new information comes in a bad thing?? Or is it merely that they are changing the forecast too much?
I think the criticism is that if they need to change their predictions so much between time 1 and time 2, then it is irresponsible to make any prediction at time 1. This is a hard case to make out for the temperature swings, since I think 8 degrees is only about one standard deviation for a prediction of a day’s temperature in a city knowing only what day of the year it is, but it’s an easy case to make out for the precipitation swings: if, on average, you are wrong by 40% objective probability (not even 40% error; 40% chance of rain, here), then a prediction of, e.g., 30% will on average convey virtually no information; that could easily mean 0% or it could easily mean 70%, and without too much implausibility it could even mean 90% -- so why bother saying 30% at all when you could (more honestly) admit your ignorance about whether it will rain next week?
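To put rough numbers on the “virtually no information” point, here is a toy simulation. The error model is invented: treat the day-before POP as the “truth” and let the week-out POP miss it by roughly 40 points on average.

```python
import random

random.seed(0)

# Toy model, invented for illustration: the final (day-before) POP is uniform
# on [0, 100]; the week-out forecast is the final POP plus Gaussian noise with
# a standard deviation of 50 points (roughly 40 points of average error),
# clipped to the valid range.
finals, week_out = [], []
for _ in range(200_000):
    final = random.uniform(0, 100)
    early = min(100.0, max(0.0, final + random.gauss(0, 50)))
    finals.append(final)
    week_out.append(early)

# Given a week-out forecast near 30%, where does the final POP actually land?
near_30 = sorted(f for f, e in zip(finals, week_out) if 25 <= e <= 35)
lo, hi = near_30[len(near_30) // 10], near_30[9 * len(near_30) // 10]
print(f"80% of the time the final POP lands anywhere from {lo:.0f}% to {hi:.0f}%")
```

Under those (made-up) assumptions, “30% a week out” is compatible with almost any eventual forecast, which is the sense in which it conveys almost nothing.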
In the meteorologists’ defense, their medium-range predictions become useful when tested against broader time periods. Specifically, a 60% chance of rain on Thursday means you can be pretty sure that it will rain on Wednesday, Thursday, or Friday—perhaps with 90% confidence. The reason for this is that predictions of rain generally come from tracking low-pressure pockets of air as they sweep across the continent; these pockets might speed up or slow down, or alter their course by a few degrees, but they rarely disappear or turn around altogether.
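A toy decomposition of that defense, with numbers invented to match the 60%-Thursday example: if the system almost certainly brings rain to the window and Thursday is merely its most likely arrival day, the point forecast and the window forecast come apart exactly as described.

```python
# Invented numbers chosen to reproduce the 60%-Thursday / 90%-window example.
p_system_rains     = 0.9      # the low-pressure system drops rain somewhere in Wed-Fri
p_thursday_if_rain = 2 / 3    # if it does, Thursday is the most likely day

print("P(rain on Thursday)      =", p_system_rains * p_thursday_if_rain)  # 0.6
print("P(rain Wed, Thu, or Fri) =", p_system_rains)                       # 0.9
```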
You: “I will take that to mean that it is more likely than not to be 1-4 - thus I will count 1-4 as a hit and 5 or 6 as a miss.”
This is a much more reasonable testing method when one’s predictions are based on an alleged causal process. For example, suppose I claim that I can predict how many cards Bob will draw in a game of blackjack by taking into consideration all of the variables in the game. A totally naive predictor might be “Bob will hit no matter what.” That predictor might be right about 60% of the time. A slightly better predictor might be “Bob will hit if his cards show a total of 13 or less.” That predictor might be right about 70% of the time. If I, as a skilled blackjack kibitzer, can really add predictive value to these simple predictors, then I should be able to beat their hit-miss ratio, maybe getting Bob’s decision right 75% of the time. If I knew Bob quite well and could read his tells, maybe I would go up to 90%.
Anyway, 66% is pretty good for a blind guess that can’t be varied from episode to episode. So the test with the die that you’re using in your analogy is a fair test, but the bar is set too high. If you can get 66% on a hit-miss test with a one-sentence rule, you’re doing pretty well.
Point taken about forecast updating—information changing that drastically may be merely worthless noise.
However, on the die-roll/blackjack thing...
In your blackjack example, the answer you give is binary—Bob will either say “hit me” or “[whatever the opposite is, I’ve never played].” The meteorologists are giving answers in terms of probabilities: “there is a 70% chance that it will rain.”
If you did that in the Blackjack example; i.e., you said “I rate it as 65% likely that Bob will take another card,” and then he DIDN’T take another card, that would not mean you were bad at predicting—we would have to watch you for longer.
My complaint is that the author interpreted forecasters’ probabilities as certainties, rounding them up to 1 or down to 0. This was unfair as it ignored their self-stated levels of confidence.
If you did that in the Blackjack example; i.e., you said “I rate it as 65% likely that Bob will take another card,” and then he DIDN’T take another card, that would not mean you were bad at predicting—we would have to watch you for longer.
Correct. However, suppose we repeat this experiment 100 times, each time reducing my probability estimate to a binary prediction of hit-stay. Suppose that Bob hits 60 times, 50 of which were on occasions when I assigned greater than 50% probability to Bob hitting, and Bob stays 40 times, 13 of which were on occasions when I assigned less than 50% probability to Bob hitting. Thus, my overall accuracy, when reduced to a hit-stay prediction, is 63%. This is worse than my claimed certainty level of 65%, but better than the naive predictor “Bob always hits,” which only got 60% of the episodes right. Thus, the pass-fail test is one way of distinguishing my predictive abilities from the predictive abilities of a broad generalization.
To see this, suppose instead that I always predict, with 65% certainty, that Bob will hit or that Bob will stay. I might rate the chance of Bob hitting at 65%, or I might rate it at 35%. In this experiment, Bob hits 75 times, 50 of which were on occasions when I assigned a 65% probability that Bob would hit. Bob stays 25 times, 18 of which were on occasions when I assigned a 65% probability that Bob would stay. I correctly predicted Bob’s action 68% of the time, which is better than my stated certainty of 65%. However, my accuracy is worse than the accuracy of the naive predictor “Bob always hits,” which would have scored 75%. Thus, my predictions are not very good, by one relatively objective benchmark, despite the fact that they are, in a narrow Bayesian sense, fairly well-calibrated.
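The arithmetic in both scenarios, checked in a few lines using the counts given above:

```python
# Scenario 1: Bob hits 60 of 100 times; my binarized prediction was right on
# 50 of the hits and 13 of the stays.
scenario_1_me    = (50 + 13) / 100   # 0.63
scenario_1_naive = 60 / 100          # "Bob always hits" -> 0.60

# Scenario 2: Bob hits 75 of 100 times; every prediction was made at 65%
# confidence, and I was right on 50 hits and 18 stays.
scenario_2_me    = (50 + 18) / 100   # 0.68 -- close to the stated 65%, the
                                     # narrow sense of calibration meant above
scenario_2_naive = 75 / 100          # "Bob always hits" -> 0.75

print(scenario_1_me, scenario_1_naive)   # 0.63 0.6  -> beats the naive rule
print(scenario_2_me, scenario_2_naive)   # 0.68 0.75 -> loses to the naive rule
```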
Again, sorry for the confusion. I gave an incomplete example before.
So if I understand correctly, the issue is not that the meteorologists are poorly calibrated (maybe they are, maybe they aren’t), but rather that their predictions are less useful than a simple rule like “it never rains” for actually predicting whether it will rain or not.
However, my accuracy is worse than the accuracy of the naive predictor “Bob always hits,” which would have scored 75%. Thus, my predictions are not very good, by one relatively objective benchmark, despite the fact that they are, in a narrow Bayesian sense, fairly well-calibrated.
I think I am beginning to see the light here. Basically, in this scenario you are too ignorant of the phenomenon itself, even though you are very good at quantifying your epistemic state with respect to the phenomenon? If this is more or less right, is there terminology that might help me get a better handle on this?
Bingo! That’s exactly what I was trying to say. Thanks for listening. :-)
My jargon mostly comes from political science. We’d say the meteorologists are using an overly complicated model, or seizing on spurious correlations, or that they have a low pseudo-R-squared. I’m not sure any of those are helpful. Personally, I think your words—the meteorologists are too ignorant for us to applaud their calibration—are more elegant.
The only other thing I would add is that the reason it doesn’t make sense to applaud the meteorologists’ guess-level calibration is that they have such poor model-level calibration. In other words, while their confidence about any given guess seems accurate, their implicit confidence about the accuracy of their model as a whole is too high. If your (complex) model does not beat a naive predictor, social science (and, frankly, Occam’s Razor) says you ought to abandon it in favor of a simpler model. By sticking to their complex models in the face of weak predictive power, the meteorologists suggest that either (1) they don’t know or care about Occam’s Razor, or (2) they actually think their model has strong predictive power.
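One more toy sketch, with invented data, to tie the two notions together: a forecaster whose stated probabilities check out bucket by bucket, but who hugs the base rate so closely that on the binary did-it-rain question they do no better than the naive rule “it never rains.”

```python
import random

random.seed(1)

# Invented setup: it rains on about 25% of days, and the forecaster is
# calibrated by construction -- when they say 35%, it really rains ~35% of
# the time -- but their stated POPs never stray far from the base rate.
days   = 100_000
stated = [random.choice([0.15, 0.25, 0.35]) for _ in range(days)]
rained = [random.random() < p for p in stated]

# Guess-level calibration: observed rain frequency within each stated POP.
for p in (0.15, 0.25, 0.35):
    bucket = [r for s, r in zip(stated, rained) if s == p]
    print(f"stated {p:.0%}: rained {sum(bucket) / len(bucket):.0%} of the time")

# Model-level usefulness: binarized accuracy versus the naive "it never rains" rule.
binarized = sum((s >= 0.5) == r for s, r in zip(stated, rained)) / days
never     = sum(not r for r in rained) / days
print(f"forecaster (binarized at 50%): {binarized:.0%}   naive 'never rains': {never:.0%}")
```

Well calibrated at the level of individual guesses, yet the model as a whole adds nothing over the simpler rule.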
Here’s a really crude indicator of improvement in weather forecasting: I can remember when jokes about forecasts being wrong were a cliche. I haven’t heard a joke about weather forecasts for years, probably decades, which suggests that forecasts have actually gotten fairly good, even if they’re not as accurate as the probabilities in the forecasts suggest.
Does anyone remember when weatherman jokes went away?
Can we conclude that the drop in the cliche’s prevalence is related to the quality of weather forecasting? All else being equal, I expect a culture to develop a resistance to any given cliche over time. For example, the cliche “It’s not you, it’s me” has dropped in use and been somewhat relegated to ‘second-order cliche’, but it is true now at least as much as it has been in the past.
A fair point, though if a cliche has lasted for a very long time, I think it’s more plausible that its end is about changed conditions rather than boredom.
Note that this sort of thing has been done a bit before. See for example this analysis.
Edit: The linked analysis has a lot of problems. See the discussion in this thread.
Gotcha. Thanks for the explanation, it’s been very clarifying. =)