Both Nate Silver and Metaculus users seem to me to be in denial about this.
I think this is a strawman. Nate Silver says that his model has good calibration across its lifetime, and is in fact slightly too conservative. I agree that, if the only two things you consider are (a) the probabilities for a Biden win in 2020, 65% and 89%, and (b) the margin of the win in 2020, then betting markets are a clear winner. But how much does that matter? (And the article you linked doesn’t mention markets at all.)
My impression from Silver’s internet writings is he hasn’t admitted this, but maybe I’m wrong. I haven’t seen him admit it and his claim that “we did a good job” suggests he’s unwilling to. Betting markets are the clear winner if you look at Silver’s predictions about how wrong polls would be, too. That was always the main point of contention. The line he’s taking is “we said the polls might be this wrong and that Biden could still win”, but obviously it’s worse to say that the polls might be that wrong than to say that the polls probably would be that wrong (in that direction), as the markets implicitly did.
If it is true that the model has been slightly conservative, historically speaking, then it isn’t clear why there is anything to admit. You expect some number of unlikely events to come true; looking at the history of 538, it was about the right number, if a bit too few; now we have one more unlikely event, and the overall calibration probably improved.
So if people just ask him how good of a job he did, it seems completely reasonable to evaluate the model in terms of all past elections, and conclude that they did a good job. There’s no reason why you would think anything went wrong this time. This explains the way he’s been talking about it.
As far as I know, he hasn’t admitted this particular point. But I strongly assume no-one has asked about it. It doesn’t seem like a question that makes a lot of sense—why would you ever ignore all of the past history when you’re trying to compute calibration? It’s like taking one of Scott Alexander’s 90% bets that went wrong and asking, “do you admit that, if we only consider this particular bet, you would have done better assigning 60% instead?” The answer is yes, but asking the question is weird.
Data points come in one by one, so it’s only natural to ask how each data point affects our estimates of how well different models are doing, separately from how much we trust different models in advance. A lot of the arguments that were made by people who disagreed with Silver were Trump-specific, anyway, making the long-term record less relevant.
If we were observing the results of his bets one by one, and Scott said it was 90% likely and a lot of other people said it was 60% likely, and then it didn’t happen, I would totally be happy to say that Scott’s model took a hit.
I think that’s the root of our disagreement. In this situation, I would not concede that Scott’s model took a hit. Instead, I would claim that 90% was a better estimate than 60%, despite the prediction coming out false. (This is assuming that I already know Scott’s overall calibration, which is the case for 538.)
I think this point bottoms out at a pretty deep philosophical problem that we can’t resolve in this comment thread. (I super want to write a post about it though.)
Yes, that looks like a crux. I guess I don’t see the need to reason about calibration instead of directly about expected log score.
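To make the two framings concrete, here is a minimal sketch comparing a 90% and a 60% forecast under the log score. It is not anything either forecaster actually runs; the function names are made up for illustration, and the "true frequency" of 0.9 is an assumption standing in for "the forecaster is already known to be well calibrated".

```python
import math

def log_score(p, happened):
    """Natural-log score of forecast probability p for a binary event."""
    return math.log(p if happened else 1.0 - p)

def expected_log_score(p, q):
    """Expected log score of forecasting p when the event occurs with frequency q."""
    return q * math.log(p) + (1.0 - q) * math.log(1.0 - p)

# A single miss: the event was forecast and then did not happen.
miss_90 = log_score(0.9, happened=False)  # log(0.1)
miss_60 = log_score(0.6, happened=False)  # log(0.4)

# Assumption: events like this really do happen 90% of the time.
exp_90 = expected_log_score(0.9, 0.9)
exp_60 = expected_log_score(0.6, 0.9)

print(f"single miss: 90% -> {miss_90:.3f}, 60% -> {miss_60:.3f}")
print(f"expected:    90% -> {exp_90:.3f}, 60% -> {exp_60:.3f}")
```

On the single miss, the 60% forecast scores better. In expectation, under the assumed 0.9 frequency, the 90% forecast scores better. Which of the two numbers should drive the update is roughly the disagreement above.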
(I feel a bit weird referencing my post since it did much more poorly than I expected, but I’ll just do it anyway since I know you’ve read it.)
The way in which my post contradicts your argument is that it frames the following questions as entirely separate:

1. Did 538 make a good prediction?
2. Was the market’s prediction better than 538’s?

For the first question, we care about how much information 538’s prediction was based on and how well calibrated it was. Well, we know what kind of information it was based on (the same as every election), and evidence shows that calibration is excellent. In fact, this election made 538’s calibration look better than it did before, since it was historically conservative. (I think; I’ve heard Nate say this.) Both pictures in my post had 538 at the same place on the chart; they differed only in how well the market did. In other words, Nate did a good job regardless of what happened with the market. (And in the article you linked, he wasn’t asked about the market.)
The second question is where we compare the hypothesis that the market was being stupid (1) to the hypothesis that it was smarter than 538/had information about the polling error that 538 didn’t (2). This is where I’ll grant you the update you mentioned in your comment. (2) predicts a narrow margin in the real result, whereas (1) has significant probability mass on a Biden landslide. Since we got the narrow margin, that has to be a significant update toward the market being smart, maybe (1:4) or something. (But I made an even greater update toward the market being stupid based on its behavior on election night, so I come out updating toward the market being stupid in total, which was also my prior (that’s why I bet against it in the first place).)
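The bookkeeping in that paragraph can be sketched as a chain of odds updates. Only the 1:4 ratio comes from the comment, and reading it as stupid:smart is my interpretation. The 2:1 prior and the 8:1 election-night ratio are hypothetical placeholders, chosen only so that the second update outweighs the first.

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio in turn (Bayes' rule in odds form)."""
    odds = Fraction(prior_odds)
    for lr in likelihood_ratios:
        odds *= Fraction(lr)
    return odds

def odds_to_prob(odds):
    """Convert odds for a hypothesis into a probability."""
    return odds / (1 + odds)

# All odds are "market was stupid" : "market was smart".
prior = Fraction(2, 1)           # hypothetical prior
narrow_margin = Fraction(1, 4)   # the "(1:4) or something" update toward smart
election_night = Fraction(8, 1)  # hypothetical larger update toward stupid

posterior = update_odds(prior, [narrow_margin, election_night])
print(posterior, float(odds_to_prob(posterior)))
```

With these placeholder numbers the posterior odds end up above the prior, which is the shape of the argument: a real update toward the market being smart, outweighed by a bigger one in the other direction.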
Nate Silver’s predictions changed too much over time. If those probabilities were legit, you’d be able to sell binary options based on them. If Nate had done that, he’d have gone bankrupt, because he created a lot of arbitrage opportunities.
https://arxiv.org/pdf/1703.06351.pdf
That paper doesn’t actually justify the claim that 538’s probabilities don’t form a martingale. (In fact it’s plausible that they do; to demonstrate that they don’t, I’d want to see someone show a strategy that successfully arbitrages the probabilities.) Since 538’s model isn’t open source, it’s pretty difficult to say whether or not it is a true martingale, but that paper definitely doesn’t show it.
If we take a similar model that is open source (specifically The Economist’s model), we can see that it is not far from being a martingale. Specifically, it would be one if they added forecasting for their [fundamentals model](http://www.stat.columbia.edu/~gelman/research/published/jdm200907b.pdf) (not difficult, just painful). I don’t think the difference made by the fundamentals model is that significant, so I think it would have been fairly difficult for anyone to arbitrage those odds. (Not that they were correct, just that they were broadly time-consistent.)
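For what "show a strategy which arbitrages the probabilities" would mean in practice, here is a toy sketch. It is not 538’s or The Economist’s model: the forecast is the probability that a Gaussian random walk ends above zero, which is a martingale by construction, and the "fade the last move" strategy is just one hypothetical trading rule. On a true martingale its average profit should be close to zero; a forecast that systematically overreacted would leave a positive average.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_path(n_steps, rng, x0=0.0):
    """Forecast path p_0..p_n: probability that a unit-variance random walk ends above zero."""
    x = x0
    probs = []
    for t in range(n_steps):
        remaining = n_steps - t
        probs.append(norm_cdf(x / math.sqrt(remaining)))
        x += rng.gauss(0.0, 1.0)
    probs.append(1.0 if x > 0 else 0.0)  # resolved outcome
    return probs

def fade_profit(probs):
    """Bet against each move: hold -(p_t - p_{t-1}) contracts over the next step."""
    return sum(-(probs[t] - probs[t - 1]) * (probs[t + 1] - probs[t])
               for t in range(1, len(probs) - 1))

rng = random.Random(0)
profits = [fade_profit(simulate_path(50, rng)) for _ in range(5000)]
mean_profit = sum(profits) / len(profits)
print(f"mean profit per path: {mean_profit:+.4f}")
```

Running the same `fade_profit` on a real forecast’s daily history would be one crude version of the test asked for above, though a single election is only one path, so it could not settle much on its own.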