Data points come in one by one, so it’s only natural to ask how each data point affects our estimates of how well different models are doing, separately from how much we trust different models in advance. A lot of the arguments that were made by people who disagreed with Silver were Trump-specific, anyway, making the long-term record less relevant.
It’s like taking one of Scott Alexander’s 90% bets that went wrong and asking, “Do you admit that, if we only consider this particular bet, you would have done better assigning 60% instead?”
If we were observing the results of his bets one by one, and Scott said it was 90% likely and a lot of other people said it was 60% likely, and then it didn’t happen, I would totally be happy to say that Scott’s model took a hit.
I think that’s the root of our disagreement. In this situation, I would not concede that Scott’s model took a hit. Instead, I would claim that 90% was a better estimate than 60%, despite the prediction coming out false. (This is assuming that I already know Scott’s overall calibration, which is the case for 538.)
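A minimal sketch of this point in Python, with made-up numbers (the 45-hit/5-miss track record and the 60% alternative are purely illustrative, not Scott’s actual record): a single miss on a “90%” claim is only about 4:1 evidence that 60% was the better number, and a calibrated track record swamps that.

```python
# A single miss vs. a track record: how much should one failed "90%" claim
# shift us between "90% was the right number" and "60% was the right number"?
# All numbers below are illustrative, not taken from any real record.

def posterior_odds(prior_odds, hits, misses, p_a=0.9, p_b=0.6):
    """Odds for hypothesis A (true rate p_a) over B (true rate p_b)
    after observing `hits` successes and `misses` failures."""
    lr = (p_a / p_b) ** hits * ((1 - p_a) / (1 - p_b)) ** misses
    return prior_odds * lr

# One miss in isolation: 4:1 evidence against the 90% hypothesis.
print(posterior_odds(prior_odds=1.0, hits=0, misses=1))    # 0.25

# The same miss on top of a calibrated record (45 hits, 5 misses):
# the 90% hypothesis still wins overwhelmingly.
print(posterior_odds(prior_odds=1.0, hits=45, misses=5))   # ~8e4
print(posterior_odds(prior_odds=1.0, hits=45, misses=6))   # ~2e4, barely dented
```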
I think this point bottoms out at a pretty deep philosophical problem that we can’t resolve in this comment thread. (I super want to write a post about it though.)
(I feel a bit weird referencing my post since it did much more poorly than I expected, but I’ll just do it anyway since I know you’ve read it.)
The way in which my post contradicts your argument is that it treats two questions as entirely separate:
1. Did 538 make a good prediction?
2. Was the market’s prediction better than 538’s?
For the first question, we care about how much information 538’s prediction was based on and how well calibrated it was. We know what kind of information it was based on (the same as in every election), and the evidence shows that its calibration is excellent. In fact, this election made 538’s calibration look better than it did before, since the model was historically conservative. (I think; at least I’ve heard Nate say this.) Both of the pictures in my post had 538 at the same place on the chart; they differed only in how well the market did. In other words, Nate did a good job regardless of what happened with the market. (And in the article you linked, he wasn’t asked about the market.)
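As a rough sketch of what “well calibrated” cashes out to operationally: among all forecasts stated at around some probability p, the event should happen about p of the time. The forecast data below is fake and purely for illustration.

```python
# Calibration check sketch: among forecasts stated at around some probability,
# did the event happen about that often? The data here is made up.

def observed_frequency(forecasts, lo, hi):
    """forecasts: iterable of (stated_probability, outcome) with outcome 0 or 1.
    Returns (empirical frequency, count) for forecasts with lo <= p < hi."""
    outcomes = [o for p, o in forecasts if lo <= p < hi]
    return sum(outcomes) / len(outcomes), len(outcomes)

fake_forecasts = [(0.70, 1), (0.72, 1), (0.68, 0), (0.71, 1),
                  (0.69, 1), (0.73, 0), (0.70, 1)]

freq, n = observed_frequency(fake_forecasts, 0.65, 0.75)
print(f"stated ~70%, happened {freq:.0%} of the time over {n} forecasts")
# A well-calibrated forecaster lands near the stated probability in every band.
```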
The second question is where we compare the hypothesis that the market was being stupid (1) with the hypothesis that it was smarter than 538, i.e., that it had information about the polling error that 538 didn’t (2). This is where I’ll grant you the update you mentioned in your comment: (2) predicts a narrow margin in the real result, whereas (1) puts significant probability mass on a Biden landslide. Since we got the narrow margin, that has to be a significant update toward the market being smart, maybe 1:4 or so. (But I made an even greater update toward the market being stupid based on its behavior on election night, so in total I come out updating toward the market being stupid, which was also my prior; that’s why I bet against it in the first place.)
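In odds form, the update described above looks roughly like the sketch below. Only the ~1:4 likelihood ratio from the narrow margin comes from the comment; the prior odds and the size of the election-night update are placeholder numbers.

```python
# Odds-form Bayes: multiply prior odds by likelihood ratios as evidence comes in.
# Hypothesis S = "the market was being stupid", M = "the market was smart".
# Only the ~1:4 narrow-margin ratio is from the comment; the rest is illustrative.

def update(odds_S_over_M, likelihood_ratio_S_over_M):
    return odds_S_over_M * likelihood_ratio_S_over_M

odds = 3.0                  # placeholder prior: 3:1 in favor of "stupid"
odds = update(odds, 1 / 4)  # narrow margin: evidence toward "smart" (the ~1:4 update)
odds = update(odds, 8)      # placeholder: election-night behavior, strong evidence of "stupid"
print(odds)                 # 6.0 -> the net update still lands on "the market was being stupid"
```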
Yes, that looks like a crux. I guess I don’t see the need to reason about calibration instead of directly about expected log score.
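One way to cash out “reason directly about expected log score”: if the event really has a 90% chance, reporting 0.9 beats reporting 0.6 in expectation, even though 0.6 loses less on the occasions the event fails to happen. A minimal sketch:

```python
from math import log

# Expected log score of reporting `reported_p` for an event that
# actually happens with probability `true_p`.
def expected_log_score(true_p, reported_p):
    return true_p * log(reported_p) + (1 - true_p) * log(1 - reported_p)

print(expected_log_score(0.9, 0.9))   # ~ -0.325  (reporting the true 90%)
print(expected_log_score(0.9, 0.6))   # ~ -0.551  (hedging down to 60% is worse in expectation)
print(log(1 - 0.9), log(1 - 0.6))     # ~ -2.30 vs -0.92: on a single miss, 60% looks better
```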