Rafael Harth comments on Did anybody calculate the Briers score for per-state election forecasts?

Rafael Harth 10 Nov 2020 21:19 UTC
2 points
If it is true that the model has been slightly conservative, historically speaking, then it isn’t clear why there is anything to admit. You expect some number of unlikely events to come true; looking at the history of 538, it was about the right number, a bit to few; now we have one more unlikely event, and the overall calibration probably improved.

So if people just ask him how good of a job he did, it seems completely reasonable to evaluate the model in terms of all past elections, and conclude that they did a good job. There’s no reason why you would think anything went wrong this time. This explains the way he’s been talking about it.

As far as I know, he hasn’t admitted this particular point. But I strongly assume no-one has asked about it. It doesn’t seem like a question that makes a lot of sense—why would you ever ignore all of the past history when you’re trying to compute calibration? It’s like taking one of Scott Alexander’s 90% bets that went wrong and asking, “do you admit that, if we only consider this particular bet, you would have done better assigning 60% instead?” The answer is yes, but asking the question is weird.
- steven0461 10 Nov 2020 21:31 UTC
  6 points
  Parent
  Data points come in one by one, so it’s only natural to ask how each data point affects our estimates of how well different models are doing, separately from how much we trust different models in advance. A lot of the arguments that were made by people who disagreed with Silver were Trump-specific, anyway, making the long-term record less relevant.
  
  It’s like taking one of Scott Alexander’s 90% bets that went wrong and asking, “do you admit that, if we only consider this particular bet, you would have done better assigning 60% instead?”
  
  If we were observing the results of his bets one by one, and Scott said it was 90% likely and a lot of other people said it was 60% likely, and then it didn’t happen, I would totally be happy to say that Scott’s model took a hit.
  - Rafael Harth 10 Nov 2020 21:43 UTC
    2 points
    Parent
    
    I would totally be happy to say that Scott’s model took a hit.
    
    I think that’s the root of our disagreement. In this situation, I would not concede that Scott’s model took a hit. Instead, I would claim that 90% was a better estimate than 60%, despite the prediction coming out false. (This is assuming that I already know Scott’s overall calibration, which is the case for 538.)
    
    I think this point bottoms out at a pretty deep philosophical problem that we can’t resolve in this comment thread. (I super want to write a post about it though.)
    - steven0461 11 Nov 2020 19:19 UTC
      8 points
      Parent
      Yes, that looks like a crux. I guess I don’t see the need to reason about calibration instead of directly about expected log score.
      - Rafael Harth 14 Nov 2020 7:43 UTC
        4 points
        Parent
        (I feel a bit weird referencing my post since it did much more poorly than I expected, but I’ll just do it anyway since I know you’ve read it.)
        
        The way in which my post contradicts your argument is that it frames the questions
        
        Did 538 make a good prediction?; and
        Was the market’s prediction better than 538′s?
        
        as entirely separate. For the first question, we care about how much information 538′s prediction was based on and how well calibrated it was. Well, we know what kind of information it was based on (the same as every election), and evidence shows that calibration is excellent. In fact, this election made 538′s calibration look better than it did before since it was historically conservative. (I think—I’ve heard Nate say this.) In the two pictures I’ve had in my post, both of them had 538 at the same place on the chart. They were only different in how well the market did. In other words, Nate did a good job regardless of what happened with the market. (And in the article you linked, he wasn’t asked about the market.)
        
        The second question is where we compare the hypothesis that the market was being stupid (1) to the hypothesis that it was smarter than 538/had information about the polling error that 538 didn’t (2). This is where I’ll grant you the update you mentioned in your comment. (2) predicts a narrow margin in the real result, whereas (1) has significant probability mass on a Biden landslide. Since we got the narrow margin, that has to be a significant update toward the market being smart, maybe (1:4) or something. (But I made an even greater update toward the market being stupid based on its behavior on election night, so I come out updating toward the market being stupid in total, which was also my prior (that’s why I bet against it in the first place).)