I clearly don’t. Calibration is about accurately estimating the lack of information and translating it into probabilities.
I think that’s not getting the point. You thought the odds of a Russian intervention were less than 5%. Even at the time, with the information available, that was probably too low. At the same time, I would have rated James’ predictions as too high.
However, how certain should you have been of your prediction? No, the answer here is not likely to be 5%. It’s at a meta level: your estimate was 5%, but with what error bars? How certain were you that you had processed and updated on all the information necessary to make such a call?
If the error bars were too big, you shouldn’t have been making bets.
That was indeed my reasoning, but apparently it’s not properly Bayesian :) Real Bayesians don’t use error bars! (They use credible intervals.)
Different word for the same thing.
Sigh.
An interval defines a range, and the endpoints of that range are often represented as error bars when presented graphically. When I said “error bars” I was informally referring to shminux’s measurement of his uncertainty in his prediction, regardless of whether he is using credible intervals, confidence intervals, or some other framework.
Actually, I tried a few times to make sense out of it and failed. Feel free to ELI5.
Maybe a simple example will help. Suppose I have an urn with 100 balls in it. Each ball is either red, yellow or blue. There are, let’s say, five different hypotheses about the distribution of colors in the urn—H1, H2, H3, H4 and H5—and we’re interested in figuring out which hypothesis is correct. The experiment we’re conducting is drawing a single ball from the urn and noting its color. I get a new urn after each individual experiment.
There are obviously three possible outcomes for this experiment, and the frequentist will associate a confidence interval with each outcome. The confidence interval for each outcome will be some set of hypotheses (so, for instance, the confidence interval for “yellow” might be {H2, H4}). These intervals are constructed so that, as the experiment is repeated, in the long run the obtained confidence interval will contain the correct hypothesis at least X% of the time (where X is decided by the experimenter). So, for instance, if I use 95% confidence intervals, then in 95% of the experiments I conduct the correct hypothesis will be included in the confidence interval associated with the outcome I obtain.
In other words, if I say, after each experiment, “The correct hypothesis is one of these”, and point at the confidence interval I obtained in that experiment, then I will be right 95% of the time. The other 5% of the time I may be wrong, perhaps even obviously wrong.
As a contrived example, suppose each urn I am given contains only 5 red balls. Also suppose the confidence interval I associate with “red” is the empty set, and the confidence interval I associate with both “yellow” and “blue” is the set containing all five hypotheses (H1 through H5). Now as I repeat the experiment over and over again, 95% of the time I will get either yellow or blue balls, and I will point at the set containing all hypotheses and say “The correct hypothesis is one of these”, and I will be trivially, obviously right. On the other hand, 5% of the time I will get a red ball, and I will point at the empty set and say “The correct hypothesis is one of these”, and I will be trivially, obviously wrong. But since the red ball only shows up 5% of the time, I will still end up being right 95% of the time. This means that the empty set is actually a kosher 95% confidence interval for the outcome “red”, even though I know the empty set cannot possibly include the correct hypothesis.
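To make the coverage claim concrete, here is a minimal Python sketch (mine, not from the comment above) of this contrived procedure: every urn has exactly 5 red balls, the reported set is empty on “red” and is all five hypotheses otherwise, and over many repetitions the reported set still contains the true hypothesis about 95% of the time.

```python
import random

hypotheses = ["H1", "H2", "H3", "H4", "H5"]

def draw_ball():
    """Draw one ball from an urn that contains exactly 5 red balls out of 100.
    The remaining 95 are some mix of yellow and blue; the split is irrelevant here."""
    return "red" if random.random() < 0.05 else random.choice(["yellow", "blue"])

def confidence_set(outcome):
    """The contrived 95% confidence procedure from the example:
    report the empty set on 'red', and all five hypotheses otherwise."""
    return [] if outcome == "red" else hypotheses

true_hypothesis = "H3"  # whichever hypothesis happens to be the correct one
trials = 100_000
covered = sum(true_hypothesis in confidence_set(draw_ball()) for _ in range(trials))
print(f"coverage: {covered / trials:.3f}")  # close to 0.95, despite the empty set for 'red'
```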
The Bayesian doesn’t like this. She wants intervals that make sense in every particular case. She wants to be able to look at the list of hypotheses in a 95% interval and say “There’s a 95% chance that the correct hypothesis is one of these”. Confidence intervals cannot guarantee this. As we have seen, the empty set can be a legitimate 95% confidence interval, and it’s obvious that the chance of the correct hypothesis being part of the empty set is not 95%. This is why the Bayesian uses credible intervals.
Unlike confidence intervals, with a 95% credible interval you get a list at which you can point and say “There’s a 95% chance that one of these is the correct hypothesis”. And this claim will make sense in every particular instance. Moreover, if your priors are correct (whatever that means), then it is guaranteed that there is a 95% chance that the correct hypothesis is in your 95% credible interval.
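By contrast, here is a rough sketch of how a Bayesian might form a 95% credible set in this setup. The five color distributions and the uniform prior below are made up purely for illustration; the point is only that the posterior over H1 through H5 is computed from the observed color, and the most probable hypotheses are kept until they account for at least 95% of the posterior mass, so the reported set can never be empty.

```python
# Hypothetical color counts (out of 100 balls) under each hypothesis -- invented for the demo.
urn_models = {
    "H1": {"red": 5,  "yellow": 60, "blue": 35},
    "H2": {"red": 50, "yellow": 25, "blue": 25},
    "H3": {"red": 10, "yellow": 10, "blue": 80},
    "H4": {"red": 5,  "yellow": 90, "blue": 5},
    "H5": {"red": 33, "yellow": 33, "blue": 34},
}
prior = {h: 1 / len(urn_models) for h in urn_models}  # uniform prior over hypotheses

def credible_set(observed_color, level=0.95):
    # Posterior via Bayes: P(H | color) is proportional to P(color | H) * P(H).
    unnorm = {h: prior[h] * urn_models[h][observed_color] / 100 for h in urn_models}
    total = sum(unnorm.values())
    posterior = {h: p / total for h, p in unnorm.items()}
    # Keep the most probable hypotheses until they cover at least `level` of the mass.
    chosen, mass = [], 0.0
    for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        chosen.append(h)
        mass += p
        if mass >= level:
            break
    return chosen, mass

print(credible_set("yellow"))  # a non-empty set of hypotheses with >= 95% posterior mass
```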
Upvoted—thanks for a long, even if not fully even-handed, reply (also, it is perhaps not the most transparent choice to explain CIs using a discrete set of hypotheses). I will try to give an example with a continuous-valued parameter.
Say we want to estimate the mean height of LW posters. Ignoring the issue of sock puppets for the moment, we could pick LW usernames out of a hat, show up at each chosen person’s house, and measure their height. Say we do that for 100 LW users picked at random and take the average, call it X1. The 100 users are a “sample” and X1 is a “sample mean.” If we randomly picked a different set of 100, we would get a different average, call it X2. If we picked yet another set of 100, we would get yet another average, call it X3, and so on.
These X1, X2, X3 are realizations of something called the “sampling distribution,” call it Ps. This distribution is a different thing than the distribution that governs height among all LW users, call it Ph. Ph could be anything in general, maybe Gaussian, maybe bimodal, maybe something weird. But if we can figure out what the distribution Ps is, we could make statements of the form
“most of the time, when I draw a sample mean Xi from Ps (that is, most of the time I pick 100 LW users at random and average their heights), this average will be pretty close to the real average height of all LW users, under a very small set of assumptions on Ph.”
This is what confidence intervals are about. In fact, if the number of LW users we pick for our sample is large enough, we can approximate Ps well by a Gaussian distribution because of a neat result called the Central Limit Theorem (again, regardless of what Ph is, or more precisely, under very mild assumptions on Ph).
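As a concrete illustration (my own sketch; the bimodal Ph below is invented purely for the demo), this is how the CLT-based 95% confidence interval for the mean height would be computed from a single sample of 100 users, plus a check that, over many repeated samples, the interval covers the true mean roughly 95% of the time:

```python
import random
import statistics

random.seed(0)

def sample_heights(n=100):
    """Draw n heights (cm) from a made-up Ph; its exact shape doesn't matter much
    for the CLT-based interval, only that it has finite variance."""
    # A crude bimodal stand-in for Ph: a mixture of two Gaussians.
    return [random.gauss(165, 7) if random.random() < 0.4 else random.gauss(178, 7)
            for _ in range(n)]

TRUE_MEAN = 0.4 * 165 + 0.6 * 178  # the true mean of the made-up Ph above

def ci_95(sample):
    """Approximate 95% CI for the mean: sample mean +/- 1.96 standard errors (CLT)."""
    m = statistics.mean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    return m - 1.96 * stderr, m + 1.96 * stderr

trials = 2000
covered = 0
for _ in range(trials):
    lo, hi = ci_95(sample_heights())
    covered += lo <= TRUE_MEAN <= hi
print(f"coverage over {trials} repeated samples: {covered / trials:.3f}")  # roughly 0.95
```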
What makes these kinds of statements powerful is that we can sometimes make them without needing to know much at all about Ph. Sometimes it is useful to be able to say something like that—maybe we are very uncertain about Ph, or we suspect shenanigans with how Ph is defined.
You thought the odds of a Russian intervention were less than 5%
No, he didn’t. He thought the odds of Russia invading Ukraine in the same fashion as the Soviet Union invaded Afghanistan were 5%. This is a rather different thing.
Something like that, yes. I was talking about Russian tanks openly rolling across the border. But Putin found a way to do effectively the same without being so brazen. Which was one of the factors I missed.
I’m going to look at the rationality skill of being able to tell whether you’ve anchored on a prototype. Has this already been explored?
I am not sure what you mean, maybe worth asking in the open thread.