Sigh.
An interval defines a range, and when presented graphically the endpoints of that range are often drawn as error bars. When I said “error bars” I was informally referring to shminux’s measurement of his uncertainty in his prediction, regardless of whether he is using credible intervals, confidence intervals, or some other framework.
Actually, I tried a few times to make sense out of it and failed. Feel free to ELI5.
Maybe a simple example will help. Suppose I have an urn with 100 balls in it. Each ball is either red, yellow, or blue. There are, let’s say, five different hypotheses about the distribution of colors in the urn: H1, H2, H3, H4, and H5. We’re interested in figuring out which hypothesis is correct. The experiment we’re conducting is drawing a single ball from the urn and noting its color. I get a new urn, with the same composition, after each individual experiment.
There are obviously three possible outcomes for this experiment, and the frequentist will associate a confidence interval with each outcome. The confidence interval for each outcome will be some set of hypotheses (so, for instance, the confidence interval for “yellow” might be {H2, H4}). These intervals are constructed so that, as the experiment is repeated, in the long run the obtained confidence interval will contain the correct hypothesis at least X% of the time (where X is decided by the experimenter). So, for instance, if I use 95% confidence intervals, then in at least 95% of the experiments I conduct the correct hypothesis will be included in the confidence interval associated with the outcome I obtain.
In other words, if I say, after each experiment, “The correct hypothesis is one of these”, and point at the confidence interval I obtained in that experiment, then I will be right 95% of the time. The other 5% of the time I may be wrong, perhaps even obviously wrong.
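To make “constructed so that” concrete, here is a minimal sketch of the validity check, assuming made-up color distributions for the five hypotheses and a made-up outcome-to-set mapping (none of these numbers come from the example; they are illustrative only):

```python
# Hypothetical color distributions for each hypothesis: P(color | hypothesis).
# These numbers are invented for illustration; the example never specifies them.
LIKELIHOOD = {
    "H1": {"red": 0.05, "yellow": 0.50, "blue": 0.45},
    "H2": {"red": 0.30, "yellow": 0.40, "blue": 0.30},
    "H3": {"red": 0.05, "yellow": 0.05, "blue": 0.90},
    "H4": {"red": 0.60, "yellow": 0.20, "blue": 0.20},
    "H5": {"red": 0.10, "yellow": 0.80, "blue": 0.10},
}

# A candidate mapping from outcomes to hypothesis sets (also invented).
CANDIDATE_CI = {
    "red": {"H2", "H3", "H4", "H5"},
    "yellow": {"H1", "H2", "H4", "H5"},
    "blue": {"H1", "H2", "H3", "H4", "H5"},
}

# The mapping is a valid 95% confidence procedure iff, under EVERY hypothesis,
# the probability of drawing an outcome whose set contains that hypothesis
# is at least 0.95.
for h, dist in LIKELIHOOD.items():
    coverage = sum(p for color, p in dist.items() if h in CANDIDATE_CI[color])
    print(h, f"coverage = {coverage:.2f}", "ok" if coverage >= 0.95 else "FAILS")
```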
As a contrived example, suppose each urn I am given contains only 5 red balls. Also suppose the confidence interval I associate with “red” is the empty set, and the confidence interval I associate with both “yellow” and “blue” is the set containing all five hypotheses (H1 through H5). Now as I repeat the experiment over and over again, 95% of the time I will get either yellow or blue balls, and I will point at the set containing all hypotheses and say “The correct hypothesis is one of these”, and I will be trivially, obviously right. On the other hand, 5% of the time I will get a red ball, and I will point at the empty set and say “The correct hypothesis is one of these”, and I will be trivially, obviously wrong. But since the red ball only shows up 5% of the time, I will still end up being right 95% of the time. This means that the empty set is actually a kosher 95% confidence interval for the outcome “red”, even though I know the empty set cannot possibly include the correct hypothesis.
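And a quick simulation sketch of the contrived case itself (I have assumed the 95 non-red balls split 50 yellow / 45 blue, and that H1 happens to be the hypothesis describing the urn; both details are invented and do not affect the coverage figure):

```python
import random

random.seed(0)

HYPOTHESES = {"H1", "H2", "H3", "H4", "H5"}
TRUE_HYPOTHESIS = "H1"  # invented: the hypothesis that actually describes the urn

# Each urn: 5 red balls; the yellow/blue split of the other 95 is irrelevant.
URN = ["red"] * 5 + ["yellow"] * 50 + ["blue"] * 45

# The contrived confidence sets from the example: the empty set for "red",
# every hypothesis for "yellow" and "blue".
CONTRIVED_CI = {"red": set(), "yellow": HYPOTHESES, "blue": HYPOTHESES}

trials = 100_000
covered = sum(
    TRUE_HYPOTHESIS in CONTRIVED_CI[random.choice(URN)] for _ in range(trials)
)
print(f"coverage = {covered / trials:.3f}")  # ~0.950, despite the empty set
```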
The Bayesian doesn’t like this. She wants intervals that make sense in every particular case. She wants to be able to look at the list of hypotheses in a 95% interval and say “There’s a 95% chance that the correct hypothesis is one of these”. Confidence intervals cannot guarantee this. As we have seen, the empty set can be a legitimate 95% confidence interval, and it’s obvious that the chance of the correct hypothesis being part of the empty set is not 95%. This is why the Bayesian uses credible intervals.
Unlike confidence intervals, with a 95% credible interval you get a list at which you can point and say “There’s a 95% chance that one of these is the correct hypothesis”. And this claim will make sense in every particular instance. Moreover, if your priors are correct (whatever that means), then it is guaranteed that there is a 95% chance that the correct hypothesis is in your 95% credible interval.
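For contrast, here is a minimal sketch of how the Bayesian might form a 95% credible set in the urn example, assuming a uniform prior, the same made-up color distributions as in the earlier sketch, and a single observed yellow ball (all illustrative choices, not part of the original example):

```python
# Same invented color distributions as before: P(color | hypothesis).
LIKELIHOOD = {
    "H1": {"red": 0.05, "yellow": 0.50, "blue": 0.45},
    "H2": {"red": 0.30, "yellow": 0.40, "blue": 0.30},
    "H3": {"red": 0.05, "yellow": 0.05, "blue": 0.90},
    "H4": {"red": 0.60, "yellow": 0.20, "blue": 0.20},
    "H5": {"red": 0.10, "yellow": 0.80, "blue": 0.10},
}
PRIOR = {h: 0.2 for h in LIKELIHOOD}  # uniform prior over the five hypotheses

def posterior(color):
    # Bayes' rule: P(H | color) is proportional to P(color | H) * P(H).
    unnorm = {h: LIKELIHOOD[h][color] * PRIOR[h] for h in LIKELIHOOD}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

def credible_set(post, level=0.95):
    # Add hypotheses in decreasing order of posterior probability until
    # the accumulated mass reaches the requested level.
    chosen, mass = [], 0.0
    for h, p in sorted(post.items(), key=lambda kv: -kv[1]):
        chosen.append(h)
        mass += p
        if mass >= level:
            break
    return chosen, mass

post = posterior("yellow")
hyps, mass = credible_set(post)
print(f"95% credible set: {hyps} (posterior mass {mass:.3f})")
```

By construction this set always carries at least 95% of the posterior mass, so, unlike a confidence set, it can never be empty.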
Upvoted—thanks for a long, even if not fully even-handed, reply (also, it is perhaps not the most transparent choice to explain CIs using a discrete set of hypotheses). I will try to give an example with a continuous-valued parameter.
Say we want to estimate the mean height of LW posters. Ignoring the issue of sock puppets for the moment, we could pick LW usernames out of a hat, show up at each chosen user’s house, and measure their height. Say we do that for 100 randomly picked LW users and take the average; call it X1. The 100 users are a “sample” and X1 is a “sample mean.” If we randomly picked a different set of 100, we would get a different average, call it X2. If again a different set of 100, we would get yet another average, call it X3, and so on.
These X1, X2, X3 are realizations of something called the “sampling distribution,” call it Ps. This distribution is different from the distribution that governs height among all LW users, call it Ph. Ph could be anything in general: maybe Gaussian, maybe bimodal, maybe something weird. But if we can figure out what the distribution Ps is, we can make statements of the form
“most of the times that I draw a sample mean Xi from Ps, i.e. most of the times that I pick 100 LW users at random and average their heights, this average will be pretty close to the real average height of all LW users, under a very small set of assumptions on Ph.”
This is what confidence intervals are about. In fact, if the number of LW users we pick for our sample is large enough, we can well-approximate Ps by a Gaussian distribution because of a neat result called the Central Limit Theorem (again, regardless of what Ph is, or more precisely under very mild assumptions on Ph).
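Here is a small simulation sketch of both points, under an assumed bimodal Ph; the mixture parameters, the sample size of 100, and the number of replications are all invented for illustration:

```python
import random
import statistics

random.seed(0)

def draw_height():
    # An invented, deliberately non-Gaussian (bimodal) Ph: a 50/50 mixture
    # of two Gaussian subpopulations, heights in cm.
    if random.random() < 0.5:
        return random.gauss(165, 6)
    return random.gauss(183, 6)

TRUE_MEAN = 174.0  # (165 + 183) / 2, by symmetry of the mixture

n = 100                 # LW users per sample
replications = 10_000
covered = 0
for _ in range(replications):
    sample = [draw_height() for _ in range(n)]
    xbar = statistics.fmean(sample)              # a draw Xi from Ps
    se = statistics.stdev(sample) / n ** 0.5     # estimated standard error
    # Gaussian-approximation 95% CI, justified by the Central Limit Theorem.
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
    covered += lo <= TRUE_MEAN <= hi

print(f"coverage of the 95% CI: {covered / replications:.3f}")  # ~0.95
```

Even though Ph here is strongly bimodal, the sample means cluster approximately Gaussian-fashion around 174 cm, and the interval covers the true mean in roughly 95% of the replications.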
What makes these kinds of statements powerful is that we can sometimes make them without needing to know much at all about Ph. Sometimes it is useful to be able to say something like that—maybe we are very uncertain about Ph, or we suspect shenanigans with how Ph is defined.