I didn’t know that was possible, thanks. (Wow, a prior with integral=infinity! One that can’t be reached as a posterior after any observation! How’d a Bayesian come by that? But seems to work regardless.) What would be a better example?
ETA: I believe the point raised in that comment still deserves an answer from Bayesians.
Done, but I think a more useful reply could be given if you provided an actual worked example where a frequentist tool leads you to make a different prediction than the application of Bayes would (and where you prefer the frequentist prediction). Something with numbers in it and with the frequentist prediction provided.
Here’s one. There is one data point, distributed according to 0.5*N(0,1) + 0.5*N(mu,1).
Bayes: any improper prior for mu yields an improper posterior (because there’s a 50% chance that the data are not informative about mu). Any proper prior has no calibration guarantee.
Frequentist: Neyman’s confidence belt construction guarantees valid confidence coverage of the resulting interval. If the datum is close to 0, the interval may be the whole real line. This is just what we want [claims the frequentist, not me!]; after all, when the datum is close to 0, mu really could be anything.
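A quick way to see the improper-posterior problem is to look at the likelihood of a single datum as a function of mu. The sketch below (my own illustration, assuming Python with scipy; the datum 0.1 is just an example value) shows that the likelihood flattens out at a nonzero constant as |mu| grows, so it is not integrable against a flat prior.

```python
from scipy.stats import norm

x = 0.1  # a single observed datum (an arbitrary example value)

def likelihood(mu):
    # Mixture density 0.5*N(0,1) + 0.5*N(mu,1), evaluated at the datum,
    # viewed as a function of the unknown mu.
    return 0.5 * norm.pdf(x) + 0.5 * norm.pdf(x - mu)

# As |mu| grows, the second term vanishes but the first stays put, so the
# likelihood plateaus at 0.5*norm.pdf(x) > 0 instead of decaying to zero.
print(likelihood(0.1))   # near the peak
print(likelihood(50.0))  # far away: still about 0.5*norm.pdf(0.1)
```

Because the likelihood never decays below 0.5\*norm.pdf(x), its integral over all mu is infinite under a flat prior, which is exactly why any improper prior for mu yields an improper posterior here.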
Can you explain the terms “calibration guarantee”, and what “the resulting interval” is? Also, I don’t understand why you say there is a 50% chance the data is not informative about mu. This is not a multi-modal distribution; it is blended from N(0,1) and N(mu,1). If mu can be any positive or negative number, then the one data point will tell you whether mu is positive or negative with probability 1.
By “calibration guarantee” I mean valid confidence coverage: if I give a number of intervals at a stated confidence, then the relative frequency with which the estimated quantities fall within their intervals is guaranteed to approach the stated confidence as the number of estimated quantities grows. Here we might imagine a large number of mu parameters and one datum per parameter.
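That guarantee is easy to check by simulation in the simplest setting. The sketch below (my own illustration, assuming Python with numpy, and using plain N(mu,1) data rather than the mixture above) draws many mu parameters, one datum each, and checks how often the standard 95% interval covers its mu.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many "true" parameters, one datum each. How the mus are chosen doesn't
# matter for coverage: the guarantee holds separately for every fixed mu.
mus = rng.uniform(-10.0, 10.0, size=100_000)
data = rng.normal(mus, 1.0)  # one N(mu, 1) observation per parameter

# Standard 95% interval for a single N(mu, 1) datum: x +/- 1.96.
covered = (data - 1.96 <= mus) & (mus <= data + 1.96)
print(covered.mean())  # relative frequency of coverage, close to 0.95
```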
Not easily. The second cousin of this post (a reply to wedrifid) contains a link to a paper on arXiv that gives a bare-bones overview of how confidence intervals can be constructed on page 3. When you’ve got that far I can tell you what interval I have in mind.
I think there’s been a misunderstanding somewhere. Let Z be a fair coin toss. If it comes up heads the datum is generated from N(0,1); if it comes up tails, the datum is generated from N(mu,1). Z is unobserved and mu is unknown. The probability distribution of the datum is as stated above. It will be multimodal if the absolute value of mu is greater than 2 (according to some quick plots I made; I did not do a mathematical proof).
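The multimodality claim checks out numerically (for an equal-weight mixture of unit normals, the threshold is indeed |mu| > 2). A sketch, assuming Python with numpy/scipy, that counts local maxima of the mixture density on a fine grid:

```python
import numpy as np
from scipy.stats import norm

def n_modes(mu, step=0.01):
    # Count strict local maxima of 0.5*N(0,1) + 0.5*N(mu,1) on a fine grid.
    xs = np.arange(-5.0, mu + 5.0, step)
    ys = 0.5 * norm.pdf(xs) + 0.5 * norm.pdf(xs - mu)
    mid = ys[1:-1]
    return int(np.sum((mid > ys[:-2]) & (mid > ys[2:])))

print(n_modes(1.0))  # well below the threshold: one mode
print(n_modes(3.0))  # well above the threshold: two modes
```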
If I observe the datum 0.1, is mu greater than or less than 0?
Thanks Cyan.
I’ll get back to you when (and if) I’ve had time to get my head around Neyman’s confidence belt construction, with which I’ve never had cause to acquaint myself.
This paper has a good explanation. Note that I’ve left one of the steps (the “ordering” that determines inclusion into the confidence belt) undetermined. I’ll tell you the ordering I have in mind if you get to the point of wanting to ask me.
That’s a lot of integration to get my head around.
All you need is page 3 (especially the figure). If you understand that in depth, then I can tell you what the confidence belt for my problem above looks like. Then I can give you a simulation algorithm and you can play around and see exactly how confidence intervals work and what they can give you.
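For what it's worth, here is a minimal sketch of the two steps of the construction in the simplest case: one datum from N(mu,1), with a central 95% acceptance region. This is my own illustration with an ordering I chose, not Cyan's mixture problem or his ordering (Python with numpy assumed).

```python
import numpy as np

# Neyman construction for one datum x ~ N(mu, 1), central ordering.
# Step 1 (the "belt"): for each candidate mu, an acceptance region
# [mu - 1.96, mu + 1.96] that contains the datum with probability 0.95.
mu_grid = np.arange(-10.0, 10.0, 0.001)

def confidence_set(x):
    # Step 2 (inversion): the confidence set for an observed x is every
    # mu whose acceptance region contains x.
    ok = (mu_grid - 1.96 <= x) & (x <= mu_grid + 1.96)
    return mu_grid[ok].min(), mu_grid[ok].max()

print(confidence_set(0.1))  # approximately (-1.86, 2.06)
```

In the mixture problem the acceptance regions are built from the mixture density instead, and for a datum near 0 the inverted set may be the whole real line, which is the behavior described above.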
It’s called an improper prior. There’s been some argument about their use, but they seldom lead to problems. The posteriors usually have much better behavior at infinity, and when they don’t, that’s the theory telling us that the information doesn’t determine the solution to the problem.
The observation that an improper prior cannot be obtained as a posterior distribution is kind of trivial. It is meant to represent a total lack of information w.r.t. some parameter. As soon as you have made an observation, you have more information than that.
Maybe the difference lies in the format of answers?
We know: a set of n outputs of a random number generator with a normal distribution. Say {3.2, 4.5, 8.1}.
We don’t know: mean m and variance v.
Your proposed answer: m = 5.26, v = 6.44.
A Bayesian’s answer: a probability distribution P(m) of the mean and another distribution Q(v) of the variance.
How does a frequentist get them? If he doesn’t have them, what’s his confidence in m = 5.26 and v = 6.44? What if the set contains only one number? What is the frequentist’s estimate for v then? Note that a Bayesian has no problem even if the data set is empty; he simply falls back on his priors. If the data set is large, the Bayesian’s answer will inevitably converge to a delta function around the frequentist’s estimate, no matter what the priors are.
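The convergence claim can be sketched in the simplest conjugate case: known unit variance and a flat (improper) prior on m, where the posterior is N(sample mean, 1/n). This is my own illustration (Python with numpy assumed), not a general proof:

```python
import numpy as np

rng = np.random.default_rng(1)
true_m = 5.0

for n in (3, 30, 3000):
    data = rng.normal(true_m, 1.0, size=n)
    # Flat prior on m, known unit variance: the posterior is N(mean(data), 1/n),
    # i.e. centered on the frequentist point estimate with width ~ 1/sqrt(n).
    post_mean, post_sd = data.mean(), 1.0 / np.sqrt(n)
    print(n, post_mean, post_sd)  # the posterior narrows toward a spike
```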
http://www.xuru.org/st/DS.asp
50% confidence interval for mean: 4.07 to 6.46, stddev: 2.15 to 4.74
90% confidence interval for mean: 0.98 to 9.55, stddev: 1.46 to 11.20
If there’s only one sample, the calculation fails due to division by n-1 = 0, so the frequentist says “no answer”. The Bayesian says the same if he used the improper prior Cyan mentioned.
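Those intervals can be roughly reproduced with the standard t (for the mean) and chi-square (for the standard deviation) formulas. A sketch assuming Python with numpy/scipy; small rounding differences from the linked calculator are expected:

```python
import numpy as np
from scipy import stats

data = np.array([3.2, 4.5, 8.1])
n, m, s = len(data), np.mean(data), np.std(data, ddof=1)  # ddof=1: divide by n-1

for conf in (0.50, 0.90):
    # t interval for the mean
    t = stats.t.ppf((1 + conf) / 2, df=n - 1)
    mean_ci = (m - t * s / np.sqrt(n), m + t * s / np.sqrt(n))
    # chi-square interval for the standard deviation
    q_hi = stats.chi2.ppf((1 + conf) / 2, df=n - 1)
    q_lo = stats.chi2.ppf((1 - conf) / 2, df=n - 1)
    sd_ci = (np.sqrt((n - 1) * s**2 / q_hi), np.sqrt((n - 1) * s**2 / q_lo))
    print(conf, mean_ci, sd_ci)
```

With a single data point, `np.std(data, ddof=1)` divides by n-1 = 0 and returns nan, matching the frequentist “no answer” above.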
Hm, should I understand this as the frequentist assuming a normal distribution of the mean, with its peak at the estimated 5.26?
If so, then frequentism = Bayes + flat prior.
Improper priors are however quite tricky, they may lead to paradoxes such as the two-envelope paradox.
The prior for variance that matches the frequentist conclusion isn’t flat. And even if it were, a flat prior for variance implies a non-flat prior for standard deviation and vice versa. :-)
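The change-of-variables point is easy to see by sampling. A sketch (my own illustration, Python with numpy assumed): draw the variance from a flat distribution and look at the implied distribution of the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)

# A "flat" prior on the variance v over (0, 1]...
v = rng.uniform(0.0, 1.0, size=1_000_000)
# ...implies a non-flat prior on the stddev s = sqrt(v): its density is 2s
# (the Jacobian dv/ds = 2s), so large stddevs are far more likely than small.
s = np.sqrt(v)
hist, _ = np.histogram(s, bins=10, range=(0.0, 1.0))
print(hist / len(s))  # rising from about 0.01 up to about 0.19, not flat
```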
Of course, I meant a flat distribution of the mean. The variance, at least, cannot be negative.
In this problem, yes. In the general case no one knows exactly what the flat prior is, e.g. if there are constraints on model parameters.
Using the flat improper prior I was talking about before, when there’s only one data point the posterior distribution is improper, so the Bayesian answer is the same as the frequentist’s.