Bayesianism tells us that there is a unique answer in the form of a probability for the next coin to be heads
I’m obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can’t meaningfully state a probability?
For example, let’s say we have observed 100 samples of a real-valued random variable. We can invoke the maximum entropy principle and use the normal distribution (which is the maximum-entropy distribution over the reals for a given mean and variance). We then use standard methods to estimate the population mean, and can even provide a probability that it lies in a certain interval.
But how valid is this result when we knew nothing of the original distribution? What if it was something awkward like the Cauchy distribution? It has no mean, so our interval is meaningless. You can’t just say “well, we’re 60% certain it’s in this interval, which leaves a 40% chance of us being wrong”, because it doesn’t: the mean isn’t outside the interval either! A complete answer would allow for a third outcome, that the mean isn’t defined, but how exactly do you assign a number to that probability?
With this in mind, do we still believe that it’s not wrong (or less wrong? :D) to assume a normal distribution, make our calculations, and decide how much we’d bet that the mean of the next 100,000 samples is in the range −100..100? (The sample mean of Cauchy samples never settles down, no matter how many you add.)
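To see that last parenthetical concretely, here is a quick numpy sketch (assuming a standard Cauchy, location 0 and scale 1): the running mean of normal samples settles down quickly, while the Cauchy running mean keeps jumping around no matter how many samples you take.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Running means of samples from a standard Cauchy versus a standard normal.
# The normal running mean converges; the Cauchy running mean never does.
cauchy = np.cumsum(rng.standard_cauchy(n)) / np.arange(1, n + 1)
normal = np.cumsum(rng.standard_normal(n)) / np.arange(1, n + 1)

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>7}  Cauchy running mean={cauchy[k - 1]:>9.2f}  "
          f"normal running mean={normal[k - 1]:>7.4f}")
```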
I’m obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can’t meaningfully state a probability?
It does seem to be widely accepted and largely undebated. However, it is also widely rejected and largely undebated, for example by Andrew Gelman, Cosma Shalizi, Ken Binmore, and Leonard Savage (to name just the people I happen to have seen rejecting it—I am not a statistician, so I do not know how representative these are of the field in general, or if there has actually been a substantial debate anywhere). None of them except Ken Binmore actually present arguments against it in the material I have read; they merely dismiss the idea of a universal prior as absurd. But in mathematics only one thing is absurd, namely a contradiction, and by that standard only Ken Binmore has offered any mathematical arguments. He gives two in his book “Rational Decisions”: one based on Gödel-style self-reference, and the other based on a formalisation of the concept of “knowing that” as the box operator of S5 modal logic. I haven’t studied the first but am not convinced by the second, which fails at the outset by defining “I know that” as an extensional predicate. (He identifies a proposition P with the set of worlds in which it is true, and assumes that “I know that P” is a function of the set representing P, not of the syntactic form of P. Therefore by that definition of knowing, since I know that 2+2=4, I know every true statement of mathematics, since they are all true in all possible worlds.)
(ETA: Binmore’s S5 argument can also be found online here.)
(ETA2: For those who don’t have a copy of “Rational Decisions” to hand, here’s a lengthy and informative review of it.)
These people distinguish “small-world” Bayesianism from “large-world” Bayesianism, they themselves being small-worlders. Large-worlders would include Eliezer, Marcus Hutter, and everyone else who believes in the possibility of a universal prior.
A typical small-world Bayesian argument would be: I hypothesise that a certain variable has a Gaussian distribution with unknown parameters over which I have a prior distribution; I observe some samples; I obtain a posterior distribution for the parameters. A large-world Bayesian also makes arguments of this sort and they both make the same calculations.
Where they part company is when the variable in fact does not have a Gaussian distribution. For example, suppose it is a sum of two widely separated Gaussians. According to small-worlders, the large-world Bayesian is stuck with his prior hypothesis of a single Gaussian, which no quantity of observations will force him to relinquish, since it is his prior. His estimate of the mean of the Gaussian will drift aimlessly up and down like the Flying Dutchman between the two modes of the real distribution, unable to see the world beyond his prior. According to large-worlders, that prior was not the real prior which one started from. That whole calculation was really conditional on the assumption of a Gaussian, and this assumption itself has a certain prior probability less than 1, and was chosen from a space of all possible hypothetical distributions. The small-worlders reply that this is absurd, declare victory, and walk away without listening to the large-worlders explain how to choose universal priors. Instead, small-worlders insist that to rectify the fault of having hypothesised the wrong model, one must engage in a completely different non-Bayesian activity called model-checking. Chapter 6 of Gelman’s book “Bayesian Data Analysis” is all about that, but I haven’t read it. There is some material in this paper by Gelman and Shalizi.
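Here is a toy version of that scenario, as a sketch only: the small-world model below is a single Gaussian with known variance 1 and a conjugate Normal(0, 100²) prior on its mean, while the data actually come from two Gaussians at ±10.

```python
import numpy as np

rng = np.random.default_rng(1)

# The data really come from two widely separated Gaussians.
data = np.concatenate([rng.normal(-10, 1, 500), rng.normal(+10, 1, 500)])

# Small-world model: a single Gaussian with known variance 1 and unknown mean,
# with a conjugate Normal(0, 100**2) prior on the mean.
prior_mean, prior_var, lik_var = 0.0, 100.0**2, 1.0
n = len(data)

post_var = 1.0 / (1.0 / prior_var + n / lik_var)
post_mean = post_var * (prior_mean / prior_var + data.sum() / lik_var)

# The posterior puts the mean near 0, between the two modes where there is
# almost no data, and is extremely confident about it.
print(f"posterior mean = {post_mean:.2f}, posterior sd = {post_var ** 0.5:.3f}")
```

No quantity of further observations changes the picture, because single Gaussians are the only hypotheses this model can weigh.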
(ETA: I have now read Gelman ch.6. Model-checking is performed by various means, such as (1) eyeballing visualisations of the real data and simulated data generated by the model, (2) comparing statistics evaluated for both real and simulated data, or (3) seeing if the model predicts things that conflict with whatever other knowledge you have of the phenomenon being studied.)
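Check (2) is the easiest to sketch for the toy example above; the test statistic below is just the sample standard deviation, and the replicated data are drawn from the fitted single-Gaussian model (again only a sketch under the same assumptions as before).

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy setup: bimodal data, fitted as a single Gaussian with known
# variance 1 and a conjugate Normal(0, 100**2) prior on the mean.
data = np.concatenate([rng.normal(-10, 1, 500), rng.normal(+10, 1, 500)])
n = len(data)
post_var = 1.0 / (1.0 / 100.0**2 + n)
post_mean = post_var * data.sum()

# Posterior predictive check: draw replicated data sets from the fitted model
# and compare a test statistic (the sample standard deviation) with the real
# data.  The real sd is about 10, the replicated sds are about 1: the model
# fails the check badly, even though its posterior looked very confident.
mus = rng.normal(post_mean, post_var**0.5, size=1000)
replicated = rng.normal(mus[:, None], 1.0, size=(1000, n))
print(f"real sd = {data.std():.2f}, "
      f"mean replicated sd = {replicated.std(axis=1).mean():.2f}")
```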
And that’s as far as I’ve read on the subject. Have the small-worlders ever responded to large-worlders’ construction of universal priors? Have the large-worlders ever demonstrated that universal priors are more than a theoretical construction without practical application? Has “model checking” ever been analysed in large-world Bayesian terms?
I’m obviously new to this whole thing, but is this a largely undebated, widely accepted view on probabilities? That there are NO situations in which you can’t meaningfully state a probability?
Actually, yes, but you’re right to be surprised because it’s (to my mind at least) an incredible result. Cox’s theorem establishes this as a mathematical result from the assumption that you want to reason quantitatively and consistently. Jaynes gives a great explanation of this in chapters 1 and 2 of his book “Probability Theory”.
But how valid is this result when we knew nothing of the original distribution?
The short answer is that a probability always reflects your current state of knowledge. If I told you absolutely nothing about the coin or the distribution, then you would be entirely justified in assigning 50% probability to heads (on the basis of symmetry). If I told you the exact distribution over p then you would be justified in assigning a different probability to heads. But in both cases I carried out the same experiment—it’s just that you had different information in the two trials. You are justified in assigning different probabilities because Probability is in the mind. The knowledge you have about the distribution over p is just one more piece of information to roll into your probability.
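Concretely, rolling that knowledge in just means taking the expected value of p under whatever distribution describes your state of knowledge. A small sketch, with hypothetical Beta distributions standing in for the two states of knowledge:

```python
from scipy.stats import beta

# P(heads) is the expected value of the bias p under your current distribution
# for p.  Knowing nothing beyond symmetry: p ~ Beta(1, 1), i.e. uniform.
# Being told the coin is strongly heads-biased: say p ~ Beta(8, 2).
for a, b in [(1, 1), (8, 2)]:
    print(f"p ~ Beta({a}, {b}):  P(heads) = E[p] = {beta(a, b).mean():.2f}")
```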
With this in mind, do we still believe that it’s not wrong (or less wrong? :D) to assume a normal distribution, make our calculations and decide how much you’d bet that the mean of the next 100,000 samples is in the range −100..100?
That depends on the probability that the coin flipper chooses a Cauchy distribution. If this were a real experiment then you’d have to take into account unwieldy facts about human psychology, physics of coin flips, and so on. Cox’s theorem tells us that in this case there is a unique answer in the form of a probability, but it doesn’t guarantee that we have time, resources, or inclination to actually calculate it. If you want to avoid all those kinds of complicated facts then you can start from some reasonable mathematical assumptions such as a normal distribution over p—but if your assumptions are wrong then don’t be surprised when your conclusions turn out wrong.
it doesn’t guarantee that we have time, resources, or inclination to actually calculate it
Here’s the way of understanding this point that finally made things clearer:
Yes, there exists a more accurate answer, and we might even be able to discover it by investing some time. But until we do, the fact that such an answer exists is completely irrelevant. It is orthogonal to the problem.
In other words, doing the calculations would give us more information to base our prediction on, but knowing that we can do the calculation doesn’t change it in the slightest.
Thus, we are justified in treating this as “don’t know at all”, even though it seems that we do know something.
Probability is in the mind
Great read, and I think things have finally fit into the right places in my head. Now I just need to learn to guesstimate what the maximum entropy distribution might look like for a given set of facts :)
Well, that and how to actually churn out confidence intervals and expected values for experiments like this one, so that I know how much to bet given a particular set of knowledge.
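For the plain normal-model case the machinery is at least mechanical: with the standard noninformative prior, the posterior for the population mean is a Student-t centred at the sample mean, so the interval falls out in a few lines and numerically coincides with the classical confidence interval. A sketch, with made-up data standing in for the 100 samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=100)  # hypothetical stand-in for the 100 samples

n, mean, sd = len(x), x.mean(), x.std(ddof=1)

# With the standard noninformative prior on (mean, variance), the posterior
# for the population mean is Student-t with n-1 degrees of freedom, centred at
# the sample mean with scale sd / sqrt(n).
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=sd / np.sqrt(n))
print(f"95% credible interval for the mean: ({lo:.2f}, {hi:.2f})")
```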
Thanks for this, it really helped.
Cool, glad it was helpful :)
Here is one interesting post about how to encourage our brains to output specific probabilities: http://lesswrong.com/lw/3m6/techniques_for_probability_estimates/
Actually I didn’t explain that middle bit well at all. Just see http://lesswrong.com/lw/oi/mind_projection_fallacy/