I agree that #1 is part of how a perfect Bayesian thinks, if by ‘a correct prior...before you see any evidence’ you have the maximum entropy prior in mind.
Allow me to introduce you to the Brandeis dice problem. We have a six-sided die, sides marked 1 to 6, possibly unfair. We throw it many times (say, a billion) and obtain an average value of 3.5. Using that information alone, what’s your probability distribution for the next throw of the die? A naive application of the maxent approach says we should pick the distribution over {1,2,3,4,5,6} with mean 3.5 and maximum entropy, which is the uniform distribution; that is, the die is fair. But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity! The reason: a die that’s biased towards 3 and 4 makes an observed mean of 3.5 even more likely than a fair die does.
Does that mean you should give up your belief in maxent, your belief in Bayes, your belief in the existence of “perfect” priors for all problems, or something else? You decide.
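For concreteness, here’s a minimal sketch of the maxent side of the problem (my own illustration, not from the linked paper). Maximizing entropy over {1,…,6} subject to a mean constraint gives a distribution of the form p_k ∝ exp(λk); solving for λ numerically confirms that a target mean of 3.5 lands exactly on the uniform distribution, while any other target tilts the die:

```python
# A sketch, not from the paper: the maxent distribution over {1,...,6}
# under a mean constraint has the form p_k ∝ exp(lam * k); we find lam
# by bisection on the mean, which is monotonically increasing in lam.
import math

def maxent_die(target_mean, lo=-10.0, hi=10.0, iters=100):
    def mean_for(lam):
        weights = [math.exp(lam * k) for k in range(1, 7)]
        return sum(k * w for k, w in zip(range(1, 7), weights)) / sum(weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    weights = [math.exp(lam * k) for k in range(1, 7)]
    total = sum(weights)
    return [w / total for w in weights]

print(maxent_die(3.5))  # ≈ [0.1667] * 6 -- the uniform "fair die"
print(maxent_die(4.5))  # tilted towards the high faces
```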
But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity!
In this example, what information are we Bayesian-updating on?
But if we start with a prior over all possible six-sided dice and do Bayesian updating, we get a different answer that diverges from fairness more and more as the number of throws goes to infinity!
I’m nearly positive that the linked paper (and in particular, the above-quoted conclusion) is just wrong. Many years ago I checked the calculations carefully and found that the results come from a computer program that is no longer available, so it’s definitely possible that they were just due to a bug. Meanwhile, my paper copy of PT:LOS contains a section which purports to show that Bayesian updating and maximum entropy give the same answer in the large-sample limit. I checked the math there too, and it seemed sound.
I might be able to offer more than my unsupported assertions when I get home from work.
I’ve checked carefully in PT:LOS for the section I thought I remembered, but I can’t find it. I distinctly remember the form of the theorem (it was a squeeze theorem), but I do not recall where I saw it. I think Jaynes was the author, so it might be in one of the papers listed here… or it could have been someone else entirely, or I could be misremembering. But I don’t think I’m misremembering, because I recall working through the proof and becoming satisfied that Uffink must have made a coding error.
We throw it many times (say, a billion) and obtain an average value of 3.5. Using that information alone
So my prior state of knowledge about the die is entirely characterized by N=10^9 and m=3.5, with no knowledge of the shape of the distribution? It’s not obvious to me how you’re supposed to turn that, plus your background knowledge about what sort of object a die is, into a prior distribution, even one that maximizes entropy. The linked article mentions a “constraint rule” which seems to be an additional thing.
This sort of thing is rather thoroughly covered by Jaynes in PT:TLOS as I recall, and could make a good exercise for the Book Club when we come to the relevant chapters. In particular section 10.3 “How to cheat at coin and die tossing” contains the following caveat:
The results of tossing a die many times do not tell us any definite number characteristic only of the die. They tell us also something about how the die was tossed. If you toss ‘loaded’ dice in different ways, you can easily alter the relative frequencies of the faces. With only slightly more difficulty, you can still do this if your dice are perfectly ‘honest’.
And later:
The problems in which intuition compels us most strongly to a uniform probability assignment are not the ones in which we merely apply a principle of ‘equal distribution of ignorance’. Thus, to explain the assignment of equal probabilities to heads and tails on the grounds that we ‘saw no reason why either face should be more likely than the other’, fails utterly to do justice to the reasoning involved. The point is that we have not merely ‘equal ignorance’. We also have positive knowledge of the symmetry of the problem; and introspection will show that when this positive knowledge is lacking, so also is our intuitive compulsion toward a uniform distribution.
Hah. The dice example and the application of maxent to it come originally from Jaynes himself; see page 4 of the linked paper.
I’ll try to reformulate the problem without the constraint rule, to clear matters up or maybe confuse them even more. Imagine that, instead of you throwing the die a billion times and obtaining a mean of 3.5, a truthful deity told you that the mean was 3.5. First question: do you think the maxent solution in that case is valid, for some meaning of “valid”? Second question: why do you think it disagrees with Bayesian updating as you throw the die a huge number of times and learn only the mean? Is the information you receive somehow different in quality? Third question: which answer is actually correct, and what does “correct” mean here?

I think I’d answer, “the mean of what?” ;)
I’m not really qualified to comment on the methodological issues since I have yet to work through the formal meaning of “maximum entropy” approaches. What I know at this stage is the general argument for justifying priors, i.e. that they should in some manner reflect your actual state of knowledge (or uncertainty), rather than be tainted by preconceptions.
If you appeal to intuitions involving a particular physical object (a die) and simultaneously pick a particular mathematical object (the uniform prior) without making a solid case that the latter is our best representation of the former, I won’t be overly surprised at some apparently absurd result.
It’s not clear to me for instance what we take a “possibly biased die” to be. Suppose I have a model that a cubic die is made biased by injecting a very small but very dense object at a particular (x,y,z) coordinate in a cubic volume. Now I can reason based on a prior distribution for (x,y,z), and ask what probability theory can tell me about the posterior distribution, given a number of throws with a certain mean.
Now a six-sided die is normally symmetrical in such a way that 3 and 4 are on opposite sides, and I’m having trouble even seeing how a die could be biased “towards 3 and 4” under such conditions. Which means a prior which makes that a more likely outcome than a fair die should probably be ruled out by our formalization, or else we should also model our uncertainty over which faces have which numbers.
I’m having trouble even seeing how a die could be biased “towards 3 and 4” under such conditions.
If the die is slightly shorter along the 3-4 axis than along the 1-6 and 2-5 axes, then the 3 and 4 faces will have slightly greater surface area than the other faces.
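A toy way to see this numerically (my own sketch; taking face probability proportional to face area is a crude assumption, not real dice physics): shrink the 3-4 axis of the box slightly, and the 3 and 4 faces gain probability while the mean stays pinned at 3.5 by symmetry:

```python
# A toy model, my assumption only: each face's probability is taken
# proportional to its area. Axes: 1-6 along x, 2-5 along y, 3-4 along z,
# for a box with dimensions (a, b, c).
def face_probs(a, b, c):
    areas = {1: b * c, 6: b * c,   # faces perpendicular to the x axis
             2: a * c, 5: a * c,   # faces perpendicular to the y axis
             3: a * b, 4: a * b}   # faces perpendicular to the z axis
    total = sum(areas.values())
    return {face: area / total for face, area in areas.items()}

# A die 5% shorter along the 3-4 axis: faces 3 and 4 become more likely
# than 1/6, yet the mean stays exactly 3.5 because opposite faces sum to 7.
probs = face_probs(1.0, 1.0, 0.95)
print(probs)                                       # p(3) = p(4) ≈ 0.172
print(sum(face * p for face, p in probs.items()))  # 3.5
```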
Our models differ, then: I was assuming a strictly cubic die. So maybe we should also model our uncertainty over the dimensions of the (parallelepipedic) die.
But it seems in any case that we are circling back to the question of model checking, via the requirement that we should first be clear about what our uncertainty is about.

Cyan, I was hoping you’d show up. What do you think about this whole mess?

I find myself at a loss to give a brief answer. Can you ask a more specific question?
In the large N limit, given only the information that the mean is exactly 3.5, the obvious conclusion is that one is in a thought experiment, because that’s an absurd thing to choose to measure and an adversary has chosen the result to make us regret the choice.
More generally, one should revisit the hypothesis that the rolls of the die are independent. Yes, rolling only 1 and 6 is more likely to get a mean of 3.5 than rolling all six numbers, but still quite unlikely. Model checking!
EDIT: I am an eejit. Dangit, need to remember to stop and think before posting.
Umm, not quite.
The die being biased towards 2 and 5 gives the same probability of a mean of 3.5 as a die biased towards 3 and 4.
As does 1,6 bias.
So, given these three possibilities, an equal distribution is once again shown to be correct. By picking one of the three, and ignoring the other two, you can (accidentally) trick some people, but you cannot trick probability.
This is before even looking at the maths, and/or asking about the precision to which the mean is given (i.e. is it 2 significant figures, 13, or a billion? Rounded to the nearest 0.5?)

EDIT: this appears to be incorrect, sorry.
Intuitively, I’d say that a die biased towards 1 and 6 makes hitting the mean (with some given precision) less likely than a die biased towards 3 and 4, because it spreads out the distribution wider. But you don’t have to take my word for it, see the linked paper for calculations.
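Here’s a rough numerical check of that intuition (my own back-of-envelope normal approximation, not the paper’s calculation). For any die whose true mean is 3.5, the sample mean of N throws is approximately Normal(3.5, σ²/N), so the probability of observing a mean within ±ε of 3.5 scales like 1/σ; the low-variance 3,4-biased die wins and the high-variance 1,6-biased die loses:

```python
# A back-of-envelope check, assuming every candidate die has true mean 3.5:
# by the CLT the sample mean is ~ Normal(3.5, sigma^2 / N), so
# P(|mean - 3.5| < eps) ≈ 2 * eps * sqrt(N) / (sigma * sqrt(2 * pi)).
# Lower variance => the observed mean of 3.5 is more likely.
import math

def variance(p):
    mean = sum(k * pk for k, pk in enumerate(p, start=1))
    return sum((k - mean) ** 2 * pk for k, pk in enumerate(p, start=1))

dice = {
    "fair":          [1/6] * 6,
    "biased to 3,4": [0.1, 0.1, 0.3, 0.3, 0.1, 0.1],
    "biased to 1,6": [0.3, 0.1, 0.1, 0.1, 0.1, 0.3],
}

N, eps = 10**9, 1e-5
for name, p in dice.items():
    sigma = math.sqrt(variance(p))
    prob = 2 * eps * math.sqrt(N) / (sigma * math.sqrt(2 * math.pi))
    print(f"{name}: sigma^2 = {variance(p):.2f}, P(|mean - 3.5| < eps) ≈ {prob:.3f}")
```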
Ahk, brainfart, it DOES depend on accuracy. I was thinking of it as so heavily biased that the other results don’t come up, and having perfect accuracy (rather than rounded to: what?)
Sorry, please vote down my previous post slightly (negative reinforcement for reacting too fast)
Hopefully I’ll find information about the rounding in the paper.