I’ll work in the easier case, one dimension down. Say we have a die which rolls a 1, 2, or 3, and we know it averages to 5⁄2.
Then {x in R^3 : x1+x2+x3=1, xi>=0 for all i} is an equilateral triangle, on which we put a uniform distribution. The points where the mean roll is 5⁄2 then lie on the straight line segment from (1/4,0,3/4) to (0,1/2,1/2). By some kind of linearity argument, the average over this segment (with the uniform weighting inherited from our uniform prior) is just the midpoint of (1/4,0,3/4) and (0,1/2,1/2). This gives (1/8,2/8,5/8).
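A quick numerical check of that midpoint claim, as a rough Monte Carlo sketch (the sample size and band width are arbitrary choices): sample uniformly from the triangle, keep the samples whose mean roll lies in a thin band around 5⁄2, and average them.

```python
import numpy as np

rng = np.random.default_rng(0)

# A flat Dirichlet(1,1,1) is the uniform distribution on the triangle
# {x : x1 + x2 + x3 = 1, xi >= 0}.
x = rng.dirichlet([1.0, 1.0, 1.0], size=2_000_000)

# Mean roll of the die described by each sampled distribution.
mean_roll = x @ np.array([1.0, 2.0, 3.0])

# Keep samples in a thin band around the constraint "mean roll = 5/2" and average them.
band = np.abs(mean_roll - 2.5) < 0.005
print(x[band].mean(axis=0))  # comes out near (1/8, 2/8, 5/8) = (0.125, 0.25, 0.625)
```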
On the other hand, we know that maxent gives a geometric sequence. But (1/8,2/8,5/8) isn’t geometric.
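And the maxent side of that comparison, as a companion sketch: under the mean-5⁄2 constraint the maxent distribution has the form p_i ∝ r^i, and the constraint pins down r.

```python
import numpy as np

# Maxent on {1, 2, 3} with a fixed mean gives p_i proportional to r**i for some r > 0.
# Writing p = (1, r, r**2)/Z, the constraint (1 + 2r + 3r**2)/(1 + r + r**2) = 5/2
# rearranges to r**2 - r - 3 = 0; take the positive root.
r = (1 + np.sqrt(13)) / 2

p_maxent = np.array([1.0, r, r**2])
p_maxent /= p_maxent.sum()
p_midpoint = np.array([1/8, 2/8, 5/8])
outcomes = np.array([1.0, 2.0, 3.0])

print(p_maxent, p_maxent @ outcomes)      # about (0.116, 0.268, 0.616), mean 2.5
print(p_midpoint, p_midpoint @ outcomes)  # (0.125, 0.250, 0.625), mean 2.5
# Consecutive ratios: equal for maxent (geometric), 2.0 and 2.5 for the midpoint (not geometric).
print(p_maxent[1:] / p_maxent[:-1], p_midpoint[1:] / p_midpoint[:-1])
```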
This may help. Abstract:
The method of maximum entropy has been very successful but there are cases where it has either failed or led to paradoxes that have cast doubt on its general legitimacy. My more optimistic assessment is that such failures and paradoxes provide us with valuable learning opportunities to sharpen our skills in the proper way to deploy entropic methods. The central theme of this paper revolves around the different ways in which constraints are used to capture the information that is relevant to a problem. This leads us to focus on four epistemically different types of constraints. I propose that the failure to recognize the distinctions between them is a prime source of errors. I explicitly discuss two examples. One concerns the dangers involved in replacing expected values with sample averages. The other revolves around misunderstanding ignorance. I discuss the Friedman-Shimony paradox as it is manifested in the three-sided die problem and also in its original thermodynamic formulation.
Thanks, that’s interesting. But if we know that the expected roll is 2, then the distribution must lie somewhere on the straight line between (1/2,0,1/2) and (0,1,0). This doesn’t mean we should average those to claim that the correct distribution given that information is (1/4,1/2,1/4), rather than the uniform distribution!
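The same kind of Monte Carlo sketch as above (again with an arbitrary band width) makes the tension concrete for the mean-2 case: conditioning the uniform prior on the constraint gives roughly the midpoint, while maxent with a mean of 2 gives the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.dirichlet([1.0, 1.0, 1.0], size=2_000_000)
mean_roll = x @ np.array([1.0, 2.0, 3.0])

# Condition the uniform prior on "mean roll = 2" instead of 5/2.
band = np.abs(mean_roll - 2.0) < 0.005
print(x[band].mean(axis=0))  # near the midpoint (1/4, 1/2, 1/4)
# Maxent with mean 2: p proportional to r**i with (1 + 2r + 3r**2)/(1 + r + r**2) = 2,
# i.e. r = 1, which is the uniform (1/3, 1/3, 1/3).
```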
I’ll think about this some more—Cyan’s link also goes into the problem a bit.
I know a handful of people who have built, or are building, PhDs around scoring approximation rules by how they handle distributions sampled from all possible distributions that satisfy some characteristics. The impression I get is that the rabbit hole is pretty deep; I haven’t read through it all, but here are some places to start: Montiel: [1][2], Hammond: [1][2].
This doesn’t mean we should average those to claim that the correct distribution given that information is (1/4,1/2,1/4), rather than the uniform distribution!
It seems to me that P1=(1/4,1/2,1/4) is “more robust” than P2=(1/3,1/3,1/3) in some way: suppose you remove y from x1 and add it proportionally to x2 and x3. The resulting mean would be closer to 2 for P1 than for P2, especially if y is a fraction of x1 rather than a flat amount.
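Here is one way to make that concrete, under one reading of “add it proportionally” (split y between x2 and x3 in proportion to their current weights); the values of y and f below are arbitrary, and the function is just a sketch of that perturbation.

```python
import numpy as np

OUTCOMES = np.array([1.0, 2.0, 3.0])

def perturbed_mean(p, y):
    """Move probability y from outcome 1 to outcomes 2 and 3,
    split in proportion to their current weights; return the new mean roll."""
    p = np.asarray(p, dtype=float)
    shift = y * p[1:] / p[1:].sum()
    q = np.array([p[0] - y, p[1] + shift[0], p[2] + shift[1]])
    return q @ OUTCOMES

P1 = (1/4, 1/2, 1/4)
P2 = (1/3, 1/3, 1/3)

y = 0.06  # a flat amount removed from x1
print(perturbed_mean(P1, y) - 2, perturbed_mean(P2, y) - 2)  # ~0.08 vs ~0.09

f = 0.2   # the same fraction of each distribution's own x1
print(perturbed_mean(P1, f * P1[0]) - 2, perturbed_mean(P2, f * P2[0]) - 2)  # ~0.067 vs ~0.100
```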
But it also seems to me like having a uniform distribution across all possible distributions is kind of silly. Do I really think that (1/10,4/5,1/10) is just as likely as (1/3,1/3,1/3)? I suspect it’s possible to have a prior which results in maxent posteriors, but it might be the case that the prior depends on what sort of update you provide (i.e. it only works when you know the variance, and not when you know the mean) and it might not exist for some updates.
Well, “a uniform distribution across possible distributions” is kinda nonsense. There is a single correct distribution for our starting information, which is (1/3,1/3,1/3), the “distribution across possible distributions” is just a delta function there.
Any non-delta “distribution over distributions” is laden with some model of what’s going on in the die, and is a distribution over parts of that model. Maybe there’s some subtle effect of singling out the complete, uniform model rather than integrating over some ensemble.
There is a single correct distribution for our starting information, which is (1/3,1/3,1/3), the “distribution across possible distributions” is just a delta function there.
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
(This will have effects, and they will not be subtle.)
Any non-delta “distribution over distributions” is laden with some model of what’s going on in the die, and is a distribution over parts of that model.
One of the classic examples of this is three interpretations of “randomly select a point from a circle.” You could do this by selecting an angle for a radius uniformly, then selecting a point on that radius uniformly along its length. Or you could do those two steps, and then select a point uniformly at random along the associated chord. Or you could select x and y uniformly at random in a square bounding the circle, and reject any point outside the circle. Only the last one will make all areas in the circle equally likely: the first method will make areas near the center more likely, and the second method will make areas near the edge more likely (if I remember correctly).
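A rough simulation of the three constructions (a sketch; in particular, the chord step is one reading of the second method) gives a quick way to check which are area-uniform: for a uniform point in the disc, the probability of landing within half the radius of the centre is exactly 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 1_000_000, 1.0

# Method 1: uniform angle, then a point uniformly along that radius.
theta = rng.uniform(0.0, 2.0 * np.pi, N)
r = rng.uniform(0.0, R, N)
m1 = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Method 2 (as read here): as above, then slide uniformly along the chord
# through that point, perpendicular to the chosen radius.
t = rng.uniform(-np.sqrt(R**2 - r**2), np.sqrt(R**2 - r**2))
m2 = np.column_stack([r * np.cos(theta) - t * np.sin(theta),
                      r * np.sin(theta) + t * np.cos(theta)])

# Method 3: rejection-sample uniformly from the bounding square.
xy = rng.uniform(-R, R, size=(2 * N, 2))
m3 = xy[(xy**2).sum(axis=1) <= R**2][:N]

# Area-uniform gives P(|point| < R/2) = 1/4; method 1 gives 1/2.
for name, pts in [("radius", m1), ("chord", m2), ("rejection", m3)]:
    frac = np.mean((pts**2).sum(axis=1) < (R / 2)**2)
    print(f"{name:9s} P(|x| < R/2) = {frac:.3f}")
```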
But I think that it generally is possible to reach consensus on what criterion you want (such as “pick a method such that any area of equal size has equal probability of containing the point you select”), and then it’s obvious what sort of method you want to use. (There’s a non-rejection-sampling way to get the equal-area method for the circle, by the way.) And so you probably need to be clever about how you parameterize your distributions, and what priors you put on those parameters, and eventually you do have hyperparameters that functionally have no uncertainty. (This is, for example, seeing a uniform as a beta(1,1), where you don’t have a distribution on the 1s.) But I think this is a reasonable way to go about things.
One of the classic examples of this is three interpretations of “randomly select a point from a circle.”
In a separate comment, Kurros worries about cases with “no preferred parameterisation of the problem”. I have the same worry as both of you, I think. I guess I’m less optimistic about the resolution. The parameterization seems like an empirical rabbit that Jaynes and other descendants of the Principle of Insufficient Reason are trying to pull out of an a priori hat. (See also Seidenfeld (.pdf), section 3, on re-partitioning the sample space.)
I’d appreciate it if someone could assuage—or aggravate—this concern. Preferably without presuming quite as much probability and statistics knowledge as Seidenfeld does (that one went somewhat over my head, toward the end).
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
I haven’t been able to follow this whole thread of conversation, but I think it’s pretty clear you’re talking about different things here.
Obviously, the long-run frequency distribution of the die can be many different things. One of them, (1/3, 1⁄3, 1⁄3), represents fairness, and is just one among many possibilities.
Equally obviously, the probability distribution that represents rational expectations about the first flip is only one thing. Manfred claims that it’s (1/3, 1⁄3, 1⁄3), which doesn’t represent fairness. It could equally well represent being certain that it’s biased to land on only one side every time, but you have no idea which side.
I think it’s pretty clear you’re talking about different things here.
I thought so too, which is why I asked him what he thought a delta function in the distribution space meant.
One of them, (1/3, 1⁄3, 1⁄3), represents fairness, and is just one among many possibilities.
Right; but putting a delta function there means you’re infinitely certain that’s what it is, because you give probability 0 to all other possibilities.
It could equally well represent being certain that it’s biased to land on only one side every time, but you have no idea which side.
Knowing that the die is completely biased, but not which side it is biased towards, would be represented by three delta functions, at (1,0,0), (0,1,0), and (0,0,1), each with a coefficient of (1/3). This is very different from the uniform case and the delta at (1/3,1/3,1/3) case, as you can see by calculating the posterior distribution for observing that the die rolled a 1.
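A sketch of that calculation (standard Bayes updating, with a Dirichlet(1,1,1) standing in for the uniform-over-the-triangle case; the function name is just for this illustration):

```python
import numpy as np

roll = 0  # we observe the die rolling a 1 (0-indexed)

def predictive_after_roll(thetas, weights, roll):
    """Update a discrete prior over candidate die distributions on one observed
    roll, then return the predictive distribution for the next roll."""
    post = weights * thetas[:, roll]   # prior weight times likelihood of the roll
    post /= post.sum()
    return post @ thetas

# Delta at the fair die: the posterior is unchanged, so the predictive stays (1/3, 1/3, 1/3).
print(predictive_after_roll(np.array([[1/3, 1/3, 1/3]]), np.array([1.0]), roll))

# Three corner deltas, weight 1/3 each ("completely biased, unknown side"):
# seeing a 1 rules out the other two corners, so the predictive jumps to (1, 0, 0).
print(predictive_after_roll(np.eye(3), np.full(3, 1/3), roll))

# Uniform over the triangle = Dirichlet(1,1,1): conjugacy gives a Dirichlet(2,1,1)
# posterior, whose predictive is its normalized mean, (1/2, 1/4, 1/4).
alpha = np.array([1.0, 1.0, 1.0])
alpha[roll] += 1.0
print(alpha / alpha.sum())
```

The three predictive distributions for the next roll come out as (1/3,1/3,1/3), (1,0,0), and (1/2,1/4,1/4), which is presumably the “very different” being pointed at here.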
okay, and you were just trying to make sure that Manfred knows that all this probability-of-distributions speech you’re speaking isn’t, as he seems to think, about the degree-of-belief-in-my-current-state-of-ignorance distribution for the first roll. Gotcha.
Okay… but do we agree that the degree-of-belief distribution for the first roll is (1/3, 1⁄3, 1⁄3), whether it’s a fair die or a die that’s completely biased in an unknown way?
Because I’m pretty sure that’s what Manfred’s talking about when he says
There is a single correct distribution for our starting information, which is (1/3,1/3,1/3),
and I think him going on to say
the “distribution across possible distributions” is just a delta function there.
was a mistake, because you were talking about different things.
EDIT:
I thought so too, which is why I asked him what he thought a delta function in the distribution space meant.
Ah. Yes. Okay. I am literally saying only things that you know, aren’t I. My bad.
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
It’s not about whether the die is fair—my state of information is fair. Of that it is okay to be certain. Also, I think I figured it out—see my recent reply to Oscar’s parent comment.
Wanna check? :)