If we have additional knowledge that the average roll of our die is 3, then we want to maximize -P(1)·Log(P(1)) - P(2)·Log(P(2)) - P(3)·Log(P(3)) - P(4)·Log(P(4)), given that the sum is 1 and the average is 3. We can either plug in the constraints and set partial derivatives to zero, or we can use a maximization technique like Lagrange multipliers.
I’ve never been able to understand this.
Surely the correct course of action in this situation is to have a prior for the possible biases of the die, say the uniform prior on {x in R^4 : x1+x2+x3+x4=1, xi>=0 for all i}, and then update Bayesianly by restricting to the subset where the average is 3. Then to find the distribution for the outcomes of the die we integrate over this.
I’m pretty sure this doesn’t give the same distribution as maxent, and I can’t think of a prior that would. (I think my suggested prior gives the “straight lines” distribution that you wanted!)
So when is each of these procedures appropriate? I agree that maxent is a good way to assign priors, but I think that when you have data you should use it by updating, rather than by remaking your prior.
I don’t think there’s anything that says a maximum entropy prior is what you get if you construct a maximum entropy prior for a weaker subset of assumptions, and then update based on the complement.
EDIT: Jaynes elaborates on the relationship between Bayes and maximum entropy priors here (warning, pdf).
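For reference, the constrained maximization in the quoted passage can be sketched numerically. This is a minimal sketch, assuming a four-sided die with faces 1 to 4. Setting the partial derivatives of the Lagrangian to zero gives log P(k) = const - μ·k, so P(k) is proportional to r^k for some ratio r; the names `maxent_die`, `r` and `μ` are mine, not the post's. The code just bisects on r until the mean comes out right.

```python
def maxent_die(num_faces, target_mean, iterations=200):
    """Max-entropy distribution on faces 1..num_faces with a given mean.

    Assumes the Lagrange-multiplier form P(k) proportional to r**k and
    finds r by bisection (the mean is increasing in r).
    """
    faces = range(1, num_faces + 1)

    def dist(r):
        weights = [r ** k for k in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(r):
        return sum(k * p for k, p in zip(faces, dist(r)))

    lo, hi = 1e-6, 1e6
    for _ in range(iterations):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since r is a ratio
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo * hi) ** 0.5)


if __name__ == "__main__":
    p = maxent_die(4, 3.0)
    print([round(x, 4) for x in p])
    print("ratios:", [round(p[i + 1] / p[i], 4) for i in range(3)])
```

For four faces and mean 3 the condition on the ratio reduces to r^3 - r - 2 = 0, so the successive ratios printed above should all agree with the real root of that cubic.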
Okay, I have an answer for you.
In doing the Bayesian updating method, you assumed that the die has some weights, and that the different possible values of those weights are events in event-space. This assumption is a very good one for a physical die, and the nature of the assumption is most obvious from the Kolmogorov and Savage perspectives.
Then, when translating the information that the expected roll was 5⁄2, you translated it as “the sum of weight 1 + 2 * weight 2 + 3 * weight 3 is equal to 5⁄2.” (Note that this is not necessary! If you’re symmetrically uncertain about the weights, the expected roll can still be 5⁄2. Frequentist intuitions are so sneaky :P )
What does the maximum entropy principle say if we give it that same information? The exact same answer you got! It maximizes entropy over those different possibilities in event-space, and the constraint that the weighted sum of the weights is 5⁄2 is interpreted in just the way you’d expect, leaving a straight line of possibilities in event-space with equal weights. Thus, maxent gives the same answer as Bayes’ theorem for this question, and it certainly seems like it did so given the same information you used for Bayes’ theorem.
Since it didn’t give the same answer before, this means we’re solving a different set of equations. Different equations means different information.
The state of information that I use in the post is different because we have no knowledge that the probabilities come from some physical process with different weights. No physical events at all are entangled with the probabilities. It’s obvious why this is unintuitive: any die has some physical weights underlying it. So calling our unknown number “the roll of a die” is actually highly misleading. My bad on that one; it looks like christopherj’s concerns about the example being unrealistic were totally legit.
However, that doesn’t mean that we’ll never see our maximum entropy result in the physical world. Suppose that I started out not knowing that the expected roll of the die was 5⁄2. And then someone offered to repeat not just “rolling the die,” but experiments with equivalent states of knowledge, many times. After 1000 repeats of experiments with the same state of knowledge, if the average roll is really close to 5⁄2, they’ll stop; if it isn’t, they’ll try again until it is.
Since the probability of each outcome given my state of knowledge is 1⁄3, I expect a repeat of many experiments with the same state of knowledge to be like rolling a fair die many times, then only keeping ensembles with average 5⁄2. Then, if I look at this ensemble that represents your state of knowledge except for happening to have average roll 5⁄2, I will see a maximum entropy distribution of rolls. (proof left as an exercise :P ) This physical process encapsulates the information stated in the post, in a way that rolling a die whose weights are different physical events does not.
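The exercise can be checked without simulating anything. A minimal sketch, assuming a three-sided die and reading “really close to 5⁄2” as “exactly 5/2” (both are my simplifications): every sequence of fair rolls is equally likely, so the kept ensembles are weighted by the multinomial count of sequences with each (n1, n2, n3). The function names are hypothetical.

```python
from math import comb, sqrt


def conditioned_frequencies(n_rolls=1000):
    """Expected face frequencies over n_rolls fair three-sided rolls,
    conditioned on the sample average being exactly 5/2."""
    if n_rolls % 2:
        raise ValueError("n_rolls must be even for an average of exactly 5/2")
    # average = 2 + (n3 - n1) / n_rolls, so average 5/2 means n3 - n1 = n_rolls / 2
    excess = n_rolls // 2
    total = 0
    weighted = [0, 0, 0]
    for n1 in range((n_rolls - excess) // 2 + 1):
        n3 = n1 + excess
        n2 = n_rolls - n1 - n3
        # number of roll sequences with these counts (multinomial coefficient)
        ways = comb(n_rolls, n1) * comb(n_rolls - n1, n2)
        total += ways
        for i, n in enumerate((n1, n2, n3)):
            weighted[i] += ways * n
    return [w / (total * n_rolls) for w in weighted]


def maxent_three_sided_mean_five_halves():
    """P(k) proportional to r**k with mean 5/2, i.e. r*r - r - 3 = 0."""
    r = (1 + sqrt(13)) / 2
    z = 1 + r + r * r
    return [1 / z, r / z, r * r / z]


if __name__ == "__main__":
    print("conditioned:", [round(x, 4) for x in conditioned_frequencies()])
    print("maxent:     ", [round(x, 4) for x in maxent_three_sided_mean_five_halves()])
```

The two printed rows should agree to within a fraction of a percent at 1000 rolls, with the gap shrinking as the number of rolls grows.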
I’ll work in the easier case 1 dimension down. Say we have a die which rolls a 1, 2 or a 3, and we know it averages to 5⁄2.
Then {x in R^3 : x1+x2+x3=1, xi>=0 for all i} is an equilateral triangle, which we put a uniform distribution on. Then the points where the mean roll is 5⁄2 lie on a straight line from (1/4,0,3/4) to (0,1/2,1/2). By some kind of linearity argument the averages over this line (with the uniform weighting from our uniform prior) are just the average of (1/4,0,3/4) and (0,1/2,1/2). This gives (1/8,2/8,5/8).
On the other hand we know that maxent gives a geometric sequence. But (1/8,2/8,5/8) isn’t geometric.
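The arithmetic above is easy to verify. A minimal sketch using exact fractions for the midpoint of the segment, and the closed form for the maxent answer (P(k) proportional to r^k with r^2 - r - 3 = 0, which is what the mean-5/2 constraint reduces to for three faces). The helper names are mine.

```python
from fractions import Fraction as F
from math import sqrt

# Endpoints of the segment {x1 + x2 + x3 = 1, x1 + 2*x2 + 3*x3 = 5/2, xi >= 0}
end_a = (F(1, 4), F(0), F(3, 4))
end_b = (F(0), F(1, 2), F(1, 2))

# A uniform weighting along a straight segment averages to its midpoint
midpoint = tuple((a + b) / 2 for a, b in zip(end_a, end_b))


def mean_roll(p):
    return sum((k + 1) * q for k, q in enumerate(p))


def is_geometric(p, tol=1e-12):
    """True if p2 / p1 == p3 / p2, written without division."""
    return abs(float(p[1]) ** 2 - float(p[0]) * float(p[2])) < tol


# Max-entropy distribution with mean 5/2
r = (1 + sqrt(13)) / 2
z = 1 + r + r * r
maxent = (1 / z, r / z, r * r / z)

if __name__ == "__main__":
    print("midpoint:", midpoint, "mean", mean_roll(midpoint), "geometric?", is_geometric(midpoint))
    print("maxent:  ", tuple(round(q, 4) for q in maxent), "geometric?", is_geometric(maxent))
```

Both distributions satisfy the mean constraint; only the maxent one has a constant ratio between successive faces.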
This may help. Abstract:
The method of maximum entropy has been very successful but there are cases where it has either failed or led to paradoxes that have cast doubt on its general legitimacy. My more optimistic assessment is that such failures and paradoxes provide us with valuable learning opportunities to sharpen our skills in the proper way to deploy entropic methods. The central theme of this paper revolves around the different ways in which constraints are used to capture the information that is relevant to a problem. This leads us to focus on four epistemically different types of constraints. I propose that the failure to recognize the distinctions between them is a prime source of errors. I explicitly discuss two examples. One concerns the dangers involved in replacing expected values with sample averages. The other revolves around misunderstanding ignorance. I discuss the Friedman-Shimony paradox as it is manifested in the three-sided die problem and also in its original thermodynamic formulation.
Thanks, that’s interesting. But if we know that the expected roll is 2, then that must lie somewhere on the straight line between (1/2,0,1/2) and (0,1,0). This doesn’t mean we should average those to claim that the correct distribution given that information is (1/4,1/2,1/4), rather than the uniform distribution!
I’ll think about this some more—Cyan’s link also goes into the problem a bit.
I know a handful of people who have built / are building PhDs on dealing with scoring approximation rules based on how they handle distributions that are sampled from all possible distributions that satisfy some characteristics. The impression I get is that the rabbit hole is pretty deep; I haven’t read through it all but here are some places to start: Montiel: [1][2], Hammond: [1][2].
This doesn’t mean we should average those to claim that the correct distribution given that information is (1/4,1/2,1/4), rather than the uniform distribution!
It seems to me that P1=(1/4,1/2,1/4) is “more robust” than P2=(1/3,1/3,1/3) in some way: suppose you remove y from x1 and add it proportionally to x2 and x3. The result would be closer to 2 for P1 than for P2, especially if y is a fraction of x1 rather than a flat amount.
But it also seems to me like having a uniform distribution across all possible distributions is kind of silly. Do I really think that (1/10,4/5,1/10) is just as likely as (1/3,1/3,1/3)? I suspect it’s possible to have a prior which results in maxent posteriors, but it might be the case that the prior depends on what sort of update you provide (i.e. it only works when you know the variance, and not when you know the mean) and it might not exist for some updates.
Well, “a uniform distribution across possible distributions” is kinda nonsense. There is a single correct distribution for our starting information, which is (1/3,1/3,1/3), the “distribution across possible distributions” is just a delta function there.
Any non-delta “distribution over distributions” is laden with some model of what’s going on in the die, and is a distribution over parts of that model. Maybe there’s some subtle effect of singling out the complete, uniform model rather than integrating over some ensemble.
There is a single correct distribution for our starting information, which is (1/3,1/3,1/3), the “distribution across possible distributions” is just a delta function there.
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
(This will have effects, and they will not be subtle.)
Any non-delta “distribution over distributions” is laden with some model of what’s going on in the die, and is a distribution over parts of that model.
One of the classic examples of this is three interpretations of “randomly select a point from a circle.” You could do this by selecting an angle for a radius uniformly, then selecting a point on that radius uniformly along its length. Or you could do those two steps, and then select a point along the associated chord uniformly at random. Or you could select x and y uniformly at random in a square bounding the circle, and reject any point outside the circle. Only the last one will make all areas in the circle equally likely; the first method will make areas near the center more likely, and the second method will make areas near the edge more likely (if I remember correctly).
But I think that it generally is possible to reach consensus on what criterion you want (such as “pick a method such that any area of equal size has equal probability of containing the point you select.”) and then it’s obvious what sort of method you want to use. (There’s a non-rejection sampling way to get the equal area method for the circle, by the way.) And so you probably need to be clever about how you parameterize your distributions, and what priors you put on those parameters, and eventually you do have hyperparameters that functionally have no uncertainty. (This is, for example, seeing a uniform as a beta(1/2,1/2), where you don’t have a distribution on the 1/2s.) But I think this is a reasonable way to go about things.
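To make the circle example concrete, here is a minimal sketch comparing the first method (uniform angle, then uniform distance along the radius), the rejection method, and one common non-rejection way to get equal areas (taking the square root of a uniform number as the distance). The chord method is left out because I am not sure of the intended construction. The function names, the unit radius and the “inner disc of radius 1/2” test are my choices: an equal-area method should put 1/4 of the points there.

```python
import math
import random


def radius_then_length(rng):
    """Uniform angle, then uniform distance along that radius."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    d = rng.random()
    return d * math.cos(theta), d * math.sin(theta)


def rejection(rng):
    """Uniform point in the bounding square, retried until inside the circle."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y


def sqrt_radius(rng):
    """Uniform angle, distance = sqrt(U): equal areas without rejection."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    d = math.sqrt(rng.random())
    return d * math.cos(theta), d * math.sin(theta)


def inner_fraction(sampler, n=100_000, seed=0):
    """Fraction of sampled points within distance 1/2 of the center."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if math.hypot(*sampler(rng)) < 0.5)
    return hits / n


if __name__ == "__main__":
    for sampler in (radius_then_length, rejection, sqrt_radius):
        print(sampler.__name__, round(inner_fraction(sampler), 3))
```

The inner disc has a quarter of the area, so the rejection and square-root samplers should land near 0.25, while the radius-then-length sampler lands near 0.5, which is the over-weighting of the center described above.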
One of the classic examples of this is three interpretations of “randomly select a point from a circle.”
In a separate comment, Kurros worries about cases with “no preferred parameterisation of the problem”. I have the same worry as both of you, I think. I guess I’m less optimistic about the resolution. The parameterization seems like an empirical rabbit that Jaynes and other descendants of the Principle of Insufficient Reason are trying to pull out of an a priori hat. (See also Seidenfeld (.pdf), section 3, on re-partitioning the sample space.)
I’d appreciate it if someone could assuage—or aggravate—this concern. Preferably without presuming quite as much probability and statistics knowledge as Seidenfeld does (that one went somewhat over my head, toward the end).
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
I haven’t been able to follow this whole thread of conversation, but I think it’s pretty clear you’re talking about different things here.
Obviously, the long-run frequency distribution of the die can be many different things. One of them, (1/3, 1/3, 1/3), represents fairness, and is just one among many possibilities.
Equally obviously, the probability distribution that represents rational expectations about the first roll is only one thing. Manfred claims that it’s (1/3, 1/3, 1/3), which doesn’t represent fairness. It could equally well represent being certain that it’s biased to land on only one side every time, but you have no idea which side.
I think it’s pretty clear you’re talking about different things here.
I thought so too, which is why I asked him what he thought a delta function in the distribution space meant.
One of them, (1/3, 1/3, 1/3), represents fairness, and is just one among many possibilities.
Right; but putting a delta function there means you’re infinitely certain that’s what it is, because you give probability 0 to all other possibilities.
It could equally well represent being certain that it’s biased to land on only one side every time, but you have no idea which side.
Knowing that the die is completely biased, but not which side it is biased towards, would be represented by three delta functions, at (1,0,0), (0,1,0), and (0,0,1), each with a coefficient of (1/3). This is very different from the uniform case and the delta at (1/3,1/3,1/3) case, as you can see by calculating the posterior distribution for observing that the die rolled a 1.
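The calculation suggested here is short enough to write out. A minimal sketch with exact fractions, assuming the “uniform case” means a Dirichlet(1,1,1) prior over the weights, whose predictive distribution after seeing counts is (1 + count_k) / (3 + n). Function names are mine, and face index 0 stands for a roll of 1.

```python
from fractions import Fraction as F


def predictive_after(observed_face, hypotheses):
    """Next-roll distribution after one observation.

    hypotheses: list of (prior_weight, face_probabilities), i.e. a prior
    made of delta functions at the listed points of the simplex.
    """
    posterior = [(w * p[observed_face], p) for w, p in hypotheses]
    norm = sum(w for w, _ in posterior)
    return tuple(sum(w * p[k] for w, p in posterior) / norm for k in range(3))


def dirichlet_predictive_after(observed_face, alpha=(1, 1, 1)):
    """Same thing for a Dirichlet prior; alpha=(1, 1, 1) is uniform on the simplex."""
    counts = list(alpha)
    counts[observed_face] += 1
    total = sum(counts)
    return tuple(F(c, total) for c in counts)


third = F(1, 3)
fair_delta = [(F(1), (third, third, third))]
corner_deltas = [
    (third, (F(1), F(0), F(0))),
    (third, (F(0), F(1), F(0))),
    (third, (F(0), F(0), F(1))),
]

if __name__ == "__main__":
    print("delta at fair:  ", predictive_after(0, fair_delta))
    print("three corners:  ", predictive_after(0, corner_deltas))
    print("uniform simplex:", dirichlet_predictive_after(0))
```

All three priors give (1/3, 1/3, 1/3) for the first roll, but after seeing a 1 they give (1/3, 1/3, 1/3), (1, 0, 0) and (1/2, 1/4, 1/4) respectively for the next one.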
Okay, and you were just trying to make sure that Manfred knows that all this probability-of-distributions talk isn’t, as he seems to think, about the degree-of-belief-in-my-current-state-of-ignorance distribution for the first roll. Gotcha.
Okay… but do we agree that the degree-of-belief distribution for the first roll is (1/3, 1/3, 1/3), whether it’s a fair die or a die that is completely biased in an unknown way?
Because I’m pretty sure that’s what Manfred’s talking about when he says
There is a single correct distribution for our starting information, which is (1/3,1/3,1/3),
and I think him going on to say
the “distribution across possible distributions” is just a delta function there.
was a mistake, because you were talking about different things.
EDIT:
I thought so too, which is why I asked him what he thought a delta function in the distribution space meant.
Ah. Yes. Okay. I am literally saying only things that you know, aren’t I. My bad.
Whoa, you think the only correct interpretation of “there’s a die that returns 1, 2, or 3” is to be absolutely certain that it’s fair? Or what do you think a delta function in the distribution space means?
It’s not about if the die is fair—my state of information is fair. Of that it is okay to be certain. Also, I think I figured it out—see my recent reply to Oscar’s parent comment.