Thank you so much for telling me about the A_p distribution! This is exactly what I have been looking for.
“Pending a better understanding of what that means, let us adopt a cautious notation that will avoid giving possibly wrong impressions. We are not claiming that P(Ap|E) is a ‘real probability’ in the sense that we have been using that term; it is only a number which is to obey the mathematical rules of probability theory. Perhaps its proper conceptual meaning will be clearer after getting a little experience using it. So let us refrain from using the prefix symbol p; to emphasize its more abstract nature, let us use the bare bracket symbol notation (Ap|E) to denote such quantities, and call it simply ‘the density for Ap, given E’.”—Page 554 of Professor Jaynes’ book
The idea of the A_p distribution not being a real probability distribution but obeying the mathematical rules of probability theory is far too nuanced and intricate for me to be able to understand.
I was reading an article on this site about the A_p distribution, Probability, knowledge, and meta-probability, and a commenter wrote:
“I think a much better approach is to assign models to the problem (e.g. “it’s a box that has 100 holes, 45 open and 65 plugged, the machine picks one hole, you get 2 coins if the hole is open and nothing if it’s plugged.”), and then have a probability distribution over models. This is better because [it] keeps probabilities assigned to facts about the world.
It’s true that probabilities-of-probabilities are just an abstraction of this (when used correctly), but I’ve found that people get confused really fast if you ask them to think in terms of probabilities-of-probabilities. (See every confused discussion of “what’s the standard deviation of the standard deviation?”)”
I would appreciate your thoughts on this. My current understanding of A_p distributions in light of this comment and in the context of coin flipping is this:
Ap is defined to be a proposition such that P(H|Ap,I) = p and P(T|Ap,I) = 1 − p, where H and T represent heads and tails and I represents the background information. This is similar to the definition Professor Jaynes gives on page 554 of his book.
Let D, the data, be HH.
Using this definition, the posterior is P(Ap|D,I) ∝ P(D|Ap,I)P(Ap|I).
Assuming the background information I is indifferent to the Ap’s:
P(Ap|D,I) ∝ P(D|Ap,I)
P(D|Ap,I) = P(H|Ap,I) * P(H|Ap,I) = p^2
argmax_{Ap} P(Ap|D,I) = argmax_{Ap} P(D|Ap,I) = A_1.0
Therefore, in the set of propositions {Ap : 0 ≤ p ≤ 1}, the most plausible proposition given our data is A_1.0. Each member of this set of propositions is called a model. The probability of heads given the most plausible model is 1.0.
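In case it helps, here is a minimal numerical sketch of the reasoning above. The discrete grid of p values and the uniform weights are illustrative assumptions of mine, not part of Jaynes’ construction:

```python
# Minimal sketch: posterior over a grid of candidate A_p propositions for D = HH.
import numpy as np

p = np.linspace(0.0, 1.0, 101)      # candidate values of p (illustrative grid)
likelihood = p ** 2                  # P(D | A_p, I) for D = HH
prior = np.ones_like(p)              # "indifferent" (uniform) weights over the A_p
posterior = likelihood * prior
posterior /= posterior.sum()         # normalise over the grid

print(p[np.argmax(posterior)])       # -> 1.0, i.e. the mode is A_1.0
```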
Is this a correct understanding?
I don’t think so. Like you, I don’t really understand this Ap stuff philosophically. But the step where you drop the prior P(Ap|I) to obtain P(Ap|D,I) ∝ P(D|Ap,I) is, I think, not warranted. Dropping the prior term outright like that… I don’t think there are many cases where that’s acceptable. Doing so does not reflect a state of low knowledge, but instead a state of pretty strong knowledge. To give intuition on what I mean:
Contrast with the prior that reflects the state of knowledge “All I know is that H is possible and T is possible”. This is closer to Jaynes’ example about whether there’s life on Mars. The prior that reflects that state of knowledge is Beta(1,1), which after two heads come up, becomes Beta(3, 1). The mean of Beta(3, 1) is 3⁄4 = 0.75. This is much less than the 1.0 you arrive at.
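For concreteness, a minimal sketch of that update (using scipy purely for illustration):

```python
# Beta(1,1) prior ("H is possible and T is possible") updated on two heads.
from scipy.stats import beta

prior_a, prior_b = 1, 1                 # Beta(1,1) prior
heads, tails = 2, 0                     # data D = HH
post = beta(prior_a + heads, prior_b + tails)   # posterior is Beta(3,1)

print(post.mean())                      # 0.75, the posterior mean
print((3 - 1) / (3 + 1 - 2))            # 1.0, the posterior mode: (a-1)/(a+b-2)
```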
A prior that gives 1.0 after the data H,H might be something like:
“This coin is very unfair in a well-known, specific way: It either always gives heads, always gives tails, or gives heads and tails alternating: ‘H,T,H,T...’.”
Under that prior, the data HH would give you a probability of near-1 that H is next. But that’s a prior that reflects definite, strong knowledge of the coin.
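To make that concrete, here is a toy version of the calculation under that three-model prior; the deterministic likelihoods and the equal prior weights are simplifying assumptions of mine:

```python
# Toy Bayesian update over three coin models, data D = HH.
models = {
    "always_H":    lambda seq: 1.0 if all(c == "H" for c in seq) else 0.0,
    "always_T":    lambda seq: 1.0 if all(c == "T" for c in seq) else 0.0,
    # Treated here as: likelihood 1 for any strictly alternating sequence (a simplification).
    "alternating": lambda seq: 1.0 if all(a != b for a, b in zip(seq, seq[1:])) else 0.0,
}

data = "HH"
prior = {m: 1 / 3 for m in models}                        # equal prior weight on each model
post = {m: prior[m] * models[m](data) for m in models}    # unnormalised posterior
total = sum(post.values())
post = {m: v / total for m, v in post.items()}

print(post)   # all the mass lands on "always_H", so P(next flip = H) = 1 in this toy version
```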
Maybe this argument changes given the nature of Ap, which again I don’t really understand. But whatever it is, I don’t think it’s valid to assume the prior away.
Ah, wait, I misunderstood. You’re interested in the mode, huh—that’s why you’re taking the argmax. In my Beta(3,1) example, the mode is also 1. So no problem there. I was focused on the mean in my previous comment. I still think dropping the prior is bad but now I’m not sure how to argue the point…
I dropped the prior for two reasons:
1. I assumed the background information to be indifferent to the A_p's.
2. We do not explicitly talk about the nature of the A_p's. Prof. Jaynes defines it as a proposition such that P(A|A_p, E) = p. In my example A_p is defined as a proposition such that P(H|A_p, I) = p. No matter what prior information we have, it is going to be indifferent to the A_p's by virtue of the fact that we don't know what A_p represents.
Is this justification valid?
Isn’t A_p the distribution over how often the coin will come up heads, or the probability of life on Mars? If so… there’s no way those things could be indifferent to the background information. A core tenet of the philosophy outlined in this book is that when you ignore prior information without good cause, things get wacky and fall apart. This is part of desideratum (iii) from chapter 2: “The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains.”
(Then Jaynes ignores information in later chapters because it doesn’t change the result… so this desideratum is easier said than done… but yeah)
“[…] A_p the distribution over how often the coin will come up heads […]”—I understood A_p to be a sort of distribution over models; we do not know/talk about the model itself, but we know that if a model A_p is true, then the probability of heads is equal to p by definition of A_p. Perhaps the model A_p is the proposition “the centre of mass of the coin is at p” or “the bias-weighting of the coin is p”, but we do not care as long as the resulting probability of heads is p. So how can the prior not be indifferent when we do not know the nature of each proposition A_p in a set of mutually exclusive and exhaustive propositions?
I can’t see anything wrong in what you’ve said there, but I still have to insist without good argument that dropping P(A_p|I) is incorrect. In my vague defense, consider the two A_p distributions drawn on p558, for the penny and for Mars. Those distributions are as different as they are because of the different prior information. If it was correct to drop the prior term a priori, I think those distributions would look the same?
You are right; dropping priors in the A_p distribution is probably not a general rule. Perhaps the propositions don’t always need to be interpretable for us to be able to impose priors? For example, people impose priors over the parameter space of a neural network, which is certainly not interpretable. But the topic of Bayesian neural networks is beyond me.
It seems like in practice, when there’s a lot of data, people like Jaynes and Gelman are happy to assign low-information (or “uninformative”) priors, knowing that with a lot of data the prior ends up getting washed away anyway. So just slapping a uniform prior down might be OK in a lot of real-world situations. This is I think pretty different than just dropping the prior completely, but gets the same job done.
Now I’m doubting myself >_> is it pretty different?? Anyone lurking reading this who knows whether uniform prior is very different than just dropping the prior term?
I believe it is the same thing. A uniform prior means your prior is a constant function, i.e. P(A_p|I) = x where x is a real number with the usual caveats. So if you have a uniform prior, you can drop it (from a safe height of course). But perhaps the more seasoned Bayesians disagree? (where are they when you need them?)
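Here is a quick numerical check of that claim on a grid of p values (the grid itself is just an illustrative device):

```python
# A constant prior cancels in the normalisation, so it matches omitting the prior term.
import numpy as np

p = np.linspace(0.0, 1.0, 101)
likelihood = p ** 2                          # P(D | A_p, I) for D = HH, as before

x = 0.37                                     # any positive constant works
with_uniform = likelihood * x
with_uniform /= with_uniform.sum()

dropped = likelihood / likelihood.sum()

print(np.allclose(with_uniform, dropped))    # True
```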
Shoot! You’re right! I think I was wrong this whole time on the impact of dropping the prior term. Cuz data term * prior term is like multiplying the distributions, and dropping the prior term is like multiplying the data distribution by the uniform one. Thanks for sticking with me :)
No worries :) Thanks a lot for your help! Much appreciated.
It’s amazing how complex a simple coin flipping problem can get when we approach it from our paradigm of objective Bayesianism. Professor Jaynes remarks on this after deriving the principle of indifference: “At this point, depending on your personality and background in this subject, you will be either greatly impressed or greatly disappointed by the result (2.91).”—page 40
A frequentist would have “solved” this problem rather easily. Personally, I would trade simplicity for coherence any day of the week...
I looooove that coin flip section! Cheers