I don't think this concept is useful.

What you are showing with the coin is a hierarchical model over multiple coin flips, and it doesn't require any new probability concepts. Let $F_i$ be the flips. All you need in life is the distribution $P(F_1, F_2, \dots)$. You can decide to restrict yourself to distributions of the form $\int_0^1 dp_{\text{coin}}\, P(F, G \mid p_{\text{coin}})\, p(p_{\text{coin}})$. In practice, you start out thinking about $p_{\text{coin}}$ as a variable atop all the $F_i$ in a graph, and then think in terms of $P(F, G \mid p_{\text{coin}})$ and $p(p_{\text{coin}})$ separately, because that's more intuitive. This is the standard way of doing things. All you do with $A_p$ is the same; there is no point at which you do something different in practice, even if you ascribe additional properties to $A_p$ in words.
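(A minimal sketch of what that standard hierarchical treatment looks like in practice; the grid approximation and names like `p_grid` are purely illustrative, not anything from the post.)

```python
# Illustrative sketch of the standard hierarchical coin model: p_coin is a
# latent variable sitting above the flips F_i, and the joint P(F_1, ..., F_n)
# is recovered by integrating p_coin out (approximated here on a grid).
import numpy as np

p_grid = np.linspace(0.001, 0.999, 999)      # candidate values of p_coin
prior = np.ones_like(p_grid) / p_grid.size   # uniform prior p(p_coin)

def likelihood(flips):
    """P(F_1, ..., F_n | p_coin) for every p_coin on the grid (1 = heads)."""
    return np.prod([p_grid if f else 1 - p_grid for f in flips], axis=0)

def joint_prob(flips):
    """P(F_1, ..., F_n) = sum over the grid of P(F | p_coin) p(p_coin)."""
    return np.sum(likelihood(flips) * prior)

def posterior(flips):
    """p(p_coin | F_1, ..., F_n) by Bayes' rule on the grid."""
    post = likelihood(flips) * prior
    return post / post.sum()

flips = [1, 1, 0, 1]                          # an observed sequence
post = posterior(flips)
print(joint_prob(flips))                      # marginal probability of the data
print(np.sum(p_grid * post))                  # P(next flip is heads | data)
```

Conditioning on more flips just reshapes the weights over $p_{\text{coin}}$; nothing beyond ordinary probabilistic modeling is involved.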
A concept like “the probability of me assigning a certain probability” makes sense but I don’t think Jaynes actually did anything like that for real. Here on lesswrong I guess @abramdemski knows about stuff like that.
--PS: I think Jaynes was great in his way of approaching the meaning and intuition of statistics, but the book is bad as a statistics textbook. It’s literally the half-complete posthumous publication of a rambling contrarian physicist, and it shows. So I would not trust any specific statistical thing he does. Taking the general vibe and ideas is good, but when you ask about a specific thing “why is nobody doing this?” it’s most likely because it’s outdated or wrong.
Thanks for the feedback.

What you are showing with the coin is a hierarchical model over multiple coin flips, and it doesn't require any new probability concepts. Let $F_i$ be the flips. All you need in life is the distribution $P(F_1, F_2, \dots)$. You can decide to restrict yourself to distributions of the form $\int_0^1 dp_{\text{coin}}\, P(F, G \mid p_{\text{coin}})\, p(p_{\text{coin}})$. In practice, you start out thinking about $p_{\text{coin}}$ as a variable atop all the $F_i$ in a graph, and then think in terms of $P(F, G \mid p_{\text{coin}})$ and $p(p_{\text{coin}})$ separately, because that's more intuitive. This is the standard way of doing things. All you do with $A_p$ is the same; there is no point at which you do something different in practice, even if you ascribe additional properties to $A_p$ in words.
This isn’t emphasized by Jaynes (though I believe it’s mentioned at the very end of the chapter), but the $A_p$ distribution isn’t new as a formal idea in probability theory. It’s based on De Finetti’s representation theorem. The theorem concerns exchangeable sequences of random variables.
A sequence of random variables $\{X_i\}$ is exchangeable if the joint distribution of any finite subsequence is invariant under permutations. A sequence of coin flips is the canonical example. Note that exchangeability does not imply independence! If I have a perfectly biased coin (one that always lands on the same face) but I don’t know which face, then all the random variables are perfectly dependent on each other (they must all take the same value).
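(A toy numerical check of that example, assuming the “perfectly biased” coin is all-heads or all-tails with equal probability; the code is my own illustration.)

```python
# Toy check of "exchangeable but not independent" (my own setup): the coin
# always lands the same way, but which way is unknown -- bias b is 0 or 1,
# each with probability 1/2.
def p_flip(value, bias):
    """P(flip = value | bias), with value and bias both in {0, 1}."""
    return bias if value == 1 else 1 - bias

def joint(x1, x2):
    """P(X1 = x1, X2 = x2), averaging over the unknown bias."""
    return 0.5 * sum(p_flip(x1, b) * p_flip(x2, b) for b in (0, 1))

def marginal(x):
    return sum(joint(x, y) for y in (0, 1))

print(joint(1, 0))                  # 0.0: mismatched flips are impossible
print(marginal(1) * marginal(0))    # 0.25: so the flips are not independent
print(joint(1, 0) == joint(0, 1))   # True: but the joint is permutation-invariant
```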
De Finetti’s representation theorem says that any infinite exchangeable sequence of random variables can be represented as an integral (a mixture) over independent and identically distributed sequences, in this case i.i.d. Bernoulli coin flips. Or in other words, the variables in the sequence are conditionally independent given a latent variable (the hidden bias of the coin), and whatever dependence they have on each other is solely due to their mutual relationship to that latent variable.
$$P(X_1 = x_1, \dots, X_n = x_n) = \int_0^1 \theta^k (1-\theta)^{n-k}\, dF(\theta), \qquad k = x_1 + \dots + x_n$$
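(As a quick sanity check of the representation, here is a sketch that numerically evaluates the mixture integral for a Beta mixing distribution $F$, chosen arbitrarily for illustration, and confirms that reordering a sequence doesn’t change its probability.)

```python
# Sketch of a numerical check (the Beta(2, 5) mixing distribution F is an
# arbitrary choice for illustration). The mixture form assigns the same
# probability to every ordering with the same number of heads, so the
# resulting joint distribution is exchangeable.
from scipy import integrate
from scipy.stats import beta

mixing = beta(2, 5)   # plays the role of F(theta)

def seq_prob(xs):
    """P(X_1 = x_1, ..., X_n = x_n) = integral of theta^k (1-theta)^(n-k) dF(theta)."""
    k, n = sum(xs), len(xs)
    value, _ = integrate.quad(lambda t: t**k * (1 - t)**(n - k) * mixing.pdf(t), 0, 1)
    return value

print(seq_prob([1, 1, 0, 0]))   # same value...
print(seq_prob([0, 1, 0, 1]))   # ...for any reordering with two heads
```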
You are correct that all relevant information is contained in the joint distribution $P(F_1, F_2, \dots)$. And while I have no deep familiarity with Bayesian hierarchical modeling, I believe your claim that the decomposition $\int_0^1 dp_{\text{coin}}\, P(F, G \mid p_{\text{coin}})\, p(p_{\text{coin}})$ is standard in Bayesian modeling.
But I think the point is that the $A_p$ distribution is a useful conceptual tool when considering distributions governed by a time-invariant generating process. A lot of real-world processes don’t fit that description, but many do.
A concept like “the probability of me assigning a certain probability” makes sense but I don’t think Jaynes actually did anything like that for real. Here on lesswrong I guess @abramdemski knows about stuff like that.
Yes, this is correct. The part about “the probability of assigning a probability” and the part about interpreting the proposition $A_p$ as a shorthand for an infinite collection of evidences are my own interpretations of what the $A_p$ distribution “really” means. Specifically, the part about the “probability that you will assign the probability in the infinite future” is loosely inspired by the idea of Cauchy surfaces from, e.g., general relativity (or any physical theory with a built-in causal structure). In general relativity, the idea is that if you have boundary conditions specified on a Cauchy surface, then you can time-evolve to solve for the distribution of matter and energy for all time. In something like quantum field theory, a principled choice for the Cauchy surface would be the infinite past (this conceptual idea shows up when making sense of the vacuum in QFT). But I think in probability theory, it’s more useful conceptually to take your Cauchy surface of probabilities to be what you expect them to be in the “infinite future.” This is how I make sense of the $A_p$ distribution.
And now that you mention it, this blog post was totally inspired by reading the first couple chapters of “Logical Inductors” (though the inspiration wasn’t conscious on my part).
--PS: I think Jaynes was great in his way of approaching the meaning and intuition of statistics, but the book is bad as a statistics textbook. It’s literally the half-complete posthumous publication of a rambling contrarian physicist, and it shows. So I would not trust any specific statistical thing he does. Taking the general vibe and ideas is good, but when you ask about a specific thing “why is nobody doing this?” it’s most likely because it’s outdated or wrong.
I’m not a statistician, so I will defer to your expertise that the book is bad as a statistics textbook (to be honest, I never thought of it as one). I think the strongest parts of the book are where he derives statistical mechanics from the maximum entropy principle and where he generalizes the principle of indifference to more general group invariances/symmetries. As far as I’m aware, my opinion about which of Jaynes’ ideas are his best matches the consensus.
I suspect the reason I like the $A_p$ distribution is that I come from a physics background, so his reformulation of standard ideas in Bayesian modeling makes some amount of sense to me even if it comes across as weird and crankish to statisticians.
I still don’t understand your “infinite limit” idea. If in your post I drop the following paragraph:
A way to think about the proposition $A_p$ is as a kind of limit. When we have little evidence, each bit of evidence has a potentially big impact on our overall probability of a given proposition. But each incremental bit of evidence shifts our beliefs less and less. The proposition $A_p$ can be thought of as a shorthand for an infinite collection of evidences $F_i$ where the collection leads to an overall probability of $p$ being assigned to $A$. This would perhaps explain why the $A_p$ proposition is so strange: we have well-developed intuitions for how “finite” propositions interact, but the characteristic absorbing property of the $A_p$ distribution is more reminiscent of how an infinite object interacts with finite objects.
the rest is standard hierarchical modeling. So even if your words here are suggestive, I don’t understand how to actually connect the idea to calculations/concrete things, even at a vague indicative level. So I guess I’m not actually understanding it.
For example, you could show me a conceptual example where you do something with this which is not standard probabilistic modeling. Or maybe it’s all standard but you get to a solution faster. Or anything where applying the idea produces something different; then I would see how it works.
(Note: I don’t know if you noticed, but De Finetti’s theorem applies only to infinite sequences, not finite ones; people forget this. It is not relevant to the discussion, though.)