Given 1. your model and 2. the magical assumption of no uncertainty in theta, then it's just theta. The posterior predictive allows us to jump from inference about parameters to inference about new data; it's a distribution over y (the coin flip outcomes), not over theta (which describes the frequency).
I think I have finally got it. I would like to thank you once again for all your help; I really appreciate it.
This is what I think “estimating the probability” means:
We define theta to be a real-world/objective/physical quantity s.t. P(H|theta=alpha) = alpha & P(T|theta=alpha) = 1 - alpha. We do not talk about the nature of this quantity theta because we do not care what it is. I don’t think it is appropriate to say that theta is “frequency” for this reason:
“frequency” is not a well-defined physical quantity. You can’t measure “frequency” like you measure temperature.
But we do not need to dispute this, as calling theta a "frequency" is unnecessary.
Using the above definitions, we can compute the likelihood, then the posterior, and then the posterior predictive, which represents the probability of heads on the next flip given data from previous flips.
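Concretely, with a conjugate Beta prior this whole pipeline is a few lines. Here is a minimal sketch (the Beta(1, 1) prior and the 7 heads / 3 tails data are made-up numbers for illustration, not anything from the thread):

```python
# Likelihood -> posterior -> posterior predictive for coin flips,
# assuming a conjugate Beta(1, 1) (uniform) prior on theta.

a, b = 1, 1            # Beta prior hyperparameters
heads, tails = 7, 3    # observed flips (illustrative)

# Conjugate update: posterior over theta is Beta(a + heads, b + tails)
a_post, b_post = a + heads, b + tails

# Posterior predictive probability of heads on the next flip:
# P(H | data) = E[theta | data] = a_post / (a_post + b_post)
p_next_heads = a_post / (a_post + b_post)
print(p_next_heads)    # 8/12 ≈ 0.667
```

The conjugacy is what makes the integral over theta collapse to the posterior mean here; for non-conjugate models the same quantity has to be approximated by sampling.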
Is the above accurate?
So Bayesians who say that theta is the probability of heads and compute a point estimate of the parameter theta and say that they have “estimated the probability” are just frequentists in disguise?
I think the above is accurate.
I disagree with the last part, but there are two sources of confusion:
Frequentist vs. Bayesian is in principle about priors, but in practice about point estimates vs. distributions.
Good frequentists use distributions and bad Bayesians use point estimates such as Bayes factors; a good review of this is https://link.springer.com/article/10.3758/s13423-016-1221-4
But the leap from theta to the probability of heads is, I think, an intuitive leap that happens to be correct but is unjustified.
Philosophically, then, the posterior predictive is actually frequentist; allow me to explain:
Frequentists are people who estimate a parameter, then draw fake samples from that point estimate and summarize them in confidence intervals; to justify this they imagine parallel worlds and whatnot.
Bayesians are people who assume a prior distribution from which the parameter is drawn; they thus have both prior and likelihood uncertainty, which together give posterior uncertainty: the uncertainty about the parameters in their model. When a Bayesian wants to use his model to make predictions, he integrates the model parameters out and thus obtains a predictive distribution of new data given the observed data*. Because this is a distribution over the data, like the frequentist's sampling distribution, we can draw from it multiple times to compute summary statistics, much like the frequentists do, and calculate things such as a "Bayesian p-value", which describes how likely the model is to have generated our data; here the goal is for the p-value to be high, because that suggests the model describes the data well.
*In the real world they do not integrate theta out analytically; they draw it 10,000 times and use those samples as a stand-in for the distribution, because the math is too hard for complex models.
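The recipe in that footnote can be sketched directly: draw theta from the posterior many times, simulate a replicate dataset from each draw, and compare a test statistic on the replicates to the observed one. The Beta(1, 1) prior, the data, and the choice of test statistic below are illustrative assumptions, not anything from the thread:

```python
import random

# Posterior predictive check ("Bayesian p-value") by sampling:
# theta ~ posterior, y_rep ~ model(theta), compare a test statistic
# on the replicates to the observed value.

random.seed(0)
a, b = 1, 1                                  # Beta prior on theta
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]        # observed flips (1 = heads)
n, heads = len(data), sum(data)
a_post, b_post = a + heads, b + (n - heads)  # conjugate Beta posterior

t_obs = heads                                # test statistic: number of heads
draws = 10_000                               # "10,000 times", as in the footnote
extreme = 0
for _ in range(draws):
    theta = random.betavariate(a_post, b_post)         # theta ~ posterior
    y_rep = [int(random.random() < theta) for _ in range(n)]
    if sum(y_rep) >= t_obs:                            # replicate >= observed
        extreme += 1

bayes_p = extreme / draws    # values near 0 or 1 would signal model misfit
print(bayes_p)
```

Because the coin model is fit to the very data it is being checked against, the p-value here comes out unremarkable (near the middle of [0, 1]), which is exactly the "model describes the data well" outcome described above.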
Excellent! One final point that I would like to add: if we say that "theta is a physical quantity s.t. [...]", we are faced with an ontological question: "does a physical quantity with these properties exist?".
I recently found out about Professor Jaynes' A_p distribution idea, introduced in chapter 18 of his book, from Maxwell Peterson in the sub-thread below, and I believe it is an elegant workaround to this problem. It leads to the same results but is more satisfying philosophically.
This is how it would work in the coin flipping example:
Define A(u) to be a function that maps real numbers to propositions, with domain [0, 1], s.t.
1. The set of propositions {A(u): 0 ≤ u ≤ 1} is mutually exclusive and exhaustive
2. P(y=1 | A(u)) = u and P(y=0 | A(u)) = 1 - u
Because the set of propositions is mutually exclusive and exhaustive, there is exactly one u s.t. A(u) is true, and for any v != u, A(v) is false. We call this unique value of u theta.
It follows that P(y=1 | theta) = theta and P(y=0 | theta) = 1 - theta, and we use this to calculate the posterior predictive distribution.
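And indeed the A(u) formulation gives the same numbers as before: P(y=1 | data) = ∫ u · p(u | data) du, which a crude numerical integral over the A(u) grid confirms. The Beta(1, 1) prior and the 7 heads / 3 tails data are illustrative assumptions again:

```python
# P(y=1 | data) = integral over u of u * p(u | data), i.e. the posterior
# mean of u. With a Beta(1, 1) prior and 7 heads / 3 tails, p(u | data)
# is Beta(8, 4), so the exact answer is 8 / 12.

a_post, b_post = 1 + 7, 1 + 3   # Beta posterior from the illustrative data

def beta_density(u):
    # unnormalized Beta(a_post, b_post) density; normalized numerically below
    return u ** (a_post - 1) * (1 - u) ** (b_post - 1)

N = 100_000                     # midpoint Riemann sum over the A(u) grid
grid = [(i + 0.5) / N for i in range(N)]
weights = [beta_density(u) for u in grid]
z = sum(weights)
p_heads = sum(u * w for u, w in zip(grid, weights)) / z
print(round(p_heads, 4))        # 0.6667, matching 8/12
```

So nothing changes computationally; the A(u) construction only replaces the ontological claim about a physical theta with a statement about which proposition in an exhaustive set is true.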