I’m pretty sure nothing I say here will be new to you, so consider this more of an effort to explain to you where I (and I think also Jonah, though I won’t categorically speak for him) am coming from.
Jonah was looking at probability distributions over estimates of an unknown probability (such as the probability of a coin coming up heads). Unless you have some objection to probability distributions per se, I don’t see anything wrong with taking a probability distribution to describe one’s current state of knowledge of a probability.
If your goal is to answer the question “Will this coin come up heads?” for a single coin toss, and you can’t run any experiments to augment your knowledge of the model but only have access to your prior knowledge, then all your knowledge is indeed captured in a single probability number; if you have a subjective probability distribution over the coin’s bias, that number is simply the expected value of the distribution.
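In symbols (writing f(p) for the subjective density over the bias p — notation mine, not from the thread):

```latex
P(\text{heads}) = \int_0^1 p \, f(p) \, dp = \mathbb{E}[p]
```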
If, however, you are trying to answer the similar question “Will this coin come up heads when I toss it on such-and-such date at such-and-such time?” and you can run experiments before then, it would make sense to use those experiments to try to understand the model that governs the coin tossing. Your model may be something like “I believe, with near certainty, that there is a probability p such that each toss turns up heads with probability p, independent of the time and place of the toss. I also have a Bayesian prior distribution over p.” You would start with the prior and then run coin-tossing experiments to keep updating that probability distribution of probabilities. The day before your grand toss, you would take the expected value of the distribution you have obtained by then. But at intermediate stages it makes sense to store the entire distribution rather than just its expected value (the point estimate of the probability). For instance, if you think that the coin is either fair (probability 1⁄3), always heads (probability 1⁄3), or always tails (probability 1⁄3), then it’s worth storing that full prior rather than simply saying that there’s a 50% chance of heads, so that you can update appropriately on the evidence. I could also construct higher-order versions of this hypothetical, but they would be too tedious to describe.
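As a minimal sketch of this updating (hypothetical code; the discussion itself contains none), here is the three-hypothesis prior being updated on a single observed toss:

```python
# Discrete prior over the coin's bias p: fair, always heads, always tails.
hypotheses = [0.5, 1.0, 0.0]   # candidate values of p
prior = [1/3, 1/3, 1/3]        # equal prior weight on each

def update(belief, heads):
    """Bayes update of the belief over hypotheses on one observed toss."""
    likelihoods = [p if heads else 1 - p for p in hypotheses]
    unnormalised = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalised)
    return [w / total for w in unnormalised]

# Observe a single head: 'always tails' is eliminated, the rest reweighted.
posterior = update(prior, heads=True)
print(posterior)   # [0.333..., 0.666..., 0.0]

# The point estimate E[p] moves from 0.5 to 0.5*(1/3) + 1.0*(2/3) = 5/6,
# but only the stored distribution lets us keep updating on later tosses.
```

The point of storing the full distribution is visible here: the posterior over the three hypotheses supports further updates, whereas the single number 5⁄6 does not.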
Secondly, as Jonah said, if you’re running the coin-tossing experiment multiple times and computing the probability of, say, all heads, then the subjective probability distribution for p does matter: using just the point estimate (the expected value) of p would give the wrong answer.
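With the three-hypothesis prior above, the difference is stark (a worked calculation, not quoted from the thread). The correct answer averages p^n over the distribution, while the point-estimate answer raises E[p] to the n-th power:

```latex
P(\text{all heads in } n \text{ tosses})
  = \mathbb{E}[p^n]
  = \tfrac{1}{3}\left(\tfrac{1}{2}\right)^{n} + \tfrac{1}{3}(1)^{n} + \tfrac{1}{3}(0)^{n}
  = \tfrac{1}{3} + \tfrac{1}{3}\left(\tfrac{1}{2}\right)^{n},
\qquad
(\mathbb{E}[p])^{n} = \left(\tfrac{1}{2}\right)^{n}.
```

For n = 10 these are roughly 0.334 versus 0.00098 — off by more than two orders of magnitude.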
Sorry if this isn’t clear—I can elaborate more later.
“Jonah was looking at probability distributions over estimates of an unknown probability (such as the probability of a coin coming up heads)”
It sounds like you are just confusing epistemic probabilities with propensities, or frequencies. I.e., due to physics, the shape of the coin, and your style of flipping, a particular set of coin flips will have certain frequency properties that you can characterise by a bias parameter p, which you call “the probability of landing on heads”. This is just a parameter of a stochastic model, not a degree of belief.
However, you can have a degree of belief about what p is no problem. So you are talking about your degree of belief that a set of coin flips has certain frequentist properties, i.e. your degree of belief in a particular model for the coin flips.
edit: I could add that GIVEN a stochastic model you then have degrees of belief about whether a given coin flip will result in heads. But this is a conditional probability: see my other comment in reply to Vanvier.
This is not, however, “beliefs about beliefs”. It is just standard Bayesian modelling.
“This is just a parameter of a stochastic model, not a degree of belief.”
This is not exactly correct. It’s true that in general there’s a sharp distinction to be made between model parameters (which govern/summarize/encode properties of the entire stochastic process) and degrees of belief for various outcomes, but that distinction becomes very blurry in the current context.
What’s going on here is that the probability distribution for the observable outcomes is infinitely exchangeable. Infinite exchangeability gives rise to a certain representation for the predictive distribution under which the prior expected limiting frequency is mathematically equal to the marginal prior probability for any single outcome. So under exchangeability, it’s not an either/or—it’s a both/and.
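Concretely (stating de Finetti’s representation for reference; the symbols are mine): for an infinitely exchangeable sequence of binary outcomes x_1, x_2, … with prior F over the limiting frequency p,

```latex
P(x_1, \dots, x_n) = \int_0^1 \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} \, dF(p),
\qquad
P(x_1 = 1) = \int_0^1 p \, dF(p) = \mathbb{E}[p].
```

The prior expected limiting frequency and the marginal prior probability of a single head are literally the same integral, which is the both/and claim above.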
Are you referring to de Finetti’s theorem? I can’t say I understand your point. Does it relate to the edit I made shortly before your post? I.e., given a stochastic model with some parameters, you then have degrees of belief about certain outcomes, some of which may seem almost the same thing as the parameters themselves? I still maintain that the two are quite different: parameters characterise probability distributions, and just in certain cases happen to coincide with conditional degrees of belief. In this ‘beliefs about beliefs’ context, though, it is the parameters we have degrees of belief about; we do not have degrees of belief about the conditional degrees of belief with which said parameters may happen to coincide.
Yup, I’m referring to de Finetti’s theorem. Thing is, de Finetti himself would have denied that there is such a thing as a parameter—he was all about only assigning probabilities to observable, bet-on-able things. That’s why he developed his representation theorem. From his perspective, p arises as a distinct mathematical entity merely as a result of the representation provided by exchangeability. The meaning of p is to be found in the predictive distribution; to describe p as a bias parameter is to reify a concept which has no place in de Finetti’s Bayesian approach.
Now, I’m not a de-Finetti-style subjective Bayesian. For me, it’s enough to note that the math is the same whether one conceives of p as stochastic model parameter or as the degree of plausibility of any single outcome. That’s why I say it’s not either/or.
Hmm, interesting. I will go and learn more deeply what de Finetti was getting at. It is a little confusing… in this simple case, fine, p can be defined in a straightforward way in terms of the predictive distribution, but in more complicated cases this quickly becomes extremely difficult or impossible. For one thing, a single model with a single set of parameters may describe outcomes of vastly different experiments. E.g., consider Newtonian gravity. Strictly, the Newtonian-gravity part of the model has to be coupled to various other models to describe the specific details of the setup, but in all cases there is a parameter G for the universal gravitational constant. G affects the predictive distributions for all such experiments, so it is pretty hard to see how it could be defined in terms of them, at least in a concrete sense.
I’d guess that in Geisser-style predictive inference, the meaning or reality or what-have-you of G is to be found in the way it encodes the dependence (or maybe, compresses the description) of the joint multivariate predictive distribution. But like I say, that’s not my school of thought—I’m happy to admit the possibility of physical model parameters—so I really am just guessing.
Hmm, do you know of any good material to learn more about this? I am actually extremely sympathetic to any attempt to rid model parameters of physical meaning; I mean, in an abstract sense I am happy to have degrees of belief about them, but when it comes to elucidating priors I find it extremely difficult to argue about what it is sensible to believe a priori about parameters, particularly given parameterisation-dependence problems.
I am a particle physicist, and a particular problem I have is that parameters in particle physics are not constant; they vary with renormalisation scale (roughly, the energy of the scattering process). So if I want to argue about what it is a priori reasonable to believe about (say) the mass of the Higgs boson, it matters a very great deal at what energy scale I choose to define my prior for the parameters. If I naively choose a flat prior over low-energy values for the Higgs mass, it implies I believe some really special and weird things about the high-scale Higgs mass parameter values (they have to be fine-tuned to the bejesus), while if I believe something more “flat” about the high-scale parameters, that in turn implies something extremely informative about the low-scale values, namely that the Higgs mass should be really heavy (in the Standard Model—this is essentially the hierarchy problem, translated into Bayesian words).
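The scale dependence here is an instance of the general change-of-variables rule for priors (a generic statement, not anything specific to the Standard Model): if the parameters at two scales are related by a deterministic map θ′ = g(θ), then

```latex
\pi'(\theta') = \pi(\theta) \left| \det \frac{\partial \theta}{\partial \theta'} \right|,
```

so a prior that is flat in the low-scale parameterisation is generally far from flat in the high-scale one, and vice versa.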
Anyway, if I can more directly reason about the physically observable things and detach from the abstract parameters, it might help clarify how one should think about this mess...
I can pass along a recommendation I have received: Operational Subjective Statistical Methods by Frank Lad. I haven’t read the book myself, so I can’t actually vouch for it, but it was described to me as “excellent”. I don’t know if it is actively prediction-centered, but it should at least be compatible with that philosophy.
Thanks, this seems interesting. It is pretty radical; he is very insistent on the idea that for all ‘quantities’ about which we want to reason there must be some operational procedure we can follow in order to find out what they are. I don’t know what this means for the ontological status of physical principles, models, etc., but I can at least see the naive appeal… it does make it hard to understand how a model could ever have the power to predict new things we have never seen before, though, like Higgs bosons...
I understand this, though I hadn’t thought of it with such clear terminology. I think the point Jonah was making was that in many cases, people are talking about propensities/frequencies when they refer to probabilities. So it’s not so much that Jonah or I are confusing epistemic probabilities with propensities/frequencies; it’s that many people use the term “probability” to refer to the latter. With language used this way, the probability distribution for this model parameter can be called the “probability distribution of the probability estimate.” If you reserve the term “probability” exclusively for epistemic probability (degree of belief), then this would constitute an abuse of language.
Sure, I don’t want to suggest we only use the word ‘probability’ for epistemic probabilities (although the world might be a better place if we did...), only that if we use the word to mean different sorts of probabilities in the same sentence, or even whole body of text, without explicit clarification, then it is just asking for confusion.
“Jonah was looking at probability distributions over estimates of an unknown probability”
What is an unknown probability? Forming a probability distribution means rationally assigning degrees of belief to a set of hypotheses. The very act of rational assignment entails that you know what it is.
That distribution of coin biases is a hyperprior.