You do not need a probability distribution on your probability distribution to represent uncertainty. The uncertainty is captured by the spread (variance) of your prior. I think you are confusing the map and the map of the map.
First, consider whether the thing you want to know the truth about is a true/false proposition or something that can take more than two possible values.
Let’s imagine you want to know the true value of a number X between negative and positive infinity. Scientist 1 tells you “My prior is represented by a normal distribution with mean 0 and standard deviation 1”. Scientist 2 says the same thing, except his standard deviation is 10.
These two scientists have the same belief about the most likely value of X, but they have different certainties. This difference will be reflected in how they respond to data: Scientist 2 will always adjust his beliefs more in response to any new evidence. The point is that you are able to reflect the uncertainty of the beliefs in the prior itself.
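The two-scientist comparison can be sketched numerically. This is a minimal illustration, assuming both priors are normal and that the new evidence is a single measurement with known normal noise; the observation value and the noise level are hypothetical and not part of the original example.

```python
def update_normal_prior(prior_mean, prior_sd, observation, noise_sd):
    """Posterior mean and sd for a normal prior and one normally distributed observation."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / noise_sd ** 2
    posterior_precision = prior_precision + data_precision
    # The posterior mean is a precision-weighted average of prior mean and observation.
    posterior_mean = (
        prior_precision * prior_mean + data_precision * observation
    ) / posterior_precision
    posterior_sd = posterior_precision ** -0.5
    return posterior_mean, posterior_sd


observation = 5.0  # hypothetical measurement of X
noise_sd = 2.0     # assumed measurement noise

mean_1, sd_1 = update_normal_prior(0.0, 1.0, observation, noise_sd)   # Scientist 1
mean_2, sd_2 = update_normal_prior(0.0, 10.0, observation, noise_sd)  # Scientist 2

print(f"Scientist 1: mean {mean_1:.2f}, sd {sd_1:.2f}")
print(f"Scientist 2: mean {mean_2:.2f}, sd {sd_2:.2f}")
```

Under these assumptions Scientist 1 moves from 0 to 1.0, while Scientist 2 moves from 0 to about 4.81: the same evidence shifts the wider prior much further.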
Next, let’s imagine you are interested in a true/false statement. Since there are only two possibilities (law of excluded middle) you can represent your beliefs with a Bernoulli distribution. This distribution has only one parameter, p, and its variance is equal to p(1-p). Therefore, your estimate p tells me everything there is to know about how certain you are.
If you claim “I believe the statement is true with probability 50%” you have committed yourself to updating that probability only by the likelihood ratio associated with future evidence, which depends only on the probability of the outcome given the hypothesis. This likelihood ratio simply cannot depend on how certain you are about the hypothesis.
The only meaningful interpretation of a probability on a probability is that you are unsure about what you actually believe. In other words, you are trying to make a map of your map. For example, you can say “I believe with probability 1⁄4 that I believe that p=0.40, and I believe with probability 3⁄4 that I believe that p=0.60.” This, however, logically implies that you believe the statement is true with p=0.55 (1⁄4 × 0.40 + 3⁄4 × 0.60), which is the only thing that determines how you update your beliefs in response to new evidence.
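The collapse argument can be checked with a few lines of arithmetic. The sketch below assumes the evidence depends only on whether the statement is true, not on which meta-level belief is “really” held; the likelihood values are hypothetical. The weighted average works out to 0.25 × 0.40 + 0.75 × 0.60 = 0.55.

```python
def bayes_update(p, likelihood_true, likelihood_false):
    """Posterior probability of a true/false statement after one piece of evidence."""
    numerator = p * likelihood_true
    return numerator / (numerator + (1 - p) * likelihood_false)


def update_with_meta_levels(meta_beliefs, likelihood_true, likelihood_false):
    """Same update, carried out on the joint of (meta-level belief, statement)."""
    joint_true = sum(w * p * likelihood_true for w, p in meta_beliefs)
    joint_false = sum(w * (1 - p) * likelihood_false for w, p in meta_beliefs)
    return joint_true / (joint_true + joint_false)


# (weight on the meta-level belief, probability it assigns to the statement)
meta_beliefs = [(0.25, 0.40), (0.75, 0.60)]
collapsed_p = sum(w * p for w, p in meta_beliefs)

# Hypothetical evidence: four times as likely if the statement is true.
likelihood_true, likelihood_false = 0.8, 0.2

direct = bayes_update(collapsed_p, likelihood_true, likelihood_false)
via_joint = update_with_meta_levels(meta_beliefs, likelihood_true, likelihood_false)

print(f"collapsed prior: {collapsed_p:.2f}")
print(f"posterior from collapsed prior: {direct:.4f}")
print(f"posterior keeping the meta-level: {via_joint:.4f}")
```

Both routes give the same posterior here, which is the sense in which the meta-level adds nothing for a single true/false statement under this assumption.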
Of course, if you obtain new information about what you truly believe (which is independent of whether the statement is true), you could update your prior on your prior. However, I fail to see what this represents or why this idea would be useful.
You do not need a probability distribution on your probability distribution to represent uncertainty.
I think I do.
The uncertainty is captured by the spread (variance) of your prior.
First, my prior is a probability distribution, isn’t it? Second, some but not all uncertainty is captured by the variance of my prior. For example, I could be uncertain about the shape of the distribution: it might be skewed, but I’m not sure whether it actually is. Or I might not know whether I’m looking at a Student’s t (which has a defined mean, given more than one degree of freedom) or a Cauchy (which doesn’t). How will I express that uncertainty?
The only meaningful interpretation of a probability on a probability, is if you are unsure about what you actually believe.
So, what’s wrong with that? Of course I am unsure of what I actually believe. Say I have some prior about the future values of X, but my confidence in that prior is not 100%; it’s quite possible that my prior is wrong. You basically want to collapse all the meta-levels into a single prior, and I think that having one or more meta-levels is actually useful for thinking about the situation.
I suggest that:
Your uncertainty is fully captured by your prior.
Your uncertainty about X need not be fully captured by your prior distribution for X.
When it isn’t, the other things it involves may not be best thought of in terms of your probability distribution for anything.
Example: you are looking at the results of a scientific experiment. You have two rival theories for what’s going on. One predicts that the frobulator will show an average reading of 11.3, with variance of 3 units and something very close to a normal distribution. One predicts the same average and variance, but expects an exponential distribution. And it’s also possible that neither existing theory is right, in which case almost anything could happen, though earlier experiments suggest that readings less than 0.4 or more than 21 are extremely unlikely.
I suggest that in this case your uncertainty about the next frobulator reading is reasonably well captured by the following structure:
You assign, let’s say, p=0.6 that Theory A is basically right, in which case the next reading will be roughly normally distributed if measured correctly.
You assign, let’s say, p=0.37 that Theory B is basically right, in which case the next reading will be roughly exponentially distributed if measured correctly.
You assign p=0.03 that neither theory is correct, conditional on which you have a largely atheoretical prior that maybe looks roughly normal but with larger variance than either Theory A or Theory B.
You are aware that sometimes measurements are messed up, so you expect that with some quite small probability the measured result will be corrupted in some way you could probably write down a crude distribution for (obtained by reflecting on the kinds of mistakes people make, or past experience of measurement cockups, or something).
So you have uncertainties about things other than the next frobulator reading, but it would be misleading to describe them as uncertainties about your probability distribution; e.g. the sort of thing that would change your prior would be discovering evidence from some other source that favours Theory B, or learning that the person taking the measurements is a hopeless klutz whose mistakes have caused trouble in the past.
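The structure described above can be written down as an ordinary mixture over theories. This is a rough sketch, not the commenter’s own model: the exponential density for Theory B is only a stand-in (an exponential with mean 11.3 does not have variance 3), the wide normal for “neither” is an assumed fallback with an arbitrary standard deviation of 5, and the measurement-error component is left out. The readings passed in are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def exponential_pdf(x, mean):
    return math.exp(-x / mean) / mean if x >= 0 else 0.0


# Prior probabilities of the theories, taken from the example.
priors = {"A": 0.60, "B": 0.37, "neither": 0.03}

# Density of the next reading under each theory (B and "neither" are assumptions).
likelihoods = {
    "A": lambda x: normal_pdf(x, 11.3, math.sqrt(3)),
    "B": lambda x: exponential_pdf(x, 11.3),
    "neither": lambda x: normal_pdf(x, 11.3, 5.0),
}


def predictive_density(x):
    """Density of the next reading, averaging over the theories."""
    return sum(priors[t] * likelihoods[t](x) for t in priors)


def posterior_over_theories(x):
    """Updated theory probabilities after observing reading x."""
    joint = {t: priors[t] * likelihoods[t](x) for t in priors}
    total = sum(joint.values())
    return {t: joint[t] / total for t in joint}


for reading in (11.3, 2.0):  # hypothetical readings
    posterior = posterior_over_theories(reading)
    summary = ", ".join(f"{t}: {p:.3f}" for t, p in posterior.items())
    print(f"reading {reading}: {summary}")
```

In this sketch the weights 0.6, 0.37 and 0.03 are probabilities of theories about the world, and a new reading moves them by Bayes’ rule; nothing in the calculation assigns a probability to the probability assignment itself.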
Your uncertainty is fully captured by your prior.
Including my Knightian uncertainty?
Besides, isn’t your first point contradicted by the two following ones?
You assign, let’s say, p=0.6 that Theory A is basically right
How do I express my uncertainty about that 0.6 number?
So you have uncertainties about things other than the next frobulator reading, but it would be misleading to describe them as uncertainties about your probability distribution
I don’t know about that. I am uncertain about the next frobulator reading. I’m treating this reading as a random variable arising out of an unobserved process (=some unobserved distribution). This unobserved process/distribution has a set of parameters theta. I am uncertain about these parameters. Would you describe the uncertainty about these parameters as “uncertainties about [my] probability distribution”?
I don’t really believe in “Knightian uncertainty” as a fundamental notion, but in so far as you have it I’m not sure you can properly be said to have a prior at all.
Your “uncertainty about that 0.6 number” is a meaningful notion only when there’s something in (your model of) the world for it to be about. For instance, perhaps your opinion that Theory A is a bit more likely than not is the result of your having read a speculative paper by someone you think is an expert; but if you think there’s a 10% chance she’s a charlatan, maybe it would be useful to represent that as p=0.9 of (65% Theory A, 32% Theory B, 3% neither) plus p=0.1 of some other probability distribution over theories. (If that’s the only impact of learning that the author is or isn’t a charlatan, this doesn’t buy you anything relative to just figuring out the overall probabilities for A, B, and Neither; but e.g. perhaps if the author is a charlatan then your ideas about how things might look if A and B are both wrong will change.)
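The expert-or-charlatan example is again a mixture, this time over a fact about the author. In the sketch below the 0.9 weight and the (65%, 32%, 3%) split come from the example; the distribution used if she is a charlatan is hypothetical, chosen only so that the overall numbers come out at 0.6, 0.37 and 0.03.

```python
p_expert = 0.9

# Probabilities over theories if the author is a genuine expert (from the example).
if_expert = {"A": 0.65, "B": 0.32, "neither": 0.03}

# Probabilities over theories if she is a charlatan (hypothetical numbers).
if_charlatan = {"A": 0.15, "B": 0.82, "neither": 0.03}

# Overall probabilities: average the two cases, weighted by belief in the author.
overall = {
    theory: p_expert * if_expert[theory] + (1 - p_expert) * if_charlatan[theory]
    for theory in if_expert
}

for theory, p in overall.items():
    print(f"{theory}: {p:.2f}")
```

Learning that the author is a charlatan would replace `overall` with `if_charlatan`. On this reading, the apparent uncertainty about the 0.6 is uncertainty about the author, which is a fact about the world.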
But your estimate of p=0.6 as such—I think asking for your uncertainty about it is a type error.
(It might be fruitful in practice to put probability distributions on such things—it might be easier and almost as accurate as figuring out all the intricate evidential structures that I’m suggesting are the “real” underpinnings of the kind of uncertainty that makes it feel like a good thing to do. But I think that’s a heuristic technique and I’m not convinced that there’s a way to make it rigorous that doesn’t cash it out in terms of the kind of thing I’ve been describing.)
Would you describe the uncertainty about these parameters as “uncertainties about [my] probability distribution”?
No. I think you’re making a type error again. The unobserved process is, or describes, some physical thing within the world; its parameters, whatever they may be, are facts about the world. You are (of course) uncertain about them; that uncertainty is part of your current probability distribution over ways-the-world-could-be. (You may also be uncertain about whether the actual process is the sort you think it is; again, that’s represented by your probability distribution over how the world is.)
None of that involves making your probability assignment apply to itself.
Now, having said all that: you are part of the world, and you may in fact be uncertain about various aspects of your mind, including your probability assignments. So if you are trying to predict your own future behaviour or something, then for that purpose you may want to introduce something like uncertainty about your probability distribution. But I think you shouldn’t identify your model of your probability distribution, as here, with the probability distribution you’re using for calculation, as in the previous paragraphs. (In particular, I suspect that assuming they match may lead you into inconsistencies.)
Let me express my approach in a slightly different way.
Let’s say I have a sample of some numbers and I’m interested in the properties of future numbers coming out of the same underlying process.
The simplest approach (say, Level 1) is to have a point estimate. Here is my expected value for the future numbers.
But wait! There is uncertainty. At Level 2 I specify a distribution, say, a Gaussian with a particular mean and standard deviation (note that it implies e.g. very specific “hard” probabilities of seeing particular future numbers).
But wait! There is more uncertainty! At Level 3 I specify that the mean of that Gaussian is actually uncertain, too, and has a standard error—in effect it is a distribution (meaning your “hard” probabilities from the previous level just became “soft”). And the variance is uncertain, too, and has parameters of its own.
But wait! You can dive deeper and find yet more turtles down there.
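The three levels can be made concrete with a small sample. This is a simplified sketch: the sample values and the threshold are hypothetical, the standard deviation is treated as known at Level 3 (the comment also lets the variance be uncertain), and the uncertain mean is given a normal distribution with the usual standard error.

```python
import math
from statistics import NormalDist, mean, stdev

sample = [9.8, 11.1, 10.4, 12.0, 10.9, 11.6, 10.2, 11.3]  # hypothetical data

# Level 1: a point estimate of future values.
point_estimate = mean(sample)

# Level 2: a Gaussian with a fixed mean and standard deviation.
sd = stdev(sample)
level_2 = NormalDist(point_estimate, sd)

# Level 3: the mean itself is uncertain, with standard error sd / sqrt(n).
# Averaging over a normally distributed mean gives a normal predictive
# distribution whose variance is sd^2 + standard_error^2.
standard_error = sd / math.sqrt(len(sample))
level_3 = NormalDist(point_estimate, math.sqrt(sd ** 2 + standard_error ** 2))

threshold = 13.0  # hypothetical value of interest
p_level_2 = 1 - level_2.cdf(threshold)
p_level_3 = 1 - level_3.cdf(threshold)

print(f"Level 1 point estimate: {point_estimate:.3f}")
print(f"Level 2 P(next > {threshold}): {p_level_2:.4f}")
print(f"Level 3 P(next > {threshold}): {p_level_3:.4f}")
```

Note that for predicting the next number, the Level 3 calculation ends up as a single, wider distribution. Whether that is best described as an extra meta-level or as one prior with the levels folded in is the point under dispute in this thread.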
but in so far as you have it I’m not sure you can properly be said to have a prior at all.
I have an uncertain prior. I find that notion intuitive; it seems that you don’t.
Your “uncertainty about that 0.6 number” is a meaningful notion only when there’s something in (your model of) the world for it to be about.
It is uncertainty about the probability that Theory A is correct. I find the idea of “uncertainty about the probability” meaningful and useful.
I think that in a large number of cases you just do not have enough data for “figuring out all the intricate evidential structures” and the “heuristic technique” is all you can do. As for being rigorous, I’ll be happy if in the limit it converges to the right values.
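One standard way to put a distribution on a probability, and to see the “converges in the limit” behaviour, is a Beta prior on the success rate of a repeatable event. This is offered as an illustration of the heuristic, not necessarily what the commenter has in mind; the two priors and the observed counts below are hypothetical.

```python
def beta_posterior_mean(alpha, beta, successes, failures):
    """Posterior mean of a success rate under a Beta(alpha, beta) prior."""
    return (alpha + successes) / (alpha + beta + successes + failures)


# Two priors with the same mean (0.5) but very different confidence in it.
weak_prior = (1, 1)
strong_prior = (50, 50)

# Hypothetical data sets: (successes, failures).
small_data = (8, 2)
large_data = (6000, 4000)

for label, (successes, failures) in (("10 trials", small_data), ("10000 trials", large_data)):
    weak = beta_posterior_mean(*weak_prior, successes, failures)
    strong = beta_posterior_mean(*strong_prior, successes, failures)
    print(f"{label}: weak prior -> {weak:.4f}, strong prior -> {strong:.4f}")
```

With little data the two priors give noticeably different estimates even though they start from the same point probability; with a lot of data both end up close to the observed frequency.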
that’s represented by your probability distribution over how the world is
But I don’t have one. I’m not Omega; the world is too large for me to have a probability distribution over it. I’m building models all of which are wrong but some of which are useful (hat tip to George Box). It is useful to me to have multilayered models which involve probabilities of probabilities.
I think we are basically talking about whether to collapse all the meta-levels into one (your and Anders_H’s position) or not collapse them (my position).