Well, we could also assume and specify additional things that would make “p(A)” make sense even if “A” is not a statement in some formal system.
Do you mean, for example, that p could be a measure and A could be a set? Since komponisto was talking about expressions of the form p(A) such that A can appear in expressions like A&B, I understood the context to be one in which we were already considering p to be a function over sentences or propositions (which, following komponisto, I was equating), and not, for example, sets.
Do you mean that “p(A)” can make sense in some case where A is a sentence, but not a sentence in some formal system? If so, would you give an example? Do you mean, for example, that A could be a statement in some non-formal language like English?
Or do you mean something else?
In my own interpretation, A is a hypothesis: something that represents a possible state of the world. Hypotheses are of course subject to Boolean algebra, so you could perhaps model them as sentences or sets.
You have made a number of interesting comments that will probably take me some time to respond to.
I’ve been trying to develop a formal understanding of your claim that the prior probability of an unknown arbitrary hypothesis A makes sense and should equal 0.5. I’m not there yet, but I have a couple of tentative approaches. I was wondering whether either one looks at all like what you are getting at.
The first approach is to let the sample space Ω be the set of all hypotheses, endowed with a suitable probability distribution p. It’s not clear to me what probability distribution p you would have in mind, though. Presumably it would be “uniform” in some appropriate sense, because we are supposed to start in a state of complete ignorance about the elements of Ω.
At any rate, you would then define the random variable v : Ω → {True, False} that returns the actual truth value of each hypothesis. The quantity “p(A), for arbitrary unknown A” would be interpreted to mean the value of p(v = True). One would then show that half of the hypotheses in Ω (with respect to p-measure) are true. That is, one would have p(v = True) = 0.5, yielding your claim.
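To make this concrete, here’s a toy sketch of this first approach (the four hypotheses, their truth values, and the uniform choice of p are all invented for illustration; nothing here settles what p should actually be):

```python
from fractions import Fraction

# Toy version of the first approach. The four hypotheses and their actual
# truth values are invented, as is the uniform choice of p.
omega = {
    "H1": True,
    "H2": False,
    "H3": True,
    "H4": False,
}

# Take p uniform over the hypotheses (one guess at "complete ignorance").
p = {h: Fraction(1, len(omega)) for h in omega}

# v returns each hypothesis's actual truth value, so p(v = True) is the
# total p-measure of the hypotheses that happen to be true.
p_v_true = sum(p[h] for h in omega if omega[h])
print(p_v_true)  # 1/2, but only because this example was built that way
```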
I have two difficulties with this approach. First, as I mentioned, I don’t see how to define p. Second, as I mentioned in this comment, “the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map.” (ETA: I should emphasize that this second difficulty seems fatal to me. Defining p might just be a technicality. But making probability a property of the territory is fundamentally contrary to the Bayesian Way.)
The second approach tries to avoid that last difficulty by going “meta”. Under this approach, you would take the sample space Ω to be the set of logically consistent possible worlds. More precisely, Ω would be the set of all valuation maps v : {hypotheses} → {True, False} assigning a truth value to every hypothesis. (By calling a map v a “valuation map” here, I just mean that it respects the usual logical connectives and quantifiers. E.g., if v(A) = True and v(B) = True, then v(A & B) = True.) You would then endow Ω with some appropriate probability distribution p. However, again, I don’t yet see precisely what p should be.
Then, for each hypothesis A, you would have a random variable V_A : Ω → {True, False} that equals True on precisely those valuation maps v such that v(A) = True. The claim that “p(A) = 0.5 for arbitrary unknown A” would unpack as the claim that, for every hypothesis A, p(V_A = True) = 0.5 — that is, that each hypothesis A is true in exactly half of all possible worlds (with respect to p-measure).
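Here’s a small sketch of that second approach with three atomic hypotheses and the uniform measure over atom-assignments (both choices are my own assumptions, made only to have something computable):

```python
from itertools import product

# Worlds generated by assigning True/False to three atomic hypotheses;
# compound sentences then get their values forced by the connectives.
atoms = ["A", "B", "C"]
worlds = [dict(zip(atoms, bits)) for bits in product([True, False], repeat=3)]

# Under the uniform measure, each atomic hypothesis is true in exactly
# half of the worlds:
for a in atoms:
    print(a, sum(1 for v in worlds if v[a]) / len(worlds))  # 0.5 each

# But a compound hypothesis need not be: A & B holds in only a quarter of
# these worlds. So requiring p(V_A = True) = 0.5 for *every* hypothesis A
# constrains which measures p could possibly work.
print("A&B", sum(1 for v in worlds if v["A"] and v["B"]) / len(worlds))  # 0.25
```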
Do either of these approaches look to you like they are on the right track?
ETA: Here’s a third approach which combines the previous two: When you’re asked “What’s p(A), where A is an arbitrary unknown hypothesis?”, and you are still in a state of complete ignorance, then you know neither the world you’re in, nor the hypothesis A whose truth in that world you are being asked to consider. So, let the sample space Ω be the set of ordered pairs (v, A), where v is a valuation map and A is a hypothesis. You endow Ω with some appropriate probability distribution p, and you have a random variable V : Ω → {True, False} that maps (v, A) to True precisely when v(A) = True — i.e., when A is true under v. You give the response “0.5” to the question because (we suppose) p(V = True) = 0.5.
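Again, a toy sketch, taking p to be uniform over the pairs purely for illustration (precisely the kind of choice I don’t see how to justify in general):

```python
from itertools import product
from fractions import Fraction

# Third approach in miniature: Omega is the set of pairs (v, A), with two
# invented atomic hypotheses and the uniform measure over pairs.
atoms = ["A", "B"]
worlds = [dict(zip(atoms, bits)) for bits in product([True, False], repeat=2)]
omega = [(v, a) for v in worlds for a in atoms]

# V maps (v, A) to True exactly when A is true under v.
true_pairs = [(v, a) for (v, a) in omega if v[a]]

print(Fraction(len(true_pairs), len(omega)))  # 1/2 under the uniform measure
```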
But I still don’t see how to define p. Is there a well-known and widely-agreed-upon definition for p? On the one hand, p is a probability distribution over a countably infinite set (assuming that we identify the set of hypotheses with the set of sentences in some formal language). [ETA: That was a mistake. The sample space is countable in the first of the approaches above, but there might be uncountably many logically consistent ways to assign truth values to hypotheses.] On the other hand, it seems intuitively like p should be “uniform” in some sense, to capture the condition that we start in a state of total ignorance. How can these conditions be met simultaneously? (In the countable case, they can’t: a uniform weight ε > 0 on each element would sum to infinity, while a uniform weight of 0 would sum to 0 rather than 1.)
I think the second approach (and possibly the third also, but I haven’t yet considered it as deeply) is close to the right idea.
It’s pretty easy to see how it would work if there are only a finite number of hypotheses, say n: in that case, Ω is basically just the collection of binary strings of length n (assuming the hypothesis space is carved up appropriately), and each map V_A is evaluation at a particular coordinate. Sure enough, at each coordinate, half the elements of Ω evaluate to 1, and half to 0!
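Here’s a quick enumeration check of that finite picture (taking n = 4 arbitrarily):

```python
from itertools import product

# Omega = all binary strings of length n; V_A = evaluation at coordinate A.
n = 4
omega = list(product([0, 1], repeat=n))

# At every coordinate, exactly half of the 2**n strings carry a 1.
for k in range(n):
    print(k, sum(v[k] for v in omega) / len(omega))  # 0.5 at each coordinate
```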
More generally, one could imagine a probability distribution on the hypothesis space controlling the “weighting” of elements of Ω. For instance, if hypothesis #6 gets its probability raised, then those mappings v in Ω such that v(6) = 1 would be weighted more than those such that v(6) = 0. I haven’t checked that this type of arrangement is actually possible, but something like it ought to be.
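For instance, something like the following product-style weighting might do it (the particular numbers, and the use of coordinate 2 to stand in for “hypothesis #6”, are just an illustration):

```python
from itertools import product
from math import prod

# Give coordinate k its own probability q[k] of being 1, and weight each
# string v by the product of its coordinates' probabilities.
n = 4
q = [0.5, 0.5, 0.9, 0.5]  # coordinate 2 has had its probability raised
omega = list(product([0, 1], repeat=n))

def weight(v):
    return prod(q[k] if v[k] == 1 else 1 - q[k] for k in range(n))

print(sum(weight(v) for v in omega))               # ~1.0: weights form a distribution
print(sum(weight(v) for v in omega if v[2] == 1))  # ~0.9: v(2) = 1 maps outweigh...
print(sum(weight(v) for v in omega if v[2] == 0))  # ~0.1: ...the v(2) = 0 maps
```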
> It’s pretty easy to see how it would work if there are only a finite number of hypotheses, say n: in that case, Ω is basically just the collection of binary strings of length n (assuming the hypothesis space is carved up appropriately), and each map V_A is evaluation at a particular coordinate. Sure enough, at each coordinate, half the elements of Ω evaluate to 1, and half to 0!
Here are a few problems that I have with this approach:
1. This approach makes your focus on the case where the hypothesis A is “unspecified” seem very mysterious. Under this model, we have p(V_A = True) = 0.5 even for a hypothesis A that is entirely specified, down to its last bit. So why all the talk about how a true prior probability for A needs to be based on complete ignorance even of the content of A? Under this model, even if you grant complete knowledge of A, you’re still assigning it a prior probability of 0.5. Much of the push-back you got seemed to be around the meaningfulness of assigning a probability to an unspecified hypothesis. But you could have sidestepped that issue and still established the claim in the OP under this model, because here the claim is true even of specified hypotheses. (However, you would still need to justify that this model is how we ought to think about Bayesian updating. My remaining concerns address this.)
2. By having Ω be the collection of all bit strings of length n, you’ve dropped the condition that the maps v respect logical operations. This is equivalent to dropping the requirement that the possible worlds be logically possible. E.g., your sample space would include maps v such that v(A) = v(~A) for some hypothesis A. But, maybe you figure that this is a feature, not a bug, because knowledge about logical consistency is something that the agent shouldn’t yet have in its prior state of complete ignorance. But then …
3. … If the agent starts out as logically ignorant, how can it work with only a finite number of hypotheses? It doesn’t start out knowing that A, A&A, A&A&A, etc., can all be collapsed down to just A, and that’s infinitely many hypotheses right there. But maybe you mean for the n hypotheses to be “atomic” propositions, each represented by a distinct proposition letter A, B, C, …, with no logical dependencies among them, and all other hypotheses built up out of these “atoms” with logical connectives. It’s not clear to me how you would handle quantifiers this way, but set that aside. The more important problem is …
4. … How do you ever accomplish any nontrivial Bayesian updating under this model? For suppose that you learn somehow that A is true. Now, conditioned on A, what is the probability of B? Still 0.5. Even if you learn the truth value of every hypothesis except B, you still would assign probability 0.5 to B.
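To make problem 4 concrete, here’s the computation in the finite model with the uniform measure (the three-coordinate setup and the labels A, B, C are my own toy example):

```python
from itertools import product

# Uniform measure on bit strings of length 3; coordinates 0, 1, 2 stand in
# for hypotheses A, B, C.
omega = list(product([0, 1], repeat=3))
A, B, C = 0, 1, 2

def p_cond(target, evidence):
    """p(v[target] = 1, given that v[k] = val for each (k, val) in evidence)."""
    pool = [v for v in omega if all(v[k] == val for k, val in evidence)]
    return sum(v[target] for v in pool) / len(pool)

print(p_cond(B, []))                # prior: 0.5
print(p_cond(B, [(A, 1)]))          # after learning A: still 0.5
print(p_cond(B, [(A, 1), (C, 0)]))  # after learning everything but B: 0.5
```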
> More generally, one could imagine a probability distribution on the hypothesis space controlling the “weighting” of elements of Ω. For instance, if hypothesis #6 gets its probability raised, then those mappings v in Ω such that v(6) = 1 would be weighted more than those such that v(6) = 0. I haven’t checked that this type of arrangement is actually possible, but something like it ought to be.
Is this a description of what the prior distribution might be like? Or is it a description of what updating on the prior distribution might yield?
If you meant the former, wouldn’t you lose your justification for claiming that the prior probability of an unspecified hypothesis is exactly 0.5? Couldn’t it be the case that most hypotheses are true in most worlds (counted by weight), so that an unknown random hypothesis would be more likely to be true than not? (A sketch of this possibility appears after the next paragraph.)
If you meant the latter, I would like to see how this updating would work in more detail. I especially would like to see how Problem 4 above could be overcome.
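Here is the sketch promised above: if every coordinate is biased toward True (a bias of 0.9 everywhere, invented for the example), then a hypothesis picked uniformly at random is true with prior probability 0.9, not 0.5:

```python
from itertools import product
from math import prod

# Product measure with every coordinate biased toward 1.
n = 3
q = [0.9] * n
omega = list(product([0, 1], repeat=n))

def weight(v):
    return prod(q[k] if v[k] == 1 else 1 - q[k] for k in range(n))

# p(V = True) for an unknown, uniformly chosen coordinate: average, over
# worlds weighted by the measure, of the fraction of 1-coordinates.
print(sum(weight(v) * sum(v) / n for v in omega))  # ~0.9
```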