A statement, any statement, starts out with a 50% probability of being true, and then you adjust that percentage based on the evidence you come into contact with.
That’s wildly wrong. “50% probability” is what you assign if someone tells you, “One and only one of the statements X or Y is true, but I’m not going to give you the slightest hint as to what they mean” and it’s questionable whether you can even call that a statement, since you can’t say anything about its truth-conditions.
Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols, and that’s where the rough measure of prior probability starts—not at 50%. 50% is where you start if you start with 1 bit. If you start with 0 bits the problem is just underspecified.
Update a bit in this direction: That part where Rational Rian said “What the hell do you mean, it starts with 50% probability”, he was perfectly right. If you’re not confident of your ability to wield the math, don’t be so quick to distrust your intuitive side!
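A toy sketch of the bit-counting arithmetic gestured at above (an illustration added here, not anyone's actual formalism): under a crude complexity-based prior, a hypothesis that takes n bits to specify starts at roughly 2^-n, so a 1-bit hypothesis starts at 50% and longer hypotheses start much lower.

```python
# Toy illustration only: a crude complexity-based prior that gives a
# hypothesis specified by n bits a starting probability of about 2**-n.
def complexity_prior(n_bits):
    return 2.0 ** (-n_bits)

print(complexity_prior(1))   # 0.5    -- a 1-bit hypothesis starts at 50%
print(complexity_prior(10))  # ~0.001 -- a 10-bit hypothesis starts far lower
```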
Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols...If you start with 0 bits the problem is just underspecified.
What a perfect illustration of what I was talking about when I wrote:
Of course, we almost never reach this level of ignorance in practice, which makes this the type of abstract academic point that people all-too-characteristically have trouble with. The step of calculating the complexity of a hypothesis seems “automatic”, so much so that it’s easy to forget that there is a step there.
You can call 0 bits “underspecified” if you like, but the antilogarithm of 0 is still 1, and odds of 1 still correspond to 50% probability.
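A minimal sketch of the arithmetic being invoked (assuming evidence is measured as log-odds in bits): the antilogarithm of the accumulated bits gives odds, and odds convert to probability, so zero bits gives odds of 1 and a probability of 50%.

```python
# Sketch: evidence as log-odds in bits; 0 bits -> odds of 1 -> 50%.
def prob_from_evidence_bits(bits):
    odds = 2.0 ** bits          # antilogarithm of the evidence
    return odds / (1.0 + odds)

print(prob_from_evidence_bits(0))    # 0.5
print(prob_from_evidence_bits(3))    # 0.888...  (odds of 8:1 in favor)
print(prob_from_evidence_bits(-3))   # 0.111...  (odds of 8:1 against)
```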
Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the “50%” line on a non-initiate: even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false.
Nevertheless, that doesn’t change the fact that knowing the complexity of a statement is knowing something about the statement (and hence not being in total ignorance).
I still don’t think you’re saying something sophisticated and true. I think you’re saying something sophisticated and nonsensical. I think it’s meaningless to assign a probability to the assertion “understand up without any clams” because you can’t say what configurations of the universe would make it true or false, nor interpret it as a question about the logical validity of an implication. Assigning probabilities to A, B, C as in your linked writing strikes me as equally nonsensical. The part where you end up with a probability of 25% after doing an elaborate calculation based on having no idea what your symbols are talking about is not a feature, it is a bug. To convince me otherwise, explain how an AI that assigns probabilities to arbitrary labels about which it knows nothing will function in a superior fashion to an AI that only assigns probabilities to things for which it has nonzero notion of its truth condition.
“If you know nothing, 50% prior probability” still strikes me as just plain wrong.
“If you know nothing, 50% prior probability” still strikes me as just plain wrong.
That strikes me as even weirder and more wrong. So given a variable A which could be any possible variable, I should assign it… 75% and ~A 25%? Or 25%, and make ~A 75%? Or what? Isn’t 50% the only symmetrical answer?
Basically, given a single variable and its negation, isn’t 1⁄2 the max-entropy distribution, just as a collection of n variables has 1/n as the max-ent answer for them?
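A quick numerical check of the max-entropy point (a toy sketch, not from the thread): for a single binary variable, Shannon entropy peaks at p = 1/2, and for n mutually exclusive outcomes it peaks at the uniform 1/n assignment.

```python
import math

def entropy(dist):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Binary variable: entropy is maximized at p = 1/2.
print(max((entropy([p, 1 - p]), p) for p in [i / 100 for i in range(1, 100)]))
# -> (1.0, 0.5)

# Four mutually exclusive outcomes: uniform 1/4 beats any skewed assignment.
print(entropy([0.25] * 4))             # 2.0 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))   # ~1.36 bits
```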
Okay, I was among the first people here to call Zed’s statement plain wrong, but now that enough high-status individuals in the community are taking that same position, I think it would serve knowledge better if I explained in what slight sense his statement might not be completely wrong.
One would normally say that you calculate 3^4 by multiplying 3 four times: 3 * 3 * 3 * 3.

But someone like Zed would say: “No! Every exponential calculation starts out with the number 1. You ought to say 3^4 = 1 * 3 * 3 * 3 * 3.”

And most of us would then say: “What the hell sense does that make? What would it help an AI to begin by multiplying the number 1 with 3? You are not making sense.”

And then Zed would say: “But 0^0 = 1 -- and you can only see that if you include the number 1 in the sequence of numbers to multiply.”

And then we would say: “What does it even mean to raise zero to the zeroth power? That has no meaning.”

And we would be right in the sense that it has no meaning in the physical universe. But Zed would be right in the sense that he’s mathematically correct, it has mathematical meaning, and equations wouldn’t work without the fact that 0^0 = 1.
I think we can visualize the “starting probability of a proposition” as “50%” in the same way we can visualize the “starting multiplier” of an exponential calculation as “1”. This starting number really does NOT help a computer calculate anything. In fact it’s a waste of processor cycles for a computer to make that “1*3” calculation, instead of just using the number 3 as the first number.
But “1” can be considered to be the number that remains if all the multipliers are taken away one by one.
Likewise, imagine that we have used both several pieces of evidence and the complexity of a proposition to calculate its probability -- but then for some reason we have to start taking away those pieces of evidence (e.g. perhaps the AI has to calculate what probability a different AI would have calculated, using less evidence). As we take away more and more evidence, we’ll eventually end up heading back towards 50%, the same way that 0^0 = 1.

I feel compelled to point out that 0^0 is undefined, since the limit of x^0 at x=0 is 1 but the limit of 0^x at x=0 is 0.

Yes, in combinatorics assuming 0^0=1 is sensible since it simplifies a lot of formulas which would otherwise have to include special cases at 0.
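A small sketch of the analogy (added illustration, assuming nothing beyond the parent comment): exponentiation can be written as a fold that starts from the empty product 1, and updating can be written as a sum of log-odds that starts from the empty sum 0, which converts back to 50% once every piece of evidence is removed.

```python
from functools import reduce

def power(base, exponent):
    # Repeated multiplication, starting from the "empty product" 1.
    return reduce(lambda acc, _: acc * base, range(exponent), 1)

print(power(3, 4))   # 81
print(power(3, 0))   # 1 -- with every multiplier taken away, only the start remains

def prob_from_log_odds(evidence_bits):
    # Accumulate log-odds, starting from the "empty sum" 0.
    total = sum(evidence_bits, 0)
    odds = 2.0 ** total
    return odds / (1.0 + odds)

print(prob_from_log_odds([4, -1, 2]))   # ~0.97 with the evidence in place
print(prob_from_log_odds([]))           # 0.5 -- all the evidence taken away
```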
To convince me otherwise, explain how an AI that assigns probabilities to arbitrary labels about which it knows nothing will function in a superior fashion to an AI that only assigns probabilities to things for which it has nonzero notion of its truth condition.
If you’re thinking truly reductionistically about programming an AI, you’ll realize that “probability” is nothing more than a numerical measure of the amount of information the AI has. And when the AI counts the number of bits of information it has, it has to start at some number, and that number is zero.
The point is about the internal computations of the AI, not the output on the screen. The output on the screen may very well be “ERROR: SYNTAX” rather than “50%” for large classes of human inputs. The human inputs are not what I’m talking about when I refer to unspecified hypotheses like A, B, and C. I’m talking about when, deep within its inner workings, the AI is computing a certain number associated with a string of binary digits. And if the string is empty, the associated number is 0.
The translation of
-- “What is P(A), for totally unspecified hypothesis A?”
-- “50%.”
into AI-internal-speak is
-- “Okay, I’m about to feed you a binary string. What digits have I fed you so far?”
-- “Nothing yet.”
“If you know nothing, 50% prior probability” still strikes me as just plain wrong.
That’s because in almost all practical human uses, “know nothing” doesn’t actually mean “zero information content”.
If you’re thinking truly reductionistically about programming an AI, you’ll realize that “probability” is nothing more than a numerical measure of the amount of information the AI has.
And here I thought it was a numerical measure of how credible it is that the universe looks a particular way. “Probability” is what I plug into expected utility calculations. I didn’t realize that I ought to be weighing futures based on “the amount of information” I have about them, rather than how likely they are to come to pass.
A wise person once said (emphasis—and the letter c—added):

“Uncertainty exists in the map, not in the territory. In the real world, the coin has either come up heads, or come up tails. Any talk of ‘probability’ must refer to the information that I have about the coin—my state of partial ignorance and partial knowledge—not just the coin itself. Furthermore, I have all sorts of theorems showing that if I don’t treat my partial knowledge a certain way, I’ll make stupid bets. If I’ve got to plan, I’ll plan for a 50⁄50 state of uncertainty, where I don’t weigh outcomes conditional on heads any more heavily in my mind than outcomes conditional on tails. You can call that number whatever you like, but it has to obey the probability laws on pain of stupidity. So I don’t have the slightest hesitation about calling my outcome-weighting a probability.”
That’s all we’re talking about here. This is exactly like the biased coin where you don’t know what the bias is. All we know is that our hypothesis is either true or false. If that’s all we know, there’s no probability other than 50% that we can sensibly assign. (Maybe using fancy words like “maximum entropy” will help.)
I fully acknowledge that it’s a rare situation when that’s all we know. Usually, if we know enough to be able to state the hypothesis, we already have enough information to drive the probability away from 50%. I grant this. But 50% is still where the probability gets driven away from.
Denying this is tantamount to denying the existence of the number 0.
Let n be an integer. Knowing nothing else about n, would you assign 50% probability to n being odd? To n being positive? To n being greater than 3? You see how fast you get into trouble.
You need a prior distribution on n. Without a prior, these probabilities are not 50%. They are undefined.
The particular mathematical problem is that you can’t define a uniform distribution over an unbounded domain. This doesn’t apply to the biased coin: in that case, you know the bias is somewhere between 0 and 1, and for every distribution that favors heads, there’s one that favors tails, so you can actually perform the integration.
Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements. Like, if you took the first N Godel numbers, I’d expect more falsehoods than truths. Similarly for statements like “Obama is the 44th president”: so many ways to go wrong, just a few ways to go right.
Edit: that last paragraph isn’t right. For every true proposition, there’s a false one of equal complexity.
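For the biased-coin case mentioned above, the integration does go through; a quick check under the stated assumption of a uniform density over the unknown bias:

```python
from fractions import Fraction

# P(heads) = integral from 0 to 1 of t * p(t) dt, with p(t) = 1 (a uniform
# prior over the bias t).  A midpoint Riemann sum is exact for this linear
# integrand.
N = 1000
p_heads = sum(Fraction(2 * i + 1, 2 * N) * Fraction(1, N) for i in range(N))
print(p_heads)   # 1/2

# No analogous normalized uniform density exists over all the integers,
# which is why the questions about n above have no answer without a prior.
```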
Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements.
I’m pretty certain this intuition is false. It feels true because it’s much harder to come up with a true statement from N bits if you restrict yourself to positive claims about reality. If you get random statements like “the frooble fuzzes violently” they’re bound to be false, right? But for every nonsensical or false statement you also get the negation of a nonsensical or false statement: “not(the frooble fuzzes violently)”. It’s hard to arrive at a statement like “Obama is the 44th president” and be correct, but it’s very easy to enumerate a million things that do not orbit Pluto (and be correct).
(FYI: somewhere below there is a different discussion about whether there are more n-bit statements about reality that are false than true)
There’s a 1-to-1 correspondence between any true statement and its negation, and the sets aren’t overlapping, so there’s an equal number of true and false statements—and they can be coded in the identical amount of bits, as the interpreting machine can always be made to consider the negation of the statement you’ve written to it.
You just need to add the term ‘...NOT!’ at the end. As in “The Chudley Cannons are a great team… NOT!”
Or we may call it the “He loves me, he loves me not” principle.
Doesn’t it take more bits to specify NOT P than to specify P? I mean, I can take any proposition and add “..., and I like pudding”, but this doesn’t mean that half of all n-bit propositions are about me liking pudding.
Doesn’t it take more bits to specify NOT P than to specify P?
No. If “NOT P” took more bits to specify than “P”, this would also mean that “NOT NOT P” would take more bits to specify than “NOT P”. But NOT NOT P is identical to P, so it would mean that P takes more bits to specify than itself.
With actual propositions now, instead of letters:
If you have the proposition “The Moon is Earth’s satellite”, and the proposition “The Moon isn’t Earth’s satellite”, each is the negation of the other. If a proposition’s negation takes more bits to specify than the proposition, then you’re saying that each statement takes more bits to specify than the other.
Even simpler—can you think of any reason why it would necessarily take more bits to codify “x != 5” than “x == 5”?
We’re talking about minimum message length, and the minimum message of NOT NOT P is simply P.
Once you consider double negation, I don’t have any problem with saying that
“the Moon is Earth’s satellite”
is a simpler proposition than
“The following statement is false: the Moon is Earth’s satellite”
The abstract syntax tree for “x != 5” is bigger than the AST of “x == 5”. One of them uses numeric equality only, the other uses numeric equality and negation. I expect, though I haven’t verified, that the earliest, simplest compilers generated more processor instructions to compute “x != 5” than they did to compute “x == 5”.
Aris is right. NOT is just an operator that flips a bit. Take a single bit: 1. Now apply NOT. You get 0. Or you could have a bit that is 0. Now apply NOT. You get 1. Same number of bits. Truth tables for A and ~A are the same size.
We’re talking about minimum message length, and the minimum message of NOT NOT P is simply P.
That’s what I said. But you also said that NOT P takes more bits to specify than P. You can’t have it both ways.
You don’t understand this point. If I’ve already communicated P to you—do you need any further bits of info to calculate NOT P? No: Once you know P, NOT P is also perfectly well defined, which means that NOT P by necessity has the SAME message length as P.
The abstract syntax tree for “x != 5” is bigger than the AST of “x == 5”.
You aren’t talking about minimum message length anymore, you’re talking about human conventions. One might just as well reply that since “No” is a two-letter word, rejection takes fewer bits to encode than confirmation with “Yes”, which is a three-letter word.
I expect, though I haven’t verified, that the earliest, simplest compilers generated more processor instructions to compute “x != 5” than they did to compute “x == 5”.
If we have a computer that evaluates statements and returns 1 for true and 0 for false—we can just as well imagine that it returns 0 for true and 1 for false and calculates the negation of those statements. In fact you wouldn’t be able to KNOW whether the computer calculates the statements or their negation, which means when you’re inputting a statement, it’s the same as inputting its negation.
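A tiny sketch of that inversion argument (a hypothetical toy evaluator, not anyone's actual code): two machines differing only in output convention are indistinguishable from the outside, so handing a statement to one is the same as handing its negation to the other, and the NOT costs no extra message bits.

```python
# Toy evaluator over a tiny fixed "world"; statements are just Python
# boolean expressions over these named facts.
WORLD = {"moon_orbits_earth": True, "pluto_is_a_planet": False}

def eval_a(statement):
    # Convention A: returns 1 for true, 0 for false.
    return int(eval(statement, {"__builtins__": {}}, dict(WORLD)))

def eval_b(statement):
    # Convention B: the same machine read with the opposite convention,
    # i.e. it effectively evaluates the negation of whatever you feed it.
    return 1 - eval_a(statement)

s = "moon_orbits_earth and not pluto_is_a_planet"
print(eval_a(s), eval_b(s))   # 1 0 -- same input, mirrored conventions
```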
I think I get it. You need n bits of evidence to evaluate a statement whose MML is n bits long. Once you know the truth value of P, you don’t need any more evidence to compute NOT(P), so MML(P) has to equal MML(NOT(P)). In the real world we tend to care about true statements more than false statements, so human formalisms make it easier to talk about truths rather than falsehoods. But for every such formalism, there is an equivalent one that makes it easier to talk about false statements.
I think I had confused the statement of a problem with the amount of evidence needed to evaluate it. Thanks for the correction!
I read the rest of this discussion but did not understand the conclusion. Do you now think that the first N Godel numbers would be expected to have the same number of truths as falsehoods?
It turns out not to matter. Consider a formalism G’, identical to Godel numbering, but that reverses the sign, such that G(N) is true iff G’(N) is false. In the first N numbers in G+G’, there are an equal number of truths and falsehoods.
For every formalism that makes it easy to encode true statements, there’s an isomorphic one that does the same for false statements, and vice versa. This is why the set of statements of a given complexity can never be unbalanced.
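A quick numerical illustration of that pairing argument (using a made-up stand-in for a real Godel numbering): however biased the truth pattern of an enumeration G, pairing it with its sign-flipped twin G' balances truths and falsehoods exactly.

```python
import random

def G(n):
    # Stand-in for "the statement with index n is true" under some formalism;
    # deliberately biased towards "true" to show the pairing does not care.
    return random.Random(n).random() < 0.9

def G_prime(n):
    return not G(n)   # the mirrored formalism

N = 1000
combined = [G(n) for n in range(N)] + [G_prime(n) for n in range(N)]
print(sum(combined), 2 * N - sum(combined))   # 1000 1000
```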
Who said anything about not having a prior distribution? “Let n be a [randomly selected] integer” isn’t even a meaningful statement without one!

What gave you the impression that I thought probabilities could be assigned to non-hypotheses?

This is irrelevant: once you have made an observation like this, you are no longer in a state of total ignorance.

We agree that we can’t assign a probability to a property of a number without a prior distribution. And yet it seems like you’re saying that it is nonetheless correct to assign a probability of truth to a statement without a prior distribution, and that the probability is 50% true, 50% false.
Doesn’t the second statement follow from the first? Something like this:
For any P, a nontrivial predicate on integers, and an integer n, Pr(P(n)) is undefined without a distribution on n.
Define X(n), a predicate on integers, true if and only if the nth Godel number is true.
Pr(X(n)) is undefined without a distribution on n.
Integers and statements are isomorphic. If you’re saying that you can assign a probability to a statement without knowing anything about the statement, then you’re saying that you can assign a probability to a property of a number without knowing anything about the number.
We agree that we can’t assign a probability to a property of a number without a prior distribution. And yet it seems like you’re saying that it is nonetheless correct to assign a probability of truth to a statement without a prior distribution,
That is not what I claim. I take it for granted that all probability statements require a prior distribution. What I claim is that if the prior probability of a hypothesis evaluates to something other than 50%, then the prior distribution cannot be said to represent “total ignorance” of whether the hypothesis is true.
This is only important at the meta-level, where one is regarding the probability function as a variable—such as in the context of modeling logical uncertainty, for example. It allows one to regard “calculating the prior probability” as a special case of “updating on evidence”.
I think I see what you’re saying. You’re saying that if you do the math out, Pr(S) comes out to 0.5, just like 0! = 1 or a^0 = 1, even though the situation is rare where you’d actually want to calculate those things (permutations of zero elements or the empty product, respectively). Do I understand you, at least?
I expect Pr(S) to come out to be undefined, but I’ll work through it and see. Anyway, I’m not getting any karma for these comments, so I guess nobody wants to see them. I won’t fill the channel with any more noise.
I fully acknowledge that it’s a rare situation when that’s all we know.
When is this ever the situation?
Usually, if we know enough to be able to state the hypothesis, we already have enough information to drive the probability away from 50%. I grant this. But 50% is still where the probability gets driven away from.
Can you give an example of “driving the probability away from 50%”? I note that no one responded to my earlier request for such an example.
When is this ever the situation?...Can you give an example of “driving the probability away from 50%”? I note that no one responded to my earlier request for such an example.
No one can give an example because it is logically impossible for it to be the situation, it’s not just rare. It cannot be that “All we know is that our hypothesis is either true or false.” because to know that something is a hypothesis entails knowing more than nothing. It’s like saying “knowing that a statement is either false or a paradox, but having no information at all as to whether it is false or a paradox”.
-- “What is P(A), for totally unspecified hypothesis A?”
-- “50%.”
into AI-internal-speak is
-- “Okay, I’m about to feed you a binary string. What digits have I fed you so far?”
-- “Nothing yet.”
You seem to be using a translation scheme that I have not encountered before. You give one example of its operation, but that is not enough for me to distill the general rule. As with all translation schemes, it will be easier to see the pattern if we see how it works on several different examples.
So, with that in mind, suppose that the AI were asked the question
-- “What is P(A), for a hypothesis A whose first digit is 1, but which is otherwise totally unspecified?”
What should the AI’s answer be, prior to translation into “AI-internal-speak”?
Why does not knowing the hypothesis translate into assigning the hypothesis probability 0.5 ?
If this is the approach that you want to take, then surely the AI-internal-speak translation of “What is P(A), for totally unspecified hypothesis A?” would be “What proportion of binary strings encode true statements?”
ETA: On second thought, even that wouldn’t make sense, because the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map. Perhaps sense could be salvaged by passing to a meta-language. Then the AI could translate “What is P(A), for totally unspecified hypothesis A?” as “What is the expected value of the proportion of binary strings that encode true statements?”.
But really, the question “What is P(A), for totally unspecified hypothesis A?” just isn’t well-formed. For the AI to evaluate “P(A)”, the AI needs already to have been fed a symbol A in the domain of P.
Your AI-internal-speak version is a perfectly valid question to ask, but why do you consider it to be the translation of “What is P(A), for totally unspecified hypothesis A?” ?
Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the “50%” line on a non-initiate: even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false.
I don’t see how the claim is “sophisticated and true”. Let P and Q be statements. You cannot simultaneously assign 50% prior probability to each of the following three statements:
P
P & Q
P & ~Q
This remains true even if you don’t know the complexities of these statements.
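To spell out the arithmetic (a one-line check added for clarity): by additivity over a mutually exclusive split, P(P) = P(P & Q) + P(P & ~Q), so the three 50% assignments cannot coexist.

```python
from fractions import Fraction

half = Fraction(1, 2)
p_P_and_Q, p_P_and_notQ = half, half

# Additivity over the mutually exclusive, exhaustive split of P by Q:
p_P = p_P_and_Q + p_P_and_notQ
print(p_P)   # 1 -- not 1/2, so assigning 50% to all three is incoherent
```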
I think that either you are making a use-mention error, or you are confusing syntax with semantics.
Formally speaking, the expression “p(A)” makes sense only if A is a sentence in some formal system.
I can think of three ways to try to understand what’s going in your dialogue, but none leads to your conclusion. Let Alice and Bob be the first and second interlocutor, respectively. Let p be Bob’s probability function. My three interpretations of your dialogue are as follows:
1. Alice and Bob are using different formal systems. In this case, Bob cannot use Alice’s utterances; he can only mention them.

2. Alice and Bob are both using the same formal system, so that A, B, and C are sentences—e.g., atomic proposition letters—for both Alice and Bob.

3. Alice is talking about Bob’s formal system. She somehow knows that Bob’s model-theoretic interpretations of the sentences C and A&B are the same, even though [C = A&B] isn’t a theorem in Bob’s formal system. (So, in particular, Bob’s formal system is not complete.)
Under the first interpretation, Bob cannot evaluate expressions of the form “p(A)”, because “A” is not a sentence in his formal system. The closest he can come is to evaluate expressions like “p(Alice was thinking of a true proposition when she said ‘A’)”. If Bob attends to the use-mention distinction carefully, he cannot be trapped in the way that you portray. For, while C = A & B may be a theorem in Alice’s system,
(Alice was thinking of a true proposition when she said ‘C’) = (Alice was thinking of a true proposition when she said ‘A’) & (Alice was thinking of a true proposition when she said ‘B’)
is not (we may suppose) a theorem in Bob’s formal system. (If, by chance, it is a theorem in Bob’s formal system, then the essence of the remarks below apply.)
Now consider the second interpretation. Then, evidently, C = A & B is a theorem in Alice and Bob’s shared formal system. (Otherwise, Alice would not be in a position to assert that C = A & B.) But then p, by definition, will respect logical connectives so that, for example, if p(B & ~A) > 0, then p(C) < p(B). This is true even if Bob hasn’t yet worked out that C = A & B is in fact a consequence of his axioms. It just follows from the fact that p is a coherent probability function over propositions.
This means that, if the algorithm that determines how Bob answers a question like “What is p(A)?” is indeed an implementation of the probability function p, then he simply will not in all cases assert that p(A) = 0.5, p(B) = 0.5, and p(C) = 0.5.
Finally, under the third interpretation, Bob did not say that p(A|B) = 1 when he said that p(C)/ p(B) = 1, because A&B is not syntactically equivalent to C under Bob’s formal system. So again Alice’s trap fails to spring.
Well, we could also assume and specify additional things that would make “p(A)” make sense even if “A” is not a statement in some formal system. So I don’t see how your remark is meaningful.
Well, we could also assume and specify additional things that would make “p(A)” make sense even if “A” is not a statement in some formal system.
Do you mean, for example, that p could be a measure and A could be a set? Since komponisto was talking about expressions of the form p(A) such that A can appear in expressions like A&B, I understood the context to be one in which we were already considering p to be a function over sentences or propositions (which, following komponisto, I was equating), and not, for example, sets.
Do you mean that “p(A)” can make sense in some case where A is a sentence, but not a sentence in some formal system? If so, would you give an example? Do you mean, for example, that A could be a statement in some non-formal language like English?
In my own interpretation, A is a hypothesis -- something that represents a possible state of the world. Hypotheses are of course subject to Boolean algebra, so you could perhaps model them as sentences or sets.
You have made a number of interesting comments that will probably take me some time to respond to.
I’ve been trying to develop a formal understanding of your claim that the prior probability of an unknown arbitrary hypothesis A makes sense and should equal 0.5. I’m not there yet, but I have a couple of tentative approaches. I was wondering whether either one looks at all like what you are getting at.
The first approach is to let the sample space Ω be the set of all hypotheses, endowed with a suitable probability distribution p. It’s not clear to me what probability distribution p you would have in mind, though. Presumably it would be “uniform” in some appropriate sense, because we are supposed to start in a state of complete ignorance about the elements of Ω.
At any rate, you would then define the random variable v : Ω → {True, False} that returns the actual truth value of each hypothesis. The quantity “p(A), for arbitrary unknown A” would be interpreted to mean the value of p(v = True). One would then show that half of the hypotheses in Ω (with respect to p-measure) are true. That is, one would have p(v = True) = 0.5, yielding your claim.
I have two difficulties with this approach. First, as I mentioned, I don’t see how to define p. Second, as I mentioned in this comment, “the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map.” (ETA: I should emphasize that this second difficulty seems fatal to me. Defining p might just be a technicality. But making probability a property of the territory is fundamentally contrary to the Bayesian Way.)
The second approach tries to avoid that last difficulty by going “meta”. Under this approach, you would take the sample space Ω to be the set of logically consistent possible worlds. More precisely, Ω would be the set of all valuation maps v : {hypotheses} → {True, False} assigning a truth value to every hypothesis. (By calling a map v a “valuation map” here, I just mean that it respects the usual logical connectives and quantifiers. E.g., if v(A) = True and v(B) = True, then v(A & B) = True.) You would then endow Ω with some appropriate probability distribution p. However, again, I don’t yet see precisely what p should be.
Then, for each hypothesis A, you would have a random variable V_A : Ω → {True, False} that equals True on precisely those valuation maps v such that v(A) = True. The claim that “p(A) = 0.5 for arbitrary unknown A” would unpack as the claim that, for every hypothesis A, p(V_A = True) = 0.5 — that is, that each hypothesis A is true in exactly half of all possible worlds (with respect to p-measure).
Do either of these approaches look to you like they are on the right track?
ETA: Here’s a third approach which combines the previous two: When you’re asked “What’s p(A), where A is an arbitrary unknown hypothesis?”, and you are still in a state of complete ignorance, then you know neither the world you’re in, nor the hypothesis A whose truth in that world you are being asked to consider. So, let the sample space Ω be the set of ordered pairs (v, A), where v is a valuation map and A is a hypothesis. You endow Ω with some appropriate probability distribution p, and you have a random variable V : Ω → {True, False} that maps (v, A) to True precisely when v(A) = True — i.e., when A is true under v. You give the response “0.5″ to the question because (we suppose) p(V = True) = 0.5.
But I still don’t see how to define p. Is there a well-known and widely-agreed-upon definition for p? On the one hand, p is a probability distribution over a countably infinite set (assuming that we identify the set of hypotheses with the set of sentences in some formal language). [ETA: That was a mistake. The sample space is countable in the first of the approaches above, but there might be uncountably many logically consistent ways to assign truth values to hypotheses.] On the other hand, it seems intuitively like p should be “uniform” in some sense, to capture the condition that we start in a state of total ignorance. How can these conditions be met simultaneously?
I think the second approach (and possibly the third also, but I haven’t yet considered it as deeply) is close to the right idea.
It’s pretty easy to see how it would work if there are only a finite number of hypotheses, say n: in that case, Ω is basically just the collection of binary strings of length n (assuming the hypothesis space is carved up appropriately), and each map V_A is evaluation at a particular coordinate. Sure enough, at each coordinate, half the elements of Ω evaluate to 1, and half to 0 !
More generally, one could imagine a probability distribution on the hypothesis space controlling the “weighting” of elements of Ω. For instance, if hypothesis #6 gets its probability raised, then those mappings v in Ω such that v(6) = 1 would be weighted more than those such that v(6) = 0. I haven’t checked that this type of arrangement is actually possible, but something like it ought to be.
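A quick check of the finite case described above (a toy enumeration): with Ω the set of all length-n bit strings and V_A evaluation at coordinate A, exactly half the worlds make each coordinate true.

```python
from itertools import product

n = 4
omega = list(product([0, 1], repeat=n))   # all 2**n "possible worlds"

for coord in range(n):
    true_count = sum(world[coord] for world in omega)
    print(coord, true_count, len(omega) - true_count)   # 8 vs 8 every time
```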
It’s pretty easy to see how it would work if there are only a finite number of hypotheses, say n: in that case, Ω is basically just the collection of binary strings of length n (assuming the hypothesis space is carved up appropriately), and each map V_A is evaluation at a particular coordinate. Sure enough, at each coordinate, half the elements of Ω evaluate to 1, and half to 0 !
Here are a few problems that I have with this approach:
1. This approach makes your focus on the case where the hypothesis A is “unspecified” seem very mysterious. Under this model, we have P(V_A = True) = 0.5 even for a hypothesis A that is entirely specified, down to its last bit. So why all the talk about how a true prior probability for A needs to be based on complete ignorance even of the content of A? Under this model, even if you grant complete knowledge of A, you’re still assigning it a prior probability of 0.5. Much of the push-back you got seemed to be around the meaningfulness of assigning a probability to an unspecified hypothesis. But you could have sidestepped that issue and still established the claim in the OP under this model, because here the claim is true even of specified hypotheses. (However, you would still need to justify that this model is how we ought to think about Bayesian updating. My remaining concerns address this.)

2. By having Ω be the collection of all bit strings of length n, you’ve dropped the condition that the maps v respect logical operations. This is equivalent to dropping the requirement that the possible worlds be logically possible. E.g., your sample space would include maps v such that v(A) = v(~A) for some hypothesis A. But, maybe you figure that this is a feature, not a bug, because knowledge about logical consistency is something that the agent shouldn’t yet have in its prior state of complete ignorance. But then …

3. … If the agent starts out as logically ignorant, how can it work with only a finite number of hypotheses? It doesn’t start out knowing that A, A&A, A&A&A, etc., can all be collapsed down to just A, and that’s infinitely many hypotheses right there. But maybe you mean for the n hypotheses to be “atomic” propositions, each represented by a distinct proposition letter A, B, C, …, with no logical dependencies among them, and all other hypotheses built up out of these “atoms” with logical connectives. It’s not clear to me how you would handle quantifiers this way, but set that aside. The more important problem is …

4. … How do you ever accomplish any nontrivial Bayesian updating under this model? For suppose that you learn somehow that A is true. Now, conditioned on A, what is the probability of B? Still 0.5. Even if you learn the truth value of every hypothesis except B, you still would assign probability 0.5 to B. (See the toy check after this comment.)
More generally, one could imagine a probability distribution on the hypothesis space controlling the “weighting” of elements of Ω. For instance, if hypothesis #6 gets its probability raised, then those mappings v in Ω such that v(6) = 1 would be weighted more than those such that v(6) = 0. I haven’t checked that this type of arrangement is actually possible, but something like it ought to be.
Is this a description of what the prior distribution might be like? Or is it a description of what updating on the prior distribution might yield?
If you meant the former, wouldn’t you lose your justification for claiming that the prior probability of an unspecified hypothesis is exactly 0.5? For, couldn’t it be the case that most hypotheses are true in most worlds (counted by weight), so that an unknown random hypothesis would be more likely to be true than not?
If you meant the latter, I would like to see how this updating would work in more detail. I especially would like to see how Problem 4 above could be overcome.
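And a toy check of Problem 4 above, under the same hypothetical uniform-over-bitstrings model: conditioning on the truth values of the other hypotheses leaves the remaining one at exactly 0.5.

```python
from itertools import product

n = 3
omega = list(product([0, 1], repeat=n))   # uniform over all length-3 worlds

# Learn the truth values of hypotheses 0 and 1, then condition on them.
evidence = {0: 1, 1: 0}
consistent = [w for w in omega if all(w[i] == v for i, v in evidence.items())]
p_B = sum(w[2] for w in consistent) / len(consistent)
print(p_B)   # 0.5 -- the evidence never touches hypothesis 2
```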
Knowing that a statement is a proposition is far from being in total ignorance.
Writing about propositions using the word “statements” and then correcting people who say you are wrong based on true things they say about actual statements would be annoying. Please make it clear you aren’t doing that.
Neither the grandparent nor (so far as I can tell) the great-grandparent makes the distinction between “statements” and “propositions” that you have drawn elsewhere.
I used the term “statement” because that was what was used in the great-grandparent (just as I used it in my other comment because it was used in the post). Feel free to mentally substitute “proposition” if that is what you prefer.
My previous comment should have sufficed to communicate to you that I do not regard the distinction you are making as relevant to the present discussion. It should be amply clear by this point that I am exclusively concerned with things-that-must-be-either-true-or-false, and that calling attention to a separate class of utterances that do not have truth-values (and therefore do not have probabilities assigned to them) is not an interesting thing to do in this context. Downvoted for failure to take a hint.
It should be amply clear by this point that I am exclusively concerned with things-that-must-be-either-true-or-false
Then when Eliezer says:
Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols, and that’s where the rough measure of prior probability starts—not at 50%.
you shouldn’t say:
Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the “50%” line on a non-initiate: even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false.
as a response to Eliezer making true statements about statements and not playing along with OP’s possible special definition of “statement”. If Eliezer had interpreted “statement” as “proposition” when he read it, he might have been unreasonable in inferring there was an error; but he didn’t, so he wasn’t. So you shouldn’t implicitly call him out as having made a simple error.
As far as “even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false” goes, no one can read minds. It is possible the OP meant to convey actual perfect understanding with the inaccurate language he used. Likewise for “Not just low-status like believing in a deity, but majorly low status,” assuming idiosyncratic enough meaning.
calling attention to a separate class of utterances that do not have truth-values (and therefore do not have probabilities assigned to them) is not an interesting thing to do in this context.
I’m calling attention to a class that contains all propositions as well as other things. Probabilities may be assigned to statements being true even if they are actually false, or neither true nor false. If a statement is specified to be a proposition, you have information such that a bare 50% won’t do.
Where did you get the idea that “statement” in Eliezer’s comment is to be understood in your idiosyncratic sense of “utterances that may or may not be ‘propositions’”? Not only do I dispute this, I explicitly did so earlier when I wrote (emphasis added):
Neither the grandparent nor (so far as I can tell) the great-grandparent makes the distinction between “statements” and “propositions” that you have drawn elsewhere.
Indeed, it is manifestly clear from this sentence in his comment:
it’s questionable whether you can even call that a statement, since you can’t say anything about its truth-conditions.
that Eliezer means by “statement” what you have insisted on calling a “proposition”: something with truth-conditions, i.e. which is capable of assuming a truth-value. I, in turn, simply followed this usage in my reply. I have never had the slightest interest in entering a sub-discussion about whether this is a good choice of terminology. Furthermore, I deny the following:
Probabilities may be assigned to [statements/propositions/what-the-heck-ever] being true even if they are...neither true nor false.
and, indeed, regard the falsity of that claim as a basic background assumption upon which my entire discussion was premised.
Perhaps it would make things clearer if the linguistic terminology (“statement”, “proposition”, etc) were abandoned altogether (being really inappropriate to begin with), in favor of the term “hypothesis”. I can then state my position in (hopefully) unambiguous terms: all hypotheses are either true or false (otherwise they are not hypotheses), hypotheses are the only entities to which probabilities may be assigned, and a Bayesian with literally zero information about whether a hypothesis is true or false must assign it a probability of 50% -- the last point being an abstract technicality that seldom if ever needs to be mentioned explicitly, lest it cause confusion of the sort we have been seeing here (so that Bayesian Bob indeed made a mistake by saying it, although I am impressed with Zed for having him say it).
and a Bayesian with literally zero information about whether a hypothesis is true or false must assign it a probability of 50%
You can state it better like this: “A Bayesian with literally zero information about the hypothesis.”
“Zero information about whether a hypothesis is true or false” implies that we know the hypothesis, and we just don’t know whether it’s a member in the set of true propositions.
“Zero information about the hypothesis” indicates what you really seem to want to say—that we don’t know anything about this hypothesis; not its content, not its length, not even who made the hypothesis, or how it came to our attention.
In one respect, I don’t see how this can make sense. If we don’t know exactly how it came to our attention, we know that it didn’t come to our attention in a way that stuck with us, so that is itself some information about how it came to our attention: we know some ways that it didn’t.
You’re thinking of human minds. But perhaps we’re talking about a computer that knows it’s trying to determine the truth-value of a proposition, but the history of how the proposition got inputted into it got deleted from its memory; or perhaps it was designed to never hold that history in the first place.
the history of how the proposition got inputted into it got deleted from its memory
So it knows that whoever gave it the proposition didn’t have the power, desire, or competence to tell it how it got the proposition.
It knows the proposition is not from a mind that is meticulous about making sure those to whom it gives propositions know where the propositions are from.
If the computer doesn’t know that it doesn’t know how it learned of something, and can’t know that, I’m not sure it counts as a general intelligence.
Indeed, it is manifestly clear from this sentence in his comment:
What odds does “manifestly clear” imply when you say it? I believe he was referring to either X or Y, as otherwise the content of the statement containing “one and only one...X or Y” would be a confusing... coincidence is the best word I can think of. So I think it most likely that “call that a statement” is a very poorly worded phrase referring to statement X and statement Y separately but simultaneously.
In general, there is a problem with prescribing taboo when one of the two parties is claiming a third party is wrong.
I am impressed by your patience in light of my comments. I think it not terribly unlikely that in this argument I am the equivalent of Jordan Leopold or Ray Fittipaldo (not an expert!), while you are Andy Sutton.
But I still don’t think that’s probable, and think it is easy to see that you have cheated at rationalist’s taboo as one term is replacing the excluded ones, a sure sign that mere label swapping has taken place.
I still think that if I only know that something is a hypothesis and know nothing more, I have enough knowledge to examine how I know that and use an estimate of the hypothesis’ bits that is superior to a raw 0%. I don’t think “a Bayesian with literally zero information about whether a hypothesis is true or false” is a meaningful sentence. You know it’s a hypothesis because you have information. Granted, the final probability you estimate could be 50⁄50.
That’s wildly wrong. “50% probability” is what you assign if someone tells you, “One and only one of the statements X or Y is true, but I’m not going to give you the slightest hint as to what they mean” and it’s questionable whether you can even call that a statement, since you can’t say anything about its truth-conditions.
Any statement for which you have the faintest idea of its truth conditions will be specified in sufficient detail that you can count the bits, or count the symbols, and that’s where the rough measure of prior probability starts—not at 50%. 50% is where you start if you start with 1 bit. If you start with 0 bits the problem is just underspecified.
Update a bit in this direction: That part where Rational Rian said “What the hell do you mean, it starts with 50% probability”, he was perfectly right. If you’re not confident of your ability to wield the math, don’t be so quick to distrust your intuitive side!
What a perfect illustration of what I was talking about when I wrote:
You can call 0 bits “underspecifed” if you like, but the antilogarithm of 0 is still 1, and odds of 1 still corresponds to 50% probability.
Given your preceding comment, I realize you have a high prior on people making simple errors. And, at the very least, this is a perfect illustration of why never to use the “50%” line on a non-initiate: even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false.
Nevertheless, that doesn’t change the fact that knowing the complexity of a statement is knowing something about the statement (and hence not being in total ignorance).
I still don’t think you’re saying something sophisticated and true. I think you’re saying something sophisticated and nonsensical. I think it’s meaningless to assign a probability to the assertion “understand up without any clams” because you can’t say what configurations of the universe would make it true or false, nor interpret it as a question about the logical validity of an implication. Assigning probabilities to A, B, C as in your linked writing strikes me as equally nonsensical. The part where you end up with a probability of 25% after doing an elaborate calculation based on having no idea what your symbols are talking about is not a feature, it is a bug. To convince me otherwise, explain how an AI that assigns probabilities to arbitrary labels about which it knows nothing will function in a superior fashion to an AI that only assigns probabilities to things for which it has nonzero notion of its truth condition.
“If you know nothing, 50% prior probability” still strikes me as just plain wrong.
That strikes me as even weirder and wrong. So given a variable A which could be every possible variable, I should assign it… 75% and ~A 25%? or 25%, and make ~A 75%? Or what? - Isn’t 50% the only symmetrical answer?
Basically, given a single variable and its negation, isn’t 1⁄2 the max-entropy distribution, just as a collection of n variables has 1/n as the max-ent answer for them?
Okay, I was among the first people here who called Zed’s statement plain wrong, but I now think that there are enough high-status individuals of the community that are taking that same position, that it would serve knowledge more if I explained a bit in what slight sense his statement might not be completely wrong.
One would normally say that you calculate 3^4 by multiplying 3 four times: 3 3 3 3
But someone like Zed would say: “No! Every exponential calculation starts out with the number 1. You ought say 3 ^ 4 =1 3 3 3 * 3”.
And most of us would then say: “What the hell sense does that make? What would it help an AI to begin by multiplying the number 1 with 3? You are not making sense.”
And then Zed would say “But 0^0 = 1 -- and you can only see that if you add the number 1 in the sequence of the numbers to multiply.”
And then we would say “What does it even mean to raise zero in the zeroth power? That has no meaning.”
And we would be right in the sense it has no meaning in the physical universe. But Zed would be right in the sense he’s mathematically correct, and it has mathematical meaning, and equations wouldn’t work without the fact of 0^0=1.
I think we can visualize the “starting probability of a proposition” as “50%” in the same way we can visualize the “starting multiplier” of an exponential calculation as “1″. This starting number really does NOT help a computer calculate anything. In fact it’s a waste of processor cycles for a computer to make that “1*3” calcullation, instead of just using the number 3 as the first number to use.
But “1” can be considered to be the number that remains if all the multipliers are taken away one by one.
Likewise, imagine that we have used both several pieces of evidence and the complexity of a proposition to calculate its probability -- but then for some reason we have to start taking away these evidence -- (e.g. perhaps the AI has to calculate what probability a different AI would have calculated, using less evidence). As we take away more and more evidence, we’ll eventually end up reaching towards 50%, same way that 0^0=1.
I feel compelled to point out that 0^0 is undefined, since the limit of x^0 at x=0 is 1 but the limit of 0^x at x=0 is 0.
Yes, in combinatorics assuming 0^0=1 is sensible since it simplifies a lot of formulas which would otherwise have to include special cases at 0.
If you’re thinking truly reductionistically about programming an AI, you’ll realize that “probability” is nothing more than a numerical measure of the amount of information the AI has. And when the AI counts the number of bits of information it has, it has to start at some number, and that number is zero.
The point is about the internal computations of the AI, not the output on the screen. The output on the screen may very well be “ERROR: SYNTAX” rather than “50%” for large classes of human inputs. The human inputs are not what I’m talking about when I refer to unspecified hypotheses like A,B, and C. I’m talking about when, deep within its inner workings, the AI is computing a certain number associated with a string of binary digits. And if the string is empty, the associated number is 0.
The translation of
-- “What is P(A), for totally unspecified hypothesis A?”
-- “50%.”
into AI-internal-speak is
-- “Okay, I’m about to feed you a binary string. What digits have I fed you so far?”
-- “Nothing yet.”
That’s because in almost all practical human uses, “know nothing” doesn’t actually mean “zero information content”.
And here I thought it was a numerical measure of how credible it is that the universe looks a particular way. “Probability” is what I plug into expected utility calculations. I didn’t realize that I ought to be weighing futures based on “the amount of information” I have about them, rather than how likely they are to come to pass.
A wise person once said (emphasis—and the letter c—added):
That’s all we’re talking about here. This is exactly like the biased coin where you don’t know what the bias is. All we know is that our hypothesis is either true or false. If that’s all we know, there’s no probability other than 50% that we can sensibly assign. (Maybe using fancy words like “maximum entropy” will help.)
I fully acknowledge that it’s a rare situation when that’s all we know. Usually, if we know enough to be able to state the hypothesis, we already have enough information to drive the probability away from 50%. I grant this. But 50% is still where the probability gets driven away from.
Denying this is tantamount to denying the existence of the number 0.
Let n be an integer. Knowing nothing else about n, would you assign 50% probability to n being odd? To n being positive? To n being greater than 3? You see how fast you get into trouble.
You need a prior distribution on n. Without a prior, these probabilities are not 50%. They are undefined.
The particular mathematical problem is that you can’t define a uniform distribution over an unbounded domain. This doesn’t apply to the biased coin: in that case, you know the bias is somewhere between 0 and 1, and for every distribution that favors heads, there’s one that favors tails, so you can actually perform the integration.
Finally, on an empirical level, it seems like there are more false n-bit statements than true n-bit statements. Like, if you took the first N Godel numbers, I’d expect more falsehoods than truths. Similarly for statements like “Obama is the 44th president”: so many ways to go wrong, just a few ways to go right.
Edit: that last paragraph isn’t right. For every true proposition, there’s a false one of equal complexity.
I’m pretty certain this intuition is false. It feels true because it’s much harder to come up with a true statement from N bits if you restrict yourself to positive claims about reality. If you get random statements like “the frooble fuzzes violently” they’re bound to be false, right? But for every nonsensical or false statement you also get the negation of a nonsensical or false statement. “not( the frooble fuzzes violiently)”. It’s hard to arrive at a statement like “Obama is the 44th president” and be correct, but it’s very easy to enumerate a million things that do not orbit Pluto (and be correct).
(FYI: somewhere below there is a different discussion about whether there are more n-bit statements about reality that are false than true)
There’s a 1-to-1 correspondence between any true statement and its negation, and the sets aren’t overlapping, so there’s an equal number of true and false statements—and they can be coded in the identical amount of bits, as the interpreting machine can always be made to consider the negation of the statement you’ve written to it.
You just need to add the term ‘...NOT!’ at the end. As in ’The Chudley Cannons are a great team… NOT!”
Or we may call it the “He loves me, he loves me not” principle.
Doesn’t it take more bits to specify NOT P than to specify P? I mean, I can take any proposition and add ”..., and I like pudding” but this doesn’t mean that half of all n-bit propositions are about me liking pudding.
No. If “NOT P” took more bits to specify than “P”, this would also mean that “NOT NOT P” would take more bits to specify than “NOT P”. But NOT NOT P is identical to P, so it would mean that P takes more bits to specify than itself.
With actual propositions now, instead of letters:
If you have the proposition “The Moon is Earth’s satellite”, and the proposition “The Moon isn’t Earth’s satellite”, each is the negation of the other. If a proposition’s negation takes more bits to specify than the proposition, then you’re saying that each statement takes more bits to specify than the other.
Even simpler—can you think any reason why it would necessarily take more bits to codify “x != 5” than “x == 5″?
We’re talking about minimum message length, and the minimum message of NOT NOT P is simply P.
Once you consider double negation, I don’t have any problem with saying that
“the Moon is Earth’s satellite”
is a simpler proposition than
“The following statement is false: the Moon is Earth’s satellite”
The abstract syntax tree for “x != 5” is bigger than the AST of “x == 5“. One of them uses numeric equality only, the other uses numeric equality and negation. I expect, though I haven’t verified, that the earliest, simplest compilers generated more processor instructions to compute “x != 5” than they did to compute “x == 5”
Aris is right. NOT is just an operator that flips a bit. Take a single bit: 1. Now apply NOT. You get 0. Or you could have a bit that is 0. Now apply NOT. You get 1. Same number of bits. Truth tables for A and ~A are the same size.
That’s what i said. But you also said that NOT P takes more bits to specify than P. You can’t have it both ways.
You don’t understand this point. If I’ve already communicated P to you—do you need any further bits of info to calculate NOT P? No: Once you know P, NOT P is also perfectly well defined, which means that NOT P by necessity has the SAME message length as P.
You aren’t talking about minimum message length anymore, you’re talking about human conventions. One might just as well reply that since “No” is a two-letter word that means rejection takes less bits to encode than the confirmation of “Yes” which is a three-letter word.
If we have a computer that evaluates statements and returns 1 for true and 0 for false—we can just as well imagine that it returns 0 for true and 1 for false and calculates the negation of those statements. In fact you wouldn’t be able to KNOW whether the computer calculates the statements or their negation, which means when you’re inputting a statement, it’s the same as inputting its negation.
I think I get it. You need n bits of evidence to evaluate a statement whose MML is n bits long. Once you know the truth value of P, you don’t need any more evidence to compute NOT(P), so MML(P) has to equal MML(NOT(P)). In the real world we tend to care about true statements more than false statements, so human formalisms make it easier to talk about truths rather than falsehoods. But for every such formalism, there is an equivalent one that makes it easier to talk about false statements.
I think I had confused the statement of a problem with the amount of evidence needed to evaluate it. Thanks for the correction!
A big thumbs up for you, and you’re very welcome! :-)
I read the rest of this discussion but did not understand the conclusion. Do you now think that the first N Godel numbers would be expected to have the same number of truths as falsehoods?
It turns out not to matter. Consider a formalism G’, identical to Godel numbering, but that reverses the sign, such that G(N) is true iff G’(N) is false. In the first N numbers in G+G’, there are an equal number of truths and falsehoods.
For every formalism that makes it easy to encode true statements, there’s an isomorphic one that does the same for false statements, and vice versa. This is why the set of statements of a given complexity can never be unbalanced.
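Here is a sketch of that pairing (g below is just a stand-in for “the n-th statement of G is true”; any 0/1-valued function gives the same result): combining a formalism with its sign-reversed twin always yields exactly as many truths as falsehoods among the first N numbers.

```python
def g(n):
    """Stand-in for: 'the n-th statement of formalism G is true'."""
    return n % 3 == 0        # an arbitrary placeholder truth pattern

def g_prime(n):
    """The sign-reversed formalism G': true exactly where G is false."""
    return not g(n)

N = 1000
truths = sum(g(n) for n in range(N)) + sum(g_prime(n) for n in range(N))
falsehoods = 2 * N - truths
assert truths == falsehoods == N     # balanced, whatever pattern g has
```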
Gotcha, thanks.
Who said anything about not having a prior distribution? “Let n be a [randomly selected] integer” isn’t even a meaningful statement without one!
What gave you the impression that I thought probabilities could be assigned to non-hypotheses?
This is irrelevant: once you have made an observation like this, you are no longer in a state of total ignorance.
We agree that we can’t assign a probability to a property of a number without a prior distribution. And yet it seems like you’re saying that it is nonetheless correct to assign a probability of truth to a statement without a prior distribution, and that the probability is 50% true, 50% false.
Doesn’t the second statement follow from the first? Something like this:
For any P, a nontrivial predicate on integers, and an integer n, Pr(P(n)) is undefined without a distribution on n.
Define X(n), a predicate on integers, true if and only if the nth Godel number is true.
Pr(X(n)) is undefined without a distribution on n.
Integers and statements are isomorphic. If you’re saying that you can assign a probability to a statement without knowing anything about the statement, then you’re saying that you can assign a probability to a property of a number without knowing anything about the number.
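To illustrate the first step with numbers (the predicate and the two priors below are arbitrary choices, not anything canonical): Pr(P(n)) comes out different under different distributions on n, which is why the question has no answer until a distribution is supplied.

```python
def P(n):
    return n % 3 == 0                     # an arbitrary nontrivial predicate

def pr(P, prior):
    """Pr(P(n)) under an explicit prior given as {n: probability}."""
    return sum(weight for n, weight in prior.items() if P(n))

uniform_1_to_6 = {n: 1/6 for n in range(1, 7)}
skewed_prior   = {3: 0.7, 1: 0.1, 2: 0.1, 4: 0.1}

print(pr(P, uniform_1_to_6))   # 2/6, about 0.33
print(pr(P, skewed_prior))     # 0.7
```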
That is not what I claim. I take it for granted that all probability statements require a prior distribution. What I claim is that if the prior probability of a hypothesis evaluates to something other than 50%, then the prior distribution cannot be said to represent “total ignorance” of whether the hypothesis is true.
This is only important at the meta-level, where one is regarding the probability function as a variable—such as in the context of modeling logical uncertainty, for example. It allows one to regard “calculating the prior probability” as a special case of “updating on evidence”.
I think I see what you’re saying. You’re saying that if you do the math out, Pr(S) comes out to 0.5, just like 0! = 1 or a^0 = 1, even though situations where you’d actually want to calculate those things (permutations of zero elements or the empty product, respectively) are rare. Do I understand you, at least?
I expect Pr(S) to come out to be undefined, but I’ll work through it and see. Anyway, I’m not getting any karma for these comments, so I guess nobody wants to see them. I won’t fill the channel with any more noise.
[ replied to the wrong person ]
When is this ever the situation?
Can you give an example of “driving the probability away from 50%”? I note that no one responded to my earlier request for such an example.
No one can give an example because it is not just rare but logically impossible for this to be the situation. It cannot be that “all we know is that our hypothesis is either true or false”, because to know that something is a hypothesis entails knowing more than nothing. It’s like saying “knowing that a statement is either false or a paradox, but having no information at all as to whether it is false or a paradox”.
You seem to be using a translation scheme that I have not encountered before. You give one example of its operation, but that is not enough for me to distill the general rule. As with all translation schemes, it will be easier to see the pattern if we see how it works on several different examples.
So, with that in mind, suppose that the AI were asked the question
-- “What is P(A), for a hypothesis A whose first digit is 1, but which is otherwise totally unspecified?”
What should the AI’s answer be, prior to translation into “AI-internal-speak”?
Why does not knowing the hypothesis translate into assigning the hypothesis probability 0.5 ?
If this is the approach that you want to take, then surely the AI-internal-speak translation of “What is P(A), for totally unspecified hypothesis A?” would be “What proportion of binary strings encode true statements?”
ETA: On second thought, even that wouldn’t make sense, because the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map. Perhaps sense could be salvaged by passing to a meta-language. Then the AI could translate “What is P(A), for totally unspecified hypothesis A?” as “What is the expected value of the proportion of binary strings that encode true statements?”.
But really, the question “What is P(A), for totally unspecified hypothesis A?” just isn’t well-formed. For the AI to evaluate “P(A)”, the AI needs already to have been fed a symbol A in the domain of P.
Your AI-internal-speak version is a perfectly valid question to ask, but why do you consider it to be the translation of “What is P(A), for totally unspecified hypothesis A?” ?
I don’t see how the claim is “sophisticated and true”. Let P and Q be statements. You cannot simultaneously assign 50% prior probability to each of the following three statements:
P
P & Q
P & ~Q
This remains true even if you don’t know the complexities of these statements.
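A quick arithmetic check of why those three assignments cannot coexist (this is just additivity over the four P/Q worlds, with no further assumptions): P is the disjoint union of P&Q and P&~Q, so p(P) = p(P&Q) + p(P&~Q), and assigning 0.5 to both conjuncts would force p(P) = 1, not 0.5.

```python
from itertools import product

def coherent_probs(world_probs):
    """Given probabilities for the four (P, Q) worlds, return p(P), p(P&Q), p(P&~Q)."""
    p = lambda event: sum(pr for world, pr in world_probs.items() if event(world))
    return p(lambda w: w[0]), p(lambda w: w[0] and w[1]), p(lambda w: w[0] and not w[1])

worlds = list(product([True, False], repeat=2))
uniform = {w: 0.25 for w in worlds}

p_P, p_PQ, p_PnotQ = coherent_probs(uniform)
assert abs(p_P - (p_PQ + p_PnotQ)) < 1e-12   # additivity: p(P) = p(P&Q) + p(P&~Q)
print(p_P, p_PQ, p_PnotQ)                    # 0.5 0.25 0.25 (cannot be 0.5, 0.5, 0.5)
```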
See here.
I think that either you are making a use-mention error, or you are confusing syntax with semantics.
Formally speaking, the expression “p(A)” makes sense only if A is a sentence in some formal system.
I can think of three ways to try to understand what’s going in your dialogue, but none leads to your conclusion. Let Alice and Bob be the first and second interlocutor, respectively. Let p be Bob’s probability function. My three interpretations of your dialogue are as follows:
1. Alice and Bob are using different formal systems. In this case, Bob cannot use Alice’s utterances; he can only mention them.
2. Alice and Bob are both using the same formal system, so that A, B, and C are sentences (e.g., atomic proposition letters) for both Alice and Bob.
3. Alice is talking about Bob’s formal system. She somehow knows that Bob’s model-theoretic interpretations of the sentences C and A&B are the same, even though [C = A&B] isn’t a theorem in Bob’s formal system. (So, in particular, Bob’s formal system is not complete.)
Under the first interpretation, Bob cannot evaluate expressions of the form “p(A)”, because “A” is not a sentence in his formal system. The closest he can come is to evaluate expressions like “p(Alice was thinking of a true proposition when she said ‘A’)”. If Bob attends to the use-mention distinction carefully, he cannot be trapped in the way that you portray. For, while C = A & B may be a theorem in Alice’s system,
(Alice was thinking of a true proposition when she said ‘C’) = (Alice was thinking of a true proposition when she said ‘A’) & (Alice was thinking of a true proposition when she said ‘B’)
is not (we may suppose) a theorem in Bob’s formal system. (If, by chance, it is a theorem in Bob’s formal system, then the essence of the remarks below apply.)
Now consider the second interpretation. Then, evidently, C = A & B is a theorem in Alice and Bob’s shared formal system. (Otherwise, Alice would not be in a position to assert that C = A & B.) But then p, by definition, will respect logical connectives so that, for example, if p(A & ~B) > 0, then p(C) < p(A). This is true even if Bob hasn’t yet worked out that C = A & B is in fact a consequence of his axioms. It just follows from the fact that p is a coherent probability function over propositions.
This means that, if the algorithm that determines how Bob answers a question like “What is p(A)?” is indeed an implementation of the probability function p, then he simply will not in all cases assert that p(A) = 0.5, p(B) = 0.5, and p(C) = 0.5.
Finally, under the third interpretation, Bob did not say that p(A|B) = 1 when he said that p(C)/p(B) = 1, because A&B is not syntactically equivalent to C under Bob’s formal system. So again Alice’s trap fails to spring.
How does it make sense then? Quite a bit more would need to be assumed and specified.
Hence the “only if”. I am stating a necessary, but not sufficient, condition. Or do I miss your point?
Well, we could also assume and specify additional things that would make “p(A)” make sense even if “A” is not a statement in some formal system. So I don’t see how your remark is meaningful.
Do you mean, for example, that p could be a measure and A could be a set? Since komponisto was talking about expressions of the form p(A) such that A can appear in expressions like A&B, I understood the context to be one in which we were already considering p to be a function over sentences or propositions (which, following komponisto, I was equating), and not, for example, sets.
Do you mean that “p(A)” can make sense in some case where A is a sentence, but not a sentence in some formal system? If so, would you give an example? Do you mean, for example, that A could be a statement in some non-formal language like English?
Or do you mean something else?
In my own interpretation, A is a hypothesis -- something that represents a possible state of the world. Hypotheses are of course subject to Boolean algebra, so you could perhaps model them as sentences or sets.
You have made a number of interesting comments that will probably take me some time to respond to.
I’ve been trying to develop a formal understanding of your claim that the prior probability of an unknown arbitrary hypothesis A makes sense and should equal 0.5. I’m not there yet, but I have a couple of tentative approaches. I was wondering whether either one looks at all like what you are getting at.
The first approach is to let the sample space Ω be the set of all hypotheses, endowed with a suitable probability distribution p. It’s not clear to me what probability distribution p you would have in mind, though. Presumably it would be “uniform” in some appropriate sense, because we are supposed to start in a state of complete ignorance about the elements of Ω.
At any rate, you would then define the random variable v : Ω → {True, False} that returns the actual truth value of each hypothesis. The quantity “p(A), for arbitrary unknown A” would be interpreted to mean the value of p(v = True). One would then show that half of the hypotheses in Ω (with respect to p-measure) are true. That is, one would have p(v = True) = 0.5, yielding your claim.
I have two difficulties with this approach. First, as I mentioned, I don’t see how to define p. Second, as I mentioned in this comment, “the truth of a binary string is a property involving the territory, while prior probability should be entirely determined by the map.” (ETA: I should emphasize that this second difficulty seems fatal to me. Defining p might just be a technicality. But making probability a property of the territory is fundamentally contrary to the Bayesian Way.)
The second approach tries to avoid that last difficulty by going “meta”. Under this approach, you would take the sample space Ω to be the set of logically consistent possible worlds. More precisely, Ω would be the set of all valuation maps v : {hypotheses} → {True, False} assigning a truth value to every hypothesis. (By calling a map v a “valuation map” here, I just mean that it respects the usual logical connectives and quantifiers. E.g., if v(A) = True and v(B) = True, then v(A & B) = True.) You would then endow Ω with some appropriate probability distribution p. However, again, I don’t yet see precisely what p should be.
Then, for each hypothesis A, you would have a random variable V_A : Ω → {True, False} that equals True on precisely those valuation maps v such that v(A) = True. The claim that “p(A) = 0.5 for arbitrary unknown A” would unpack as the claim that, for every hypothesis A, p(V_A = True) = 0.5 — that is, that each hypothesis A is true in exactly half of all possible worlds (with respect to p-measure).
Do either of these approaches look to you like they are on the right track?
ETA: Here’s a third approach which combines the previous two: When you’re asked “What’s p(A), where A is an arbitrary unknown hypothesis?”, and you are still in a state of complete ignorance, then you know neither the world you’re in, nor the hypothesis A whose truth in that world you are being asked to consider. So, let the sample space Ω be the set of ordered pairs (v, A), where v is a valuation map and A is a hypothesis. You endow Ω with some appropriate probability distribution p, and you have a random variable V : Ω → {True, False} that maps (v, A) to True precisely when v(A) = True, i.e., when A is true under v. You give the response “0.5” to the question because (we suppose) p(V = True) = 0.5.
But I still don’t see how to define p. Is there a well-known and widely-agreed-upon definition for p? On the one hand, p is a probability distribution over a countably infinite set (assuming that we identify the set of hypotheses with the set of sentences in some formal language). [ETA: That was a mistake. The sample space is countable in the first of the approaches above, but there might be uncountably many logically consistent ways to assign truth values to hypotheses.] On the other hand, it seems intuitively like p should be “uniform” in some sense, to capture the condition that we start in a state of total ignorance. How can these conditions be met simultaneously?
I think the second approach (and possibly the third also, but I haven’t yet considered it as deeply) is close to the right idea.
It’s pretty easy to see how it would work if there are only a finite number of hypotheses, say n: in that case, Ω is basically just the collection of binary strings of length n (assuming the hypothesis space is carved up appropriately), and each map V_A is evaluation at a particular coordinate. Sure enough, at each coordinate, half the elements of Ω evaluate to 1, and half to 0!
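Written out as a sketch (n and the coordinates are arbitrary; this is just the finite model described above): with Ω the set of all length-n bit strings and V_A evaluation at A’s coordinate, exactly half of Ω evaluates to 1 at every coordinate.

```python
from itertools import product

n = 4                                        # number of atomic hypotheses
omega = list(product([0, 1], repeat=n))      # all 2**n possible "worlds"

for coordinate in range(n):                  # V_A: evaluation at A's coordinate
    true_count = sum(world[coordinate] for world in omega)
    assert true_count == len(omega) // 2     # exactly half the worlds say "true"
```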
More generally, one could imagine a probability distribution on the hypothesis space controlling the “weighting” of elements of Ω. For instance, if hypothesis #6 gets its probability raised, then those mappings v in Ω such that v(6) = 1 would be weighted more than those such that v(6) = 0. I haven’t checked that this type of arrangement is actually possible, but something like it ought to be.
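One arrangement of that kind, offered only as a sketch of a possibility: give each coordinate its own marginal probability and weight each element of Ω by the product of its coordinates’ marginals. Raising the marginal for hypothesis #6 then automatically weights the maps with v(6) = 1 more heavily, while the weights still sum to 1.

```python
from itertools import product

marginals = [0.5] * 8
marginals[6] = 0.9                 # hypothesis #6 gets its probability raised

def weight(world):
    """Product-measure weight of a world, i.e. a tuple of 0/1 truth values."""
    w = 1.0
    for bit, m in zip(world, marginals):
        w *= m if bit == 1 else (1 - m)
    return w

omega = list(product([0, 1], repeat=len(marginals)))
total = sum(weight(w) for w in omega)
p_6 = sum(weight(w) for w in omega if w[6] == 1)
print(round(total, 6), round(p_6, 6))    # 1.0 0.9
```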
Here are a few problems that I have with this approach:
1. This approach makes your focus on the case where the hypothesis A is “unspecified” seem very mysterious. Under this model, we have P(V_A = True) = 0.5 even for a hypothesis A that is entirely specified, down to its last bit. So why all the talk about how a true prior probability for A needs to be based on complete ignorance even of the content of A? Under this model, even if you grant complete knowledge of A, you’re still assigning it a prior probability of 0.5. Much of the push-back you got seemed to be around the meaningfulness of assigning a probability to an unspecified hypothesis. But you could have sidestepped that issue and still established the claim in the OP under this model, because here the claim is true even of specified hypotheses. (However, you would still need to justify that this model is how we ought to think about Bayesian updating. My remaining concerns address this.)
2. By having Ω be the collection of all bit strings of length n, you’ve dropped the condition that the maps v respect logical operations. This is equivalent to dropping the requirement that the possible worlds be logically possible. E.g., your sample space would include maps v such that v(A) = v(~A) for some hypothesis A. But, maybe you figure that this is a feature, not a bug, because knowledge about logical consistency is something that the agent shouldn’t yet have in its prior state of complete ignorance. But then …
3. … If the agent starts out as logically ignorant, how can it work with only a finite number of hypotheses? It doesn’t start out knowing that A, A&A, A&A&A, etc., can all be collapsed down to just A, and that’s infinitely many hypotheses right there. But maybe you mean for the n hypotheses to be “atomic” propositions, each represented by a distinct proposition letter A, B, C, …, with no logical dependencies among them, and all other hypotheses built up out of these “atoms” with logical connectives. It’s not clear to me how you would handle quantifiers this way, but set that aside. The more important problem is …
4. … How do you ever accomplish any nontrivial Bayesian updating under this model? For suppose that you learn somehow that A is true. Now, conditioned on A, what is the probability of B? Still 0.5. Even if you learn the truth value of every hypothesis except B, you still would assign probability 0.5 to B.
Is this a description of what the prior distribution might be like? Or is it a description of what updating on the prior distribution might yield?
If you meant the former, wouldn’t you lose your justification for claiming that the prior probability of an unspecified hypothesis is exactly 0.5? For, couldn’t it be the case that most hypotheses are true in most worlds (counted by weight), so that an unknown random hypothesis would be more likely to be true than not?
If you meant the latter, I would like to see how this updating would work in more detail. I especially would like to see how Problem 4 above could be overcome.
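To make the updating worry concrete, here is the same uniform bit-string model with B as one arbitrary coordinate: even after conditioning on the truth values of every other atomic hypothesis, the probability of B is still exactly 0.5, so nothing learned about the other atoms ever moves it.

```python
from itertools import product

n = 4
omega = list(product([0, 1], repeat=n))     # uniform prior over all worlds
B = 0                                       # coordinate of hypothesis B

evidence = (1, 0, 1)                        # observed values of every other atom
consistent = [w for w in omega
              if tuple(w[i] for i in range(n) if i != B) == evidence]

p_B = sum(w[B] for w in consistent) / len(consistent)
print(p_B)                                  # 0.5, whatever the evidence was
```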
Knowing that a statement is a proposition is far from being in total ignorance.
Writing about propositions using the word “statements” and then correcting people who say you are wrong based on true things they say about actual statements would be annoying. Please make it clear you aren’t doing that.
Neither the grandparent nor (so far as I can tell) the great-grandparent makes the distinction between “statements” and “propositions” that you have drawn elsewhere. I used the term “statement” because that was what was used in the great-grandparent (just as I used it in my other comment because it was used in the post). Feel free to mentally substitute “proposition” if that is what you prefer.
Shall I mentally substitute “acoustic vibrations in the air” for “an auditory experience in a brain”?
My previous comment should have sufficed to communicate to you that I do not regard the distinction you are making as relevant to the present discussion. It should be amply clear by this point that I am exclusively concerned with things-that-must-be-either-true-or-false, and that calling attention to a separate class of utterances that do not have truth-values (and therefore do not have probabilities assigned to them) is not an interesting thing to do in this context. Downvoted for failure to take a hint.
Then when Eliezer says:
you shouldn’t say:
as a response to Eliezer making true statements about statements and not playing along with the OP’s possible special definition of “statement”. If, when Eliezer read “statement”, he had interpreted it as “proposition”, he might have been unreasonable in inferring there was an error; but he didn’t, so he wasn’t. So you shouldn’t implicitly call him out as having made a simple error.
As far as “even Yudkowsky won’t realize you’re saying something sophisticated and true rather than banal and false” goes, no one can read minds. It is possible the OP meant to convey actual perfect understanding with the inaccurate language he used. Likewise for “Not just low-status like believing in a deity, but majorly low status,” assuming idiosyncratic enough meaning.
I’m calling attention to a class that contains all propositions as well as other things. Probabilities may be assigned to statements being true even if they are actually false, or neither true nor false. If a statement is specified to be a proposition, you have information such that a bare 50% won’t do.
Where did you get the idea that “statement” in Eliezer’s comment is to be understood in your idiosyncratic sense of “utterances that may or may not be ‘propositions’”? Not only do I dispute this, I explicitly did so earlier when I wrote (emphasis added):
Indeed, it is manifestly clear from this sentence in his comment:
that Eliezer means by “statement” what you have insisted on calling a “proposition”: something with truth-conditions, i.e. which is capable of assuming a truth-value. I, in turn, simply followed this usage in my reply. I have never had the slightest interest in entering a sub-discussion about whether this is a good choice of terminology. Furthermore, I deny the following:
and, indeed, regard the falsity of that claim as a basic background assumption upon which my entire discussion was premised.
Perhaps it would make things clearer if the linguistic terminology (“statement”, “proposition”, etc) were abandoned altogether (being really inappropriate to begin with), in favor of the term “hypothesis”. I can then state my position in (hopefully) unambiguous terms: all hypotheses are either true or false (otherwise they are not hypotheses), hypotheses are the only entities to which probabilities may be assigned, and a Bayesian with literally zero information about whether a hypothesis is true or false must assign it a probability of 50% -- the last point being an abstract technicality that seldom if ever needs to be mentioned explicitly, lest it cause confusion of the sort we have been seeing here (so that Bayesian Bob indeed made a mistake by saying it, although I am impressed with Zed for having him say it).
Make sense now?
You can state it better like this: “A Bayesian with literally zero information about the hypothesis.”
“Zero information about whether a hypothesis is true or false” implies that we know the hypothesis, and we just don’t know whether it’s a member in the set of true propositions.
“Zero information about the hypothesis” indicates what you really seem to want to say—that we don’t know anything about this hypothesis; not its content, not its length, not even who made the hypothesis, or how it came to our attention.
I don’t see how this can make sense, in one respect. If we don’t know exactly how a hypothesis came to our attention, then we know that it didn’t come to our attention in a way that stuck with us, and that is itself some information about how it came to our attention: we know some ways that it didn’t.
You’re thinking of human minds. But perhaps we’re talking about a computer that knows it’s trying to determine the truth-value of a proposition, but the history of how the proposition got inputted into it was deleted from its memory; or perhaps it was designed never to hold that history in the first place.
So it knows that whoever gave it the proposition didn’t have the power, desire, or competence to tell it how it got the proposition.
It knows the proposition is not from a mind that is meticulous about making sure those to whom it gives propositions know where the propositions are from.
If the computer doesn’t know that it doesn’t know how it learned of something, and can’t know that, I’m not sure it counts as a general intelligence.
What odds does “manifestly clear” imply when you say it? I believe he was referring to either X or Y, as otherwise the content of the statement containing “one and only one...X or Y” would be a confusing coincidence (coincidence is the best word I can think of). So I think it most likely that “call that a statement” is a very poorly worded phrase referring simultaneously, but separately, to statement X and statement Y.
In general, there is a problem with prescribing taboo when one of the two parties is claiming a third party is wrong.
I am impressed by your patience in light of my comments. I think it not terribly unlikely that in this argument I am the equivalent of Jordan Leopold or Ray Fittipaldo (not an expert!), while you are Andy Sutton.
But I still don’t think that’s probable, and think it is easy to see that you have cheated at rationalist’s taboo as one term is replacing the excluded ones, a sure sign that mere label swapping has taken place.
I still think that if I only know that something is a hypothesis and know nothing more, I have enough knowledge to examine how I know that and use an estimate of the hypothesis’ bits that is superior to a raw 0%. I don’t think “a Bayesian with literally zero information about whether a hypothesis is true or false” is a meaningful sentence. You know it’s a hypothesis because you have information. Granted, the final probability you estimate could be 50/50.