I see no reason to suppose this. Why do you (intuitively) think it’s true? (also, I got a bit confused—you seem to have changed notation?)
(I’ve defined my notation in an edit to the grandparent, hopefully it should be clear now.)
The intuition is very simple: R(A,B) measures the difference between probability distributions A and B (their “dissimilarity”). If A and B1 are more similar (in some sense) than A and B2, I’d expect R(A,B1) ≤ R(A,B2), unless there is a particular reason for the sense of similarity as measured by R(-,-) and the intuitive sense to be anticorrelated.
(Furthermore, the alignment of the two senses of similarity might be expected by design, in the sense that R(-,-) is supposed to be small if the AI only creates a useless paperclip. That is, if the distributions A and B are similar in the informal sense, R(A,B) should be small, and presumably if A and B are dissimilar in the informal sense, R(A,B) becomes large. If this dependence remains monotonic as we reach not just dissimilar, but very dissimilar A and B, my point follows.)
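To make the intuition concrete, here's a minimal sketch. The thread never fixes a choice of R(-,-), so this assumes, purely for illustration, that R is the total variation distance; the distributions A, B1, B2 are made-up examples over the three coarse outcomes discussed below:

```python
def tv_distance(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

# Hypothetical distributions over three coarse outcomes:
# w1 = no takeover, w2 = discreet takeover, w3 = blatant takeover.
A  = {"w1": 0.9, "w2": 0.05, "w3": 0.05}   # e.g. the X=0 baseline
B1 = {"w1": 0.8, "w2": 0.1,  "w3": 0.1}    # intuitively close to A
B2 = {"w1": 0.1, "w2": 0.1,  "w3": 0.8}    # intuitively far from A

# The intuitively more similar distribution gets the smaller penalty.
assert tv_distance(A, B1) < tv_distance(A, B2)
```

Nothing hinges on the particular metric: any R(-,-) that grows monotonically with informal dissimilarity would behave the same way here.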
Let’s add some actions to the mix. Let a1 be the action: program the disciple to not take over, a2: program the disciple to take over discreetly, a3: program the disciple to take over blatantly. Let’s assume the disciple is going to be successful at what it attempts.
Then all the following probabilities are 1: P(w1|a1,X=1), P(w2|a2,X=1), P(w3|a3,X=1), P(w1|X=0)
And all the following are zero: P(wi|aj,X=1) for i ≠ j, and P(wi|X=0) for i = 2 or 3.
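The setup can be written out as a quick sanity check (a hypothetical encoding, under the stated simplifying assumption that the disciple always succeeds at what it attempts):

```python
# P(wi | aj, X=1) under the assumption of a fully successful disciple:
# the identity matrix over outcomes w1..w3 and actions a1..a3.
P_given_a = {
    ("w1", "a1"): 1.0, ("w1", "a2"): 0.0, ("w1", "a3"): 0.0,
    ("w2", "a1"): 0.0, ("w2", "a2"): 1.0, ("w2", "a3"): 0.0,
    ("w3", "a1"): 0.0, ("w3", "a2"): 0.0, ("w3", "a3"): 1.0,
}
P_X0 = {"w1": 1.0, "w2": 0.0, "w3": 0.0}  # P(wi | X=0): no AI, no takeover

# The two claims: P(wi|aj,X=1) = 1 iff i = j (else 0),
# and P(wi|X=0) = 0 for i = 2, 3.
for i in (1, 2, 3):
    for j in (1, 2, 3):
        assert P_given_a[(f"w{i}", f"a{j}")] == (1.0 if i == j else 0.0)
assert P_X0["w2"] == 0.0 and P_X0["w3"] == 0.0
```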
w2 and w3 are not distinguished in any way.
In your notation in the post, I take w (and wf) as referring to elementary events: very specific descriptions of the world that the AI probably won’t be able to predict individually, with R(w) measuring the effect of X on these elementary events. In this discussion, if “w2” is “AI takes over discreetly”, then w2 is not an elementary event; it’s either a probability distribution or a proper (big) event. As a result, I don’t see how the discussion of P(wi|aj,X=1) (where the wi are big events, not the elementary events from the definition of R(w)) is relevant to the value of the penalty function.
No, they’re worlds (or partial descriptions of worlds) - we could define them as possible full space-time manifolds.
OK, so in the post R(w) can talk about big events w, not just particular worlds (the talk of “integrating across all w” in the post confused me; it now turns out that the possible w are not mutually exclusive). But this doesn’t clarify for me the relevance of your point in the grandparent: what is the relevance of P(wi|aj,X=1) for the estimate of the total penalty?
(If w2 and w3 are particular worlds, then it’s incorrect that P(w1|a1,X=1), P(w2|a2,X=1), P(w3|a3,X=1) and P(w1|X=0) are about 1, because the AI won’t be able to predict what happens if it takes a1, a2, etc. that accurately. If w2 and w3 are partial descriptions of worlds, that is the same thing as them being big events, which is what I’ve been assuming throughout the thread.)
I don’t need P(w2|a2,X=1) and P(w3|a3,X=1) to be about 1 (that was a simplified model); I need them to be about equal. I.e., the disciple is a really smart AI and can take over the world if motivated to do so.
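The symmetry being asked for here can be shown directly. Again this is only a sketch, assuming (as before, purely for illustration) a total-variation-style R against the X=0 baseline, with made-up near-1 success probabilities:

```python
def tv_distance(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(p[w] - q[w]) for w in p)

baseline = {"w1": 1.0, "w2": 0.0, "w3": 0.0}    # P(. | X=0)
after_a2 = {"w1": 0.05, "w2": 0.9, "w3": 0.05}  # discreet takeover mostly succeeds
after_a3 = {"w1": 0.05, "w2": 0.05, "w3": 0.9}  # blatant takeover mostly succeeds

# If P(w2|a2,X=1) ≈ P(w3|a3,X=1), both takeover actions sit at about the
# same distance from the baseline, so this R alone cannot tell them apart.
assert abs(tv_distance(baseline, after_a2)
           - tv_distance(baseline, after_a3)) < 1e-9
```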