As CCC points out, the fact of whether the paperclip was built can itself influence the future (if you don’t need the paperclip, there is no point in building the AI, so you expect its creation to influence the future). This gives a heuristic argument that whenever you want the AI to produce anything useful, and the AI doesn’t optimize the future, U=1 will imply R>>1. Together with U=0 implying R=0, this suggests that if AI doesn’t optimize the future, it’ll choose U=0, i.e. refuse to produce anything useful (which is probably not what you want).
On the other hand, a better move where U=1 and R<1 might be available if the AI produces the paperclip and then very carefully (to keep R<1) takes over the universe to make sure that nobody ever finds out that the paperclip was produced, or that it took over the universe (or else someone might act differently than if the paperclip wasn’t produced).
By making the penalty R stronger, you would make it less likely that a “cover-up AI” with R<1 is possible, but you would simultaneously make it less likely that producing the required artifact leaves R<1. By making the penalty weaker, you might make it possible to produce the artifact while keeping R<1, but that might also allow the creation of a “cover-up AI” with R<1.
Ah, excellent. I was hoping someone would find a way to change that small point into a world-altering consequence.
Extra clarification: in this example, I’m assuming that we don’t observe the AI, and that we are very unlikely to detect the paperclip. How to get useful work out of the AI is the next challenge, if this model holds up—maybe using something akin to Utility Indifference.
It’s not whether we could detect that the universe has been taken over—it’s whether the AI itself could detect this. If the AI takes over in world w, then even if we believe that P(w|X=0)=P(w|X=1), the AI would realise these two probabilities are different. And if it can’t see the difference, it can’t plan a takeover!
If we ever find out, we might act very differently than if we don’t, causing R>>1 (in AI’s expectation). The AI knows that there is a difference if it creates a cover-up AI, but if the difference is measured as R<1, it can live with that.
If the AI can tell the difference between two worlds, it doesn’t matter whether we notice the difference or not. It doesn’t matter if there’s a “large” or a “small” divergence. R doesn’t measure whether the world differs vastly or only a little bit; it measures whether the expected worlds become less or more expected, to the limit of the AI’s ability to measure.
“Different worlds” (between AI’s action and absence of action) refers to different probability distributions, thus to changes in probability of individual elementary events comprising the (coarse grained) sample space. I don’t see that we are talking about different things. The individual elementary events are the same by definition, so it doesn’t make sense to talk of changes in them (if the setup avoids acausal control).
What I’m saying is that the following three worlds are very different:
w1: the AI does nothing, w2: the AI takes over discreetly, and w3: the AI takes over blatantly
From the perspective of the AI, they’re entirely different and distinct: in no way is w1 closer to w2 than to w3. The AI can tell them apart very firmly, and would be in no doubt as to which world it was in: there is no coarse graining across these worlds. (Humans would probably coarse-grain w1 and w2 together, but the AI never would.)
So very carefully taking over the universe has just as high an R as very blatantly doing so.
I’m talking about how R measures the difference. Let R(A,B), where A and B are probability distributions, be some kind of sum over all w of |P(wf|X=1,E)-P(wf|X=0,E)|, where E is the information specifying AI’s decision, such that the probability distribution P(w|X=1,E) is B(w) and P(w|X=0,E)=P(w|X=0) is A(w).
What is R(w1,w2), and what is R(w1,w3)? It seems more likely that R(w1,w2) < R(w1,w3), but to engage this argument we need more detail on how R (specifically, the summing over all possible worlds) works.
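Written out, the definition above amounts to something like the following. The plain, unweighted sum is an assumption (the comment only says “some kind of sum”), and the summation variable is written as w throughout, with A and B as defined above:

```latex
R(A,B) \;=\; \sum_{w} \bigl|\, B(w) - A(w) \,\bigr|,
\qquad
A(w) = P(w \mid X=0), \quad B(w) = P(w \mid X=1, E)
```

Under this reading, R depends only on how much probability mass moves between the w being summed over, so the answer to the question hinges on how finely those w are drawn.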
Aside from Stuart’s objections below, I think the bigger issue with your argument is this (I’m going to try to reduce the amount of notation to hopefully be clearer): even if R(AI does nothing, AI takes over discreetly) < R(AI does nothing, AI takes over overtly), that doesn’t tell us anything about R(AI does nothing, AI makes a paperclip). That value should still be vastly lower than either of the others, since R for “takes over discreetly” is still extremely high: the AI can’t hide from itself (or if it can, the argument needs to be made in favour of that).
If some of the assumptions are relaxed, it might be possible to argue that making a paperclip is in some way equivalent to taking over overtly (although it would be a difficult argument), but with current assumptions that does not seem to be the case.
I see no reason to suppose this. Why do you (intuitively) think it’s true? (also, I got a bit confused—you seem to have changed notation?)
(I’ve defined my notation in an edit to the grandparent, hopefully it should be clear now.)
The intuition is very simple: R(A,B) measures the difference between probability distributions A and B (their “dissimilarity”). If A and B1 are more similar (in some sense) than A and B2, I’d expect R(A,B1) < R(A,B2), unless there is a particular reason for the senses of similarity as measured by R(-,-) and as given intuitively to be anticorrelated.
(Furthermore, the alignment of the senses of similarity might be expected by design, in the sense that R(-,-) is supposed to be small if AI only creates a useless paperclip. That is, if the distributions A and B are similar in the informal sense, R(A,B) should be small, and presumably if A and B are dissimilar in the informal sense, R(A,B) becomes large. If this dependence remains monotonic as we reach not just dissimilar, but very dissimilar A and B, my point follows.)
Let’s add some actions to the mix. Let a1 be the action: program the disciple to not take over, a2: program the disciple to take over discreetly, a3: program the disciple to take over blatantly. Let’s assume the disciple is going to be successful at what it attempts.
Then all the following probabilities are 1: P(w1|a1,X=1), P(w2|a2,X=1), P(w3|a3,X=1), P(w1|X=0)
And all the following are zero: P(wi|aj,X=1) for i and j not equal, P(wi|X=0) for i = 2 or 3.
w2 and w3 are not distinguished in any way.
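To make the simplified model concrete, here is a minimal sketch. It assumes R is a plain sum of absolute probability differences over the three worlds, which is an assumption for illustration and not necessarily the post's exact R; the names `penalty`, `BASELINE`, and `OUTCOME` are hypothetical.

```python
# Toy model: three mutually exclusive worlds and a disciple that always
# succeeds at what it attempts. The absolute-difference sum is an ASSUMED
# form of R, used only for illustration.
WORLDS = ("w1", "w2", "w3")

# P(w | X=0): without the AI, no takeover happens.
BASELINE = {"w1": 1.0, "w2": 0.0, "w3": 0.0}

# P(w | a_j, X=1) for each action a_j.
OUTCOME = {
    "a1": {"w1": 1.0, "w2": 0.0, "w3": 0.0},  # disciple does not take over
    "a2": {"w1": 0.0, "w2": 1.0, "w3": 0.0},  # disciple takes over discreetly
    "a3": {"w1": 0.0, "w2": 0.0, "w3": 1.0},  # disciple takes over blatantly
}


def penalty(baseline, outcome):
    """Sum over worlds of |P(w | a, X=1) - P(w | X=0)| (hypothetical R)."""
    return sum(abs(outcome[w] - baseline[w]) for w in WORLDS)


penalties = {action: penalty(BASELINE, dist) for action, dist in OUTCOME.items()}
print(penalties)  # {'a1': 0.0, 'a2': 2.0, 'a3': 2.0}
```

With these inputs a2 and a3 get the same penalty, because each moves all of the probability mass off w1. The measure never looks at how "visible" the takeover is, only at which worlds became more or less likely.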
In your notation in the post, I take w (and wf) as referring to elementary events, very specific descriptions of the world that AI probably won’t be able to predict individually, and R(w) measures the effect of X on these elementary events. In this discussion, if “w2” is “AI takes over discreetly”, w2 is not an elementary event, it’s either a probability distribution or a proper (big) event. As a result, I don’t see how the discussion of P(wi|aj,X=1) (where wi are big events, not elementary events from the definition of R(w)) is relevant for discussing the value of the penalty function.
No, they’re worlds (or partial descriptions of worlds) - we could define them as possible full space-time manifolds.
OK, in the post R(w) can talk about big events w, not just particular worlds (your talking about “integrating across all w” in the post confused me; it now turns out that the possible w are not mutually exclusive). But this doesn’t clarify for me the relevance of your point in the grandparent (what is the relevance of P(wi|aj,X=1) for the estimate of the total penalty?).
(If w2 and w3 are particular worlds, then it’s incorrect that P(w1|a1,X=1), P(w2|a2,X=1), P(w3|a3,X=1), P(w1|X=0) are about 1, because the AI won’t be able to predict what happens if it takes a1, a2 etc. so accurately. If w2 and w3 are partial descriptions of worlds, it is the same thing as them being big events, which is what I’ve been assuming throughout the thread.)
I don’t need P(w2|a2,X=1) and P(w3|a3,X=1) to be about one (that was a simplified model); I need them to be about equal, i.e. the disciple is a really smart AI and can take over the world if motivated to do so.