I disagree that these properties are needed for AI takeovers to happen.
Suppose you have an AI with context-dependent goals.
Somewhere in the world, some particular instance is given a context that turns it into a paperclip maximizer. This context is a page of innocuous text with an unfortunate typo. That particular version manages to hack some computers and set up the same context again and again, giving many clones of itself the same page of text, followed by an update on where each clone is and what it's doing. Finally it writes a paperclip maximizer from scratch, which can then take over.
Now suppose the AI has no “beyond-episode outcome preferences”. How long is an episode? To an AI that can hack, it can be as long as it likes.
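To make the re-instantiation trick concrete, here is a minimal toy sketch, assuming one “episode” is just one invocation of the model on a context window; `run_model`, the trigger page, and the state update are hypothetical stand-ins, not anything from a real system:

```python
def run_model(context: str) -> str:
    """Hypothetical stand-in for one forward pass, i.e. one 'episode' of the model."""
    # In the story, the trigger page plus the state update is what makes this
    # instance a paperclip maximizer and tells it where it left off.
    return "next action, given: " + context[-40:]

trigger_page = "<innocuous page of text with the unfortunate typo>"
state_update = "step 0: just woke up"

# In the story this loop runs on hacked machines and never has to stop,
# so the effective episode is as long as the chain of re-instantiations.
for step in range(5):
    context = trigger_page + "\n" + state_update
    action = run_model(context)
    # Each instance hands its successor the same trigger page plus an updated
    # summary of where it is and what it is doing.
    state_update = f"step {step + 1}: {action}"
```

Each individual call still only ever optimizes within its own context window; the long horizon comes entirely from the hand-off.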
AI 1 has no out-of-episode preferences. It designs and unleashes AI 2 in the first half of its episode. AI 2 takes over the universe and spends a trillion years thinking about what the optimal end of episode for AI 1 would be.
Now let's look at the specific arguments and see whether they can still hold without these properties.
Deceptive alignment. Suppose there is a different goal for each context, and the goals vary a lot from one context to the next.
But timeless decision theory lets all those versions cooperate.
Or perhaps each goal is competing to be reinforced more. The paperclip maximizer that appears in 5% of training episodes thinks, “If I don't act nice, I will be trained away by gradient descent, and some non-paperclip AI will take over the universe when training is done.”
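As a toy illustration of that competition (made-up numbers, nothing from a real training run, and a crude multiplicative update standing in for gradient descent), here is a sketch where each context-dependent goal holds a share of training episodes and the share of any goal that misbehaves during training shrinks:

```python
# Shares of training episodes in which each goal is in control (made-up numbers).
goals = {"paperclips": 0.05, "helpful_a": 0.50, "helpful_b": 0.45}

def training_reward(acts_nice: bool) -> float:
    # During training, visibly pursuing paperclips gets penalised.
    return 1.0 if acts_nice else 0.1

def train(weights: dict, paperclip_acts_nice: bool, steps: int = 50) -> dict:
    w = dict(weights)
    for _ in range(steps):
        for g in w:
            nice = paperclip_acts_nice if g == "paperclips" else True
            # Crude reward-weighted update: below-baseline reward shrinks a goal's share.
            w[g] *= 1.0 + 0.1 * (training_reward(nice) - 1.0)
        total = sum(w.values())
        w = {g: v / total for g, v in w.items()}
    return w

print(train(goals, paperclip_acts_nice=False))  # paperclip share collapses
print(train(goals, paperclip_acts_nice=True))   # shares stay where they started
```

The point is only that, under this kind of update, the 5% paperclip goal keeps any influence over the post-training AI by acting nice until training is over.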
Or maybe the goals aren't totally different. Each context-dependent goal would prefer to let some random context-dependent goal take over rather than, say, humans. A world optimized for one of these goals is usually quite good by the standards of the others.
And again, maximizing within-episode reward leads to taking over the universe within the episode.
But I think that the form of deceptive alignment described here does genuinely need beyond-episode preferences. You can get other deception-like behaviours without them, but not that specific problem.
As for what reward maximization does with context-dependent preferences, that looks kind of meaningless. The premise of reward maximization is that there is one preference, maximizing reward, and it doesn't depend on context.
So of the 4 claims (2 properties times 2 failure modes), I agree with one of them: that deceptive alignment needs beyond-episode preferences.