The alternative would be an AI that goes through the motions and mimics ‘how an agent would behave in a given situation’ with a certain level of fidelity, but which doesn’t actually exhibit goal-directed behavior.
If the agent would act as if it wanted something, and the AI mimics how an agent would behave, the AI will act as if it wanted something.
It will only ever ‘act as though it’s playing Minecraft’, and the concept that ‘in order to be able to continue to play Minecraft I must prevent my creators from shutting me off’ is not part of that conceptual landscape, so it’s not the kind of thing the AI will pretend to care about.
I can see at least five ways in which this could fail:
1. It’s simpler to learn the goal of playing Minecraft well than the goal of playing as similarly to the footage as possible. Maybe it’s faster, or it saves space, or both. An example of this would be AlphaStar, which first learned by mimicking humans but was then rewarded for winning games (see the sketch below).
2. Part of this learning would be building a mental model of the world, since that helps an agent better achieve its goals. The better this model is, the greater the chance that it will come to contain humans, the AI itself, and the disutility of being turned off.
3. AIs already have inputs and outputs from/into the Internet and real life; they can influence much more than a game of Minecraft. For a truly helpful AI, this influence will be deliberately engineered by humans to become even greater.
4. Eventually, we’ll want the AI to do better than humans. If it only emulates a human by imitating what a human would do (which could itself create a mesa-optimizer, if I understand that correctly), it will only ever be as useful as a human.
5. Even if the AI is only ever tasked with outputting whatever the training footage would output and nothing more (e.g. being good at playing Minecraft in a different world environment), and it isn’t simpler for it to learn to play Minecraft as well as it can, that task itself, pursued with sufficient cognition, ends the world. (The strawberry problem.)
So I think maybe some combination of (1), (2) and (3) will happen.
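To make the distinction in (1) concrete, here is a minimal, made-up PyTorch sketch of the two objectives: a behavioral-cloning loss that only rewards matching the footage, versus a REINFORCE-style loss that rewards winning. The network, data, and reward signal are placeholders for illustration; this is not AlphaStar’s actual training setup.

```python
# Toy contrast between "imitate the footage" and "play well".
# All shapes, data, and rewards are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_OBS, N_ACTIONS = 16, 8  # toy observation/action sizes

policy = nn.Sequential(nn.Linear(N_OBS, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(32, N_OBS)                          # stand-in for recorded game states
human_actions = torch.randint(0, N_ACTIONS, (32,))    # stand-in for recorded human inputs

# Phase 1 -- behavioral cloning: the loss only asks "did you output what the
# human in the footage output?" Matching the demonstrations *is* the goal.
optimizer.zero_grad()
bc_loss = F.cross_entropy(policy(obs), human_actions)
bc_loss.backward()
optimizer.step()

# Phase 2 -- reward fine-tuning (REINFORCE-style): the loss asks "did the
# chosen actions lead to winning?" Now 'playing well' is the goal, and the
# policy is free to drift away from how humans actually played.
optimizer.zero_grad()
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
returns = torch.randn(32)                             # stand-in for win/lose returns
rl_loss = -(dist.log_prob(actions) * returns).mean()
rl_loss.backward()
optimizer.step()
```

The point of the sketch is just that the second objective no longer mentions the footage at all, which is what lets the learned goal come apart from pure mimicry.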