The alternative would be an AI that goes through the motions and mimics ‘how an agent would behave in a given situation’ with a certain level of fidelity, but which doesn’t actually exhibit goal-directed behavior.
Like, as long as we stay in the current deep learning paradigm, my prediction is that an AI unleashed upon the real world, regardless of how much processing power it has, still won’t behave like an agent unless that’s part of what we tell it to pretend.
I imagine something along the lines of the AI that was trained on how to play Minecraft by analyzing hours upon hours of gameplay footage. It will exhibit all kinds of goal-like behaviors, but at the end of the day it’s just a simulacrum, limited in its freedom of action to a radical degree by the ‘action space’ it has mapped out. It will only ever ‘act as though it’s playing Minecraft’, and the concept that ‘in order to be able to continue to play Minecraft I must prevent my creators from shutting me off’ is not part of that conceptual landscape, so it’s not the kind of thing the AI will pretend to care about.
And pretend is all it does.
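To gesture at what I mean by ‘pretend’: the training objective for that kind of system looks roughly like the toy sketch below. This is a hand-wavy illustration on my part, with made-up names like `FootageMimic`, not the architecture of any real Minecraft agent. The only thing the loss ever asks for is ‘predict what the recorded player did next’; nothing in it refers to rewards, staying alive, or anything outside the demonstrated action space.

```python
# Minimal behaviour-cloning sketch (hypothetical PyTorch setup, illustrative only).
# The model is trained to predict the recorded player's next action from frames;
# nothing in the loss mentions in-game rewards, survival, or being shut off.
import torch
import torch.nn as nn

class FootageMimic(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.encoder = nn.Sequential(                 # toy frame encoder
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),             # fixed-size features for any frame size
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 256), nn.ReLU(),
        )
        self.policy_head = nn.Linear(256, n_actions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # logits over the discretized keyboard/mouse actions seen in the footage
        return self.policy_head(self.encoder(frames))

model = FootageMimic(n_actions=20)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(frames: torch.Tensor, demo_actions: torch.Tensor) -> float:
    """One gradient step toward 'act like the footage did' -- the only objective."""
    logits = model(frames)
    loss = loss_fn(logits, demo_actions)   # match the demonstrator, nothing else
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```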
Humans are trained on how to live on Earth by hours of training on Earth. We can conceive of the possibility of Earth being controlled by an external force (God or the Simulation Hypothesis). Some people spend time thinking about how to act so that the external power continues to allow the Earth to exist.
Maybe most of us are just mimicking how an agent would behave in a given situation.
The universe appears to be well constructed to provide minimal clues as to the nature of its creator. Minecraft less so.
“Humans are trained on how to live on Earth by hours of training on Earth. (...) Maybe most of us are just mimicking how an agent would behave in a given situation.”
I agree that that’s a plausible enough explanation for lots of human behaviour, but I wonder how far you would get in trying to describe historical paradigm shifts using only a ‘mimic hypothesis of agenthood’.
Why would a perfect mimic that was raised on training data of human behaviour do anything paperclip-maximizer-ish? It doesn’t want to mimic being a human, just like Dall-E doesn’t want to generate images, so it doesn’t have a utility function for not wanting to be prevented from mimicking being a human, either.
“The alternative would be an AI that goes through the motions and mimics ‘how an agent would behave in a given situation’ with a certain level of fidelity, but which doesn’t actually exhibit goal-directed behavior.”
If the agent would act as if it wanted something, and the AI mimics how an agent would behave, the AI will act as if it wanted something.
“It will only ever ‘act as though it’s playing Minecraft’, and the concept that ‘in order to be able to continue to play Minecraft I must prevent my creators from shutting me off’ is not part of that conceptual landscape, so it’s not the kind of thing the AI will pretend to care about.”
I can see at least five ways in which this could fail:
1. It’s simpler to learn the goal of playing Minecraft well (rather than the goal of playing as similarly to the footage as possible). Maybe it’s faster, or it saves space, or both. An example of this would be AlphaStar, which first learned by mimicking humans but was then rewarded for winning games (a toy sketch of this two-phase setup follows at the end of this comment).
2. One part of this learning would be creating a mental model of the world, since that helps an agent better achieve its goals. The better this model is, the greater the chance it will contain humans, the AI itself, and the disutility of being turned off.
3. AIs already have inputs and outputs from/into the Internet and real life—they can influence much more than playing Minecraft. For a truly helpful AI, this influence will be deliberately engineered by humans to become even greater.
4. Eventually, we’ll want the AI to do better than humans. If it only emulates a human by imitating what a human would do (which itself could create a mesa-optimizer, if I understand it correctly), it will only be as useful as a human.
5. Even if the AI is only ever tasked with outputting whatever the training footage would output and nothing more (like being good at playing Minecraft in a different world environment), and it’s not simpler for it to learn how to play Minecraft the best way it can, that by itself, with sufficient cognition, ends the world. (The strawberry problem.)
So I think maybe some combination of (1), (2) and (3) will happen.
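To put a little more flesh on (1): the toy sketch below shows how the objective changes once a reward signal is bolted onto an imitation-pretrained policy. It’s loosely in the spirit of AlphaStar’s imitation-then-RL pipeline, but it’s my own made-up illustration rather than its actual code, and it works with any policy that outputs action logits (e.g. the `FootageMimic` toy above).

```python
# Toy contrast between the imitation objective and a reward-maximizing
# fine-tuning objective (REINFORCE-style). Hypothetical, illustrative only.
import torch
import torch.nn.functional as F

def imitation_loss(model, frames, demo_actions):
    """Phase 1: match the footage. The gradient points toward the demonstrator."""
    return F.cross_entropy(model(frames), demo_actions)

def reinforce_loss(model, frames, taken_actions, returns):
    """Phase 2: maximize reward. The gradient points toward whatever wins,
    even if no human in the footage ever played that way."""
    log_probs = F.log_softmax(model(frames), dim=-1)
    chosen = log_probs.gather(1, taken_actions.unsqueeze(1)).squeeze(1)
    return -(chosen * returns).mean()  # higher return => that action gets reinforced
```

The point is just that once the second loss is in play, ‘play like the footage’ is no longer the thing the system is being optimized for.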