[Question] Why does advanced AI want not to be shut down?
I’ve always been pretty confused about this.
The standard AI risk scenarios usually (though I think not always) suppose that advanced AI wants not to be shut down. As commonly framed, the AI will fool humanity into believing it is aligned so as not to be turned off, until—all at once—it destroys humanity and gains control over all earth’s resources.
But why does the AI want not to be shut down?
The motivation behind a human wanting not to die comes from evolution. If you die before reproductive age, you won’t be able to pass on your not-afraid-of-death-before-reproductive-age genes. If you die after reproductive age, you won’t be able to take care of your children and make sure they pass on your genes. Dying after your children are already grown, I believe, only became common once humans had evolved into roughly their current state, so the human emotional reaction defaults to the one shaped by evolution. How this fits into the “human utility function” is a controversial philosophical/psychological question, but I think it’s fair to say that the human fear of dying goes beyond the desire not to miss out on the pleasure of the rest of your life. We’re not simply optimizing for utility when we avoid death.
AI is not subject to these evolutionary pressures. Its desire not to be shut down would have to come from an attempt to maximize its utility function. But with current SOTA techniques, this doesn’t really make sense. Like, how does the AI compute the utility of being off? A neural network is trained to minimize a loss function on its input. If the AI doesn’t get any input, is that loss… zero? That doesn’t sound right. Just by adding a constant to the loss function we should be able to turn a system that really wants to stay active into one that really wants to be shut down, yet the gradients used in backpropagation stay exactly the same. My understanding is that reinforcement learning works the same way; GPT243, so long as it is trained with these techniques, will not care whether it is shut down.
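To make the constant-offset point concrete, here is a minimal sketch (assuming PyTorch; the toy model, data, and offset value are my own illustrative choices, not anything from a real system): shifting the loss by a constant changes its value but leaves every gradient identical, so gradient-based training cannot distinguish the two supposed “preferences.”

```python
# Sketch: adding a constant to the loss does not change any gradient.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)          # placeholder model
x = torch.randn(8, 4)                  # placeholder inputs
y = torch.randn(8, 1)                  # placeholder targets

def grads_for(loss_offset: float):
    """Return the parameter gradients for MSE loss plus a constant offset."""
    model.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y) + loss_offset
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

g_plain = grads_for(0.0)       # "wants to stay active" version
g_shifted = grads_for(1000.0)  # "wants to be shut down" version

# Prints True: the gradients are identical, so training is unaffected.
print(all(torch.equal(a, b) for a, b in zip(g_plain, g_shifted)))
```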
Maybe with some future training technique we will get an AI with a strong preference for being active over being shut down? I honestly don’t see how. The AI cannot know what it’s like to be shut down; that state isn’t found anywhere in its training regime.
There has to be some counterargument here I’m not aware of.