The presented scenario seems a little too clean. I expect that there’s a larger range of things that could happen. I expect some kind of warning sign to be visible for some period of time, unless AI intelligence increase is very rapid.
It might not even be needed if the AI researchers fail to adequately test the system. Ie. if the AI never realizes the scale of the universe during testing, it’s utility function might produce the right results during testing, but motivate the wrong behavior when released. This doesn’t require active treachery.
AI researchers might notice warning signs that the AI’s motivation isn’t friendly but ignore them amongst the random bugs of development, requiring less effort at deception on part of the AI.
There might be other variations on the treacherous turn strategy that work better—for example, once the project starts to show promising results, AI shuts down whenever it is in a secure box, and only works once the team is frustrated enough to move it to an environment that turns out to be insecure.
Different AI pathways (neuromorphic, WBI) might have different difficulties for executing treacherous turn, depending on how easy it is to improve themselves vs. being inspected by researchers.
One alternative possibility is that the AI’s utility function will converge somewhat slowly as it’s capabilities increase and it’s intelligence increase. While it has not converged yet, it would behave somewhat non-perfectly, and we would consider it’s behaviors to be the stochastic precedents to a convergent phase: in other words, noise.
It would then have an incentive not necessarily to conceal the general direction towards which it is headed, but instead the noise rate of it’s normal responses. It could pretend to be an ethically clumsy automaton, like Koba the chimp does in Dawn of the Planet of the Apes in the scene in which he steals armament from humans.… not without killing them first.
Do you think a treacherous turn is the default outcome, if an AI is unsafe?
The presented scenario seems a little too clean. I expect that there’s a larger range of things that could happen. I expect some kind of warning sign to be visible for some period of time, unless AI intelligence increase is very rapid.
It might not even be needed if the AI researchers fail to adequately test the system. Ie. if the AI never realizes the scale of the universe during testing, it’s utility function might produce the right results during testing, but motivate the wrong behavior when released. This doesn’t require active treachery.
AI researchers might notice warning signs that the AI’s motivation isn’t friendly but ignore them amongst the random bugs of development, requiring less effort at deception on part of the AI.
There might be other variations on the treacherous turn strategy that work better—for example, once the project starts to show promising results, AI shuts down whenever it is in a secure box, and only works once the team is frustrated enough to move it to an environment that turns out to be insecure.
Different AI pathways (neuromorphic, WBI) might have different difficulties for executing treacherous turn, depending on how easy it is to improve themselves vs. being inspected by researchers.
One alternative possibility is that the AI’s utility function will converge somewhat slowly as it’s capabilities increase and it’s intelligence increase. While it has not converged yet, it would behave somewhat non-perfectly, and we would consider it’s behaviors to be the stochastic precedents to a convergent phase: in other words, noise.
It would then have an incentive not necessarily to conceal the general direction towards which it is headed, but instead the noise rate of it’s normal responses. It could pretend to be an ethically clumsy automaton, like Koba the chimp does in Dawn of the Planet of the Apes in the scene in which he steals armament from humans.… not without killing them first.