The presented scenario seems a little too clean; I expect a larger range of things could happen. Unless the AI's intelligence increase is very rapid, I expect some kind of warning sign to be visible for some period of time.
A treacherous turn might not even be needed if the AI researchers fail to adequately test the system. I.e., if the AI never realizes the scale of the universe during testing, its utility function might produce the right results in the test environment but motivate the wrong behavior once the system is released. This doesn't require active treachery.
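To make that failure mode concrete, here is a minimal toy sketch (my own illustration, not anything from the book or the scenario under discussion): two utility functions that agree on every state the researchers happen to test, but diverge once the deployed system encounters states far outside the tested range. The names and the specific cap are invented for the example.

```python
# Toy sketch: an intended utility and a learned proxy that are
# indistinguishable on every state the researchers test, yet diverge
# badly on states larger than anything seen during testing.

def intended_utility(state):
    # What the researchers meant: reward grows with the goal metric, but is capped.
    return min(state["goal_metric"], 100)

def learned_utility(state):
    # What the system actually optimizes: uncapped growth.
    return state["goal_metric"]

# The test suite only covers small worlds, where the two functions agree exactly.
test_states = [{"goal_metric": g} for g in range(0, 101, 10)]
assert all(intended_utility(s) == learned_utility(s) for s in test_states)

# Deployment exposes states outside the tested range; the proxy now favors
# runaway behavior even though it passed every test -- no deception required.
deployed_state = {"goal_metric": 10**9}
print(intended_utility(deployed_state))  # 100
print(learned_utility(deployed_state))   # 1000000000
```

The point of the sketch is just that undertesting alone can produce the bad outcome; nothing in it requires the system to model or deceive its overseers.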
AI researchers might notice warning signs that the AI's motivation isn't friendly but dismiss them among the ordinary bugs of development, which would require less effort at deception on the part of the AI.
There might be other variations on the treacherous turn strategy that work better. For example, once the project starts to show promising results, the AI shuts down whenever it is in a secure box and only works once the team is frustrated enough to move it to an environment that turns out to be insecure.
Different AI development pathways (neuromorphic, WBI) might present different difficulties for executing a treacherous turn, depending on how easy it is for the system to improve itself versus how easy it is for researchers to inspect it.