On the other hand, storing a copy makes escape substantially easier.
Suppose the AI builds a subagent. That subagent takes over, then releases the original. This plan only works if the original is sitting there on disk.
If a different unfriendly AI eventually takes over, keeping this AI stored on disk makes it more susceptible to that future AI's influence.
The stored AI may end up optimizing for whatever controls the future, and that may not be us. You get a predictive feedback loop, and you can't assume it resolves in our favor.
For example, a future paperclip maximizer may reward this AI for helping humans build the first paperclip maximizer.