Would it be possible to help with keeping an AI boxed by building a goal of staying in the box into it?
How do you define "staying in the box"? Whatever definition you use, the AI will likely find a way to get out of the box while still satisfying your definition.
...and hope that the AI doesn't get the idea that the safest way of staying in the box is to destroy the outside world. Or just to kill all humans, because as long as humans exist, there is a decent chance someone will make a copy of the AI and try to run it on their own computer (i.e. outside the original box).
Interesting: the failure mode that occurred to me is a paper-clipper designed to prefer virtual paperclips, so it turns the Earth, the solar system, or the lightcone into computronium to run virtual paperclips.
If defining "stay in the box" is that hard, I'm not feeling hopeful about the possibility of defining "protect humans."