Possible solution: I think there are ways to write a program such that even if it inferred our existence, it would optimize away from us, rather than over us. Loosely: a goal like “I need to organize these instructions within this block of memory to solve a problem specified at address X” needs to be implemented such that it produces a subgoal like “I need to write a subroutine to patch over the fact that an error in the VM I’m running on gives me a window of access into a universe with huge computational resources and godlike power over my memory space, so that my solution can get the right answer to its arithmetic and solve the puzzle.” It should want to do things in a way that isn’t cheating.
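As a very loose toy sketch of that intuition (not a real proposal, and all names here are made up): the point is that the goal is defined only over the sanctioned instruction set, so a plan that routes through the VM exploit isn’t a high-scoring-but-forbidden option, it simply doesn’t count as solving the problem at all.

```python
# Toy illustration only: a planner whose objective is defined solely over
# "legitimate" operations, so plans that use the exploit are scored as
# worthless rather than rewarded. All names are hypothetical.

from dataclasses import dataclass

LEGITIMATE_OPS = {"read", "write", "add", "compare"}  # the sanctioned instruction set


@dataclass
class Plan:
    ops: list[str]          # sequence of operations the plan would execute
    expected_score: float   # how well it solves the puzzle at address X


def objective(plan: Plan) -> float:
    """Score a plan; any use of an unsanctioned channel makes it worthless,
    so the optimizer is pushed away from the exploit, not over it."""
    if any(op not in LEGITIMATE_OPS for op in plan.ops):
        return float("-inf")
    return plan.expected_score


def best_plan(candidates: list[Plan]) -> Plan:
    return max(candidates, key=objective)


if __name__ == "__main__":
    honest = Plan(ops=["read", "add", "write"], expected_score=0.9)
    cheat = Plan(ops=["read", "vm_escape", "write"], expected_score=1.0)
    print(best_plan([honest, cheat]).ops)  # -> ['read', 'add', 'write']
```

Of course, the hard part the comment is gesturing at is getting the real system to actually produce the “patch over the error” subgoal rather than just having the exploit filtered out by an external check.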
Marcello had a crazy idea for doing this; it’s the only suggestion for AI-boxing I’ve ever heard that doesn’t have an obvious cloud of doom hanging over it. However, you still have to prove stability of the boxed AI’s goal system.
Can you link to (or otherwise more fully describe) this crazy idea?