Hm. Yeah, I think you’ve correctly identified the main issue, which is that if it knows enough about the world, or is faced with problems that are informed by the details of the world, then it can predict G2 ahead of time and so obfuscation is a bit pointless—one just kind of has to hope it doesn’t care.
Maybe you want something like myopia—perhaps the AI has some local goal like “find the simplest proof,” and it doesn’t care about controlling the future, it just outputs whatever proof is the simplest. Although to get this right we’d have to somehow acquire a myopic optimizer, which is harder than just implementing a myopic training environment.
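To make the distinction a bit more concrete, here's a toy sketch in Python of what a purely myopic objective could look like. The names (`proof_simplicity`, `myopic_proof_search`) and the use of proof length as a stand-in for simplicity are my own illustrative assumptions, not anyone's actual proposal; the point is just that nothing about future consequences enters the score.

```python
# Toy illustration of a "myopic" objective: score each candidate proof
# purely by a local property and return the best one. (Hypothetical names;
# length is a crude proxy for simplicity.)

def proof_simplicity(proof: str) -> float:
    """Local score: shorter proofs count as simpler."""
    return -len(proof)

def myopic_proof_search(candidate_proofs: list[str]) -> str:
    """Pick the proof with the best local score, and nothing else."""
    return max(candidate_proofs, key=proof_simplicity)

# Hypothetical usage:
candidates = ["long, elaborate proof with many lemmas ...", "short proof"]
print(myopic_proof_search(candidates))  # -> "short proof"

# The hard part, as noted above, is guaranteeing that a trained system
# actually *is* this kind of optimizer, rather than merely having been
# trained in an environment where only the local score was rewarded.
```

Of course, writing the objective down is the easy part; the worry in the paragraph above is precisely that training against a myopic objective doesn't by itself give you an optimizer that is internally myopic.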