There are many models in play: the model of the box which we simulate, and the AI’s models of that box model. For this ultimate box to work, there would have to be a proof that every possible model the AI could form contains at most a representation of the ultimate box model. This seems at least as hard as any of the AI boxing methods, if not harder, because it requires the AI to be absolutely blinded to its own reasoning process despite having a human subject from which to learn about naturalized induction/embodiment.
It’s tempting to say that we could “define the AI’s preferences only over the model”, but that implies either a static AI model of the box model that can’t benefit from learning, or else a proof that all AI models are restricted as above. In short, it’s perfectly fine to run a SAT-solver over possible permutations of the ultimate box model trying to maximize some utility function, but that’s not self-improving AI.
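For concreteness, here is a minimal sketch of what that last, “perfectly fine” kind of computation looks like: exhaustive search (a brute-force stand-in for a SAT-solver) over permutations of a fixed, hand-written box model, scoring each arrangement with a toy utility function. The component names and the utility function are invented for illustration; the point is that the model is static and the optimizer never learns or revises its own reasoning.

```python
# Hypothetical sketch: brute-force optimization over a *fixed* box model.
# Nothing here is self-improving; the model and utility are frozen by hand.
from itertools import permutations

# Invented components of the "ultimate box" model (illustrative only).
BOX_COMPONENTS = ("sensor", "actuator", "channel", "monitor")

def utility(arrangement):
    # Toy utility: prefer arrangements where the monitor comes before the channel.
    return arrangement.index("channel") - arrangement.index("monitor")

# Enumerate every permutation of the static model and keep the best one.
best = max(permutations(BOX_COMPONENTS), key=utility)
print(best, utility(best))
```

The search never touches its own code or its model of the box, which is exactly why it stays safely inside the model and also why it isn’t the kind of self-improving AI the boxing argument is about.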