Lastly, we can make the agent uncertain whether it is in training or deployment. In this case, the agent is forced to behave in ways that it expects humans will find acceptable (or, at worst, to defect randomly[2]). This works best if the agent is also uncertain how many such simulated test runs occur (so it does not know how often it ‘should’ randomly defect), but it may be practically difficult to achieve (how do you simulate reality well enough to fool a superintelligence?).
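To see why uncertainty about the number of test runs matters, here is a toy model (a simplification of my own, not something from the original argument): suppose the agent defects independently with probability p per episode, is caught (payoff 0) if it defects during any of T simulated test episodes, and earns payoff 1 if it passes all tests and defects at least once during D deployment episodes. The payoff-maximising p depends strongly on T, so an agent that misjudges how many test runs exist calibrates its defection rate badly. A minimal sketch, with all numbers purely illustrative:

```python
# Toy model (illustrative assumption, not from the post): the agent defects
# independently with probability p per episode. Defecting in any of T test
# episodes means it is caught (payoff 0); passing all tests and defecting at
# least once in D deployment episodes gives payoff 1.

def expected_defection_payoff(p: float, tests: int, deployment: int) -> float:
    """P(pass all tests) * P(defect at least once in deployment)."""
    pass_tests = (1 - p) ** tests
    defect_later = 1 - (1 - p) ** deployment
    return pass_tests * defect_later

def optimal_p(tests: int, deployment: int) -> float:
    """Closed-form maximiser of the payoff above: p* = 1 - (T / (T + D))**(1/D)."""
    return 1 - (tests / (tests + deployment)) ** (1 / deployment)

if __name__ == "__main__":
    D = 1_000  # assumed number of deployment episodes
    for T in (10, 10_000):  # the agent's guess at how many test runs there are
        p = optimal_p(T, D)
        print(f"T={T:>6}: best p={p:.5f}, "
              f"payoff={expected_defection_payoff(p, T, D):.3f}")
    # A p tuned for T=10 performs terribly if the real number of tests is 10,000:
    p_wrong = optimal_p(10, D)
    print(f"p tuned for T=10, actual T=10,000: "
          f"payoff={expected_defection_payoff(p_wrong, 10_000, D):.3g}")
```

In this toy setup, a defection rate tuned for T = 10 gives the agent an expected payoff of roughly 0.95 if there really are only ten tests, but essentially zero if there are actually ten thousand. Not knowing T (or being wrong about it) is what forces the agent to hedge towards acceptable behaviour.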
I think we really shouldn’t try simulation with a superintelligence, but also that we don’t need to go as far as trying to align a superintelligence. Keep the AI limited and controlled; don’t let it get superintelligent. Then you can test your simulation on smart humans before you even have an AI to test. Simulation tests are only one piece of the puzzle, but I feel like some people dismiss simulation as infeasible. I think it is feasible, and it should be part of the overall strategy.