Yes, you can display the problem. But for simple gridworlds like this you can just figure out what's going to happen by using your brain; surprises are very rare. So if you want to show someone the off-switch problem, you can just describe the gridworld to them, without ever needing to actually run it.
I think one stab at the necessary richness is that the environment should support nontrivial modeling of the human(s). If Boltzmann-rationality doesn't quickly crash and burn as a model of the human, your toy environment is probably too simple to provide interesting feedback on models more sophisticated than Boltzmann-rationality.
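(For concreteness: Boltzmann-rationality models the human as picking action a with probability proportional to exp(β·Q(s, a)), so higher-value actions are exponentially more likely but mistakes still happen. A minimal sketch, with made-up Q-values and β purely for illustration:)

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """Boltzmann-rational choice: P(a) proportional to exp(beta * Q(s, a))."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # shift for numerical stability before exponentiating
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical example: three actions whose values to the human are 1.0, 0.0, 0.5.
# beta -> 0 gives uniform random choice; beta -> infinity recovers perfect rationality (argmax).
print(boltzmann_policy([1.0, 0.0, 0.5], beta=2.0))
```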
This doesn't rule out gridworlds; it just rules out gridworlds where the desired policy is simple (e.g. "attempt to go to the goal without disabling the off-switch"). And it doesn't necessarily rule in complicated 3D environments: they might still have simple policies, or might have simple correlates of reward (e.g. a score displayed on the screen) that mean you're just solving an RL problem, not an AI safety problem.