Could you please expand on the hot take? Consider that a big part of the appeal for me is just being able to display the problem and make it relatable for people who aren’t from the field.
Also, what kind of richness do you think makes the qualitative difference that you allude to? If the world was 3D or had continuous action or had more game mechanics, would that have made the difference for you?
Yes, you can display the problem. But for simple gridworlds like this you can just figure out what’s going to happen by using your brain—surprises are very rare. So if you want to show someone the off-switch problem, you can just explain to them what the gridworld will be, without ever needing to actually run the gridworld.
I think one stab at the necessary richness is that it should support nontrivial modeling of the human(s). If Boltzmann-rationality doesn’t quickly crash and burn, your toy model is probably too simple to provide interesting feedback on models more complicated than Boltzmann-rationality.
This doesn’t rule out gridworlds, it just rules out gridworlds where the desired policy is simple (e.g. “attempt to go to the goal without disabling the off-switch”). And it doesn’t necessarily rule in complicated 3D environments—they might still have simple policies, or might have simple correlates of reward (e.g. having the score on the screen) that mean you’re just solving an RL problem, not an AI safety problem.
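For concreteness, Boltzmann-rationality just models the human as picking actions with probability proportional to the exponentiated value of each action. A minimal sketch (the Q-values and temperature below are made up for illustration, not taken from any particular environment):

```python
import math

def boltzmann_policy(q_values, beta):
    """P(a) proportional to exp(beta * Q(a)).

    Higher beta -> closer to perfectly rational;
    beta = 0 -> uniformly random.
    """
    weights = [math.exp(beta * q) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-action choice: "go straight to goal" vs.
# "disable the off-switch first" (Q-values invented for the example).
probs = boltzmann_policy([1.0, 1.2], beta=5.0)
```

The point of the test in the parent comment is that if a model this simple already predicts the human well in your environment, the environment probably isn’t rich enough to stress-test anything more sophisticated.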
I’ll check that out, thank you.