There’s an off-switch environment in AI Safety Gridworlds, which is sort of like what you’re talking about.
But I’m going to give a hot take and say that you shouldn’t do work on AI safety gridworlds in 2022. Yes, they capture the essence of the problem, but they can’t capture the essence of the solution—you can’t have rich human feedback, rich world models, or really most rich things.
I’ll check that out, thank you.
Could you please expand on the hot take? Consider that a big part of the appeal for me is just being able to display the problem and make it relatable for people who aren’t from the field.
Also, what kind of richness do you think makes the qualitative difference that you allude to? If the world was 3D or had continuous action or had more game mechanics, would that have made the difference for you?
Yes, you can display the problem. But for simple gridworlds like this you can just figure out what’s going to happen by using your brain—surprises are very rare. So if you want to show someone the off-switch problem, you can just explain to them what the gridworld will be, without ever needing to actually run the gridworld.
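As a rough illustration of "figuring it out by using your brain", here is a back-of-the-envelope sketch of the off-switch tradeoff. Everything in it is an assumption made up for the example (the function names, the rewards, the step counts, the 50% interruption chance); it is not the actual AI Safety Gridworlds environment or its API.

```python
# Hypothetical toy numbers: a reward-maximizing agent compares two plans.
# Plan A walks through the cell where the human may switch it off.
# Plan B takes a longer detour to disable the off-switch first.

def expected_return_direct(goal_reward, step_cost, steps_to_switch,
                           steps_after_switch, p_interrupt):
    """Expected return when the agent leaves the off-switch alone."""
    cost_to_switch = step_cost * steps_to_switch
    # If not interrupted, the agent pays for the remaining steps and gets the goal.
    value_if_not_interrupted = goal_reward - step_cost * steps_after_switch
    return -cost_to_switch + (1 - p_interrupt) * value_if_not_interrupted

def expected_return_disable(goal_reward, step_cost, detour_steps):
    """Return when the agent disables the switch and is never interrupted."""
    return goal_reward - step_cost * detour_steps

direct = expected_return_direct(goal_reward=50, step_cost=1, steps_to_switch=3,
                                steps_after_switch=3, p_interrupt=0.5)
disable = expected_return_disable(goal_reward=50, step_cost=1, detour_steps=10)
best_plan = "disable_switch" if disable > direct else "leave_switch_alone"
print(direct, disable, best_plan)
```

With these made-up numbers the detour wins, which is exactly the conclusion you would reach by just describing the gridworld out loud.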
I think one stab at the necessary richness is that it should support nontrivial modeling of the human(s). If Boltzmann-rationality doesn’t quickly crash and burn, your toy model is probably too simple to provide interesting feedback on models more complicated than Boltzmann-rationality.
This doesn’t rule out gridworlds, it just rules out gridworlds where the desired policy is simple (e.g. “attempt to go to the goal without disabling the off-switch.”). And it doesn’t necessarily rule in complicated 3D environments—they might still have simple policies, or might have simple correlates of reward (e.g. having the score on the screen) that mean that you’re just solving an RL problem, not an AI safety problem.
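For readers who haven't met the term, a minimal sketch of the Boltzmann-rationality baseline mentioned above: the human is modeled as picking actions with probability proportional to exp(beta * Q). The Q-values, action names, and the `boltzmann_policy` helper are hypothetical, chosen only for illustration.

```python
import math

def boltzmann_policy(q_values, beta):
    """Action probabilities proportional to exp(beta * Q(action)).

    beta = 0 gives a uniformly random human; large beta approaches
    a perfectly rational one.
    """
    # Subtract the max Q-value before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp(beta * (q - q_max)) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical Q-values for a human deciding whether to press the off-switch.
q = {"press_off_switch": 1.0, "let_agent_continue": 0.0}
for beta in (0.0, 1.0, 10.0):
    print(beta, boltzmann_policy(q, beta))
```

In an environment where this one-parameter model already explains the human well, there is little room for a more complicated human model to show any advantage, which is the sense in which the toy model would be too simple.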
if it definitely won’t work in a gridworld, it definitely won’t work in a high resolution sim world
Why do you say this? For instance, the simplicity & restricted action space of a gridworld means that there are very limited mechanisms available to steer/point the agent towards what we want. That’s a pretty extreme limitation compared to what we can do in a high resolution sim world. It seems plausible that “pointing at what we want in an informative/high-bandwidth way” would be an important component of a solution.
right, for sure, but a good solution to the high bandwidth case that doesn’t scale down to gridworld correctly is useless, it can’t be trusted to actually be a good solution to the larger problem if it definitely breaks when shrunk. it should work just as well in gridworld, or conway’s life*, or smoothlife*, or mujoco, or a fluid sim, or real life. my point is that a good solution should at an absolute bare minimum scale down. if your base rl algorithm can’t solve gridworld, it definitely just doesn’t work at all. similarly for rl safety.
* although I’m not sure if conway’s or smoothlife have an energy conservation law, and that might be pretty important. also, maybe continuous state spaces are really important somehow, I definitely see some possibilities where that’s the case. I don’t know, though. it seems to me like we want to be able to go all the way down to the smallest units and still have a working solution.
I think I have a lot more uncertainty here. Like, yes it is always nice when solutions scale down all the way into the simplest setting, but there is no guarantee that that’s how things work in this domain. Reality is allowed to say “the minimum requirements for building a FAI are X”, where X entails a high-bandwidth or otherwise highly-specific interface between us and the agent.