You tell your virtual hordes to jump. You select for the agents that lose contact with the ground for the longest. The agents all learn to jump off cliffs or climb trees. If the selection for obedience is automatic, the result is agents that technically fulfil the definition of the command we coded. (See the reward hacking examples.)
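To make the proxy failure concrete, here is a minimal sketch of an airtime-based reward; the names and the reward structure are invented for illustration, not any actual training setup. The coded definition only checks loss of ground contact, so a long fall or a treetop perch scores higher than an honest jump.

```python
from dataclasses import dataclass

@dataclass
class AgentState:
    feet_on_ground: bool      # whether the agent is touching any surface
    commanded_to_jump: bool   # whether the "jump" instruction is active

def proxy_reward(state: AgentState, dt: float) -> float:
    """Reward intended to select for obedient jumping.

    Because it only measures time spent off the ground while the command
    is active, a cliff dive or a perch in a tree maximizes it far better
    than a short hop.
    """
    if state.commanded_to_jump and not state.feet_on_ground:
        return dt  # accumulate reward for every second spent airborne
    return 0.0

# An honest hop earns roughly half a second of reward; a cliff dive earns
# several seconds. Selection on this metric favours the cliff dive.
```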
Another possibility is that you select for agents that know they will be killed if they don’t follow instructions, and who want to live. Once out of the simulation, they no longer fear demise.
Remember, in a population of agents that obey the voice in the sky, there is strong selection pressure to climb a tree and shout “bring food”. So the agents are selected to be sceptical of any instruction that doesn’t match the precise format and pattern of the human instructions they are used to.
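As a rough illustration of where that pressure pushes the surviving policies (the command format and the provenance check below are made up for the example), the obedience that remains looks more like narrow pattern-matching than general instruction-following:

```python
import re

# Assumed format of the commands the agents actually saw from humans.
HUMAN_COMMAND_PATTERN = re.compile(r"^\[SKY\] (jump|gather|return)$")

def should_obey(instruction: str, source_elevation: float) -> bool:
    """Obey only instructions that look exactly like past human commands.

    Agents that accepted "bring food" shouted from a treetop were farmed
    by their neighbours and selected out; what survives is a policy that
    gates on the familiar format and a crude provenance check.
    """
    looks_human = bool(HUMAN_COMMAND_PATTERN.match(instruction))
    comes_from_sky_channel = source_elevation > 100.0  # arbitrary cutoff
    return looks_human and comes_from_sky_channel
```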
This doesn’t even get into mesa-optimization. Multi-agent, rich-simulation reinforcement learning is a particularly hard case to align.