Right: if the agent has learned an inner objective of “do things similar to what humans do in the world at the moment I am currently acting”, then it’d definitely be incentivized to do that. It isn’t rewarded for that by the outer objective under e.g. behavioral cloning on a fixed dataset: installing a bunch of cameras would be penalized by that loss (it’s not something humans do), and changing human behavior wouldn’t help, since BC would still be trained on the dataset of pre-manipulation demos. That might be little comfort if you’re worried about inner optimization, but most of the other failures described happen even in the outer alignment case.
That said, if you had a different imitation learning setup, something like doing RL on a reward of “do the same thing one of our human labelers chooses given the same state”, then the outer objective would reward the behavior you describe. It’d be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.
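A minimal sketch of the contrast between the two objectives (all names and data here are hypothetical stand-ins, not any particular training setup): the BC loss is computed against a fixed recorded dataset, so nothing the agent does at deployment time can move it, whereas the RL-style imitation reward depends on what a labeler chooses online in the state the agent actually reached, so influencing the labeler or the state can influence the reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Behavioral cloning: the loss compares the policy to a FIXED dataset
# of pre-recorded human demos. Manipulating humans after recording
# cannot change this quantity.
demo_states = rng.normal(size=(100, 4))      # hypothetical recorded states
demo_actions = rng.integers(0, 3, size=100)  # hypothetical recorded actions

def bc_loss(policy_logits: np.ndarray, actions: np.ndarray) -> float:
    """Cross-entropy of the policy against the fixed demo actions."""
    probs = np.exp(policy_logits) / np.exp(policy_logits).sum(axis=1, keepdims=True)
    return float(-np.log(probs[np.arange(len(actions)), actions]).mean())

# RL-on-imitation variant: reward is "did you match what a labeler
# chooses, online, in the current state". Because labeler_choice is
# produced at reward time, an agent that can shape the labeler's
# behavior (or the states it visits) can shape its own reward.
def imitation_reward(agent_action: int, labeler_choice: int) -> float:
    return 1.0 if agent_action == labeler_choice else 0.0
```

The asymmetry in the argument above is just this: `demo_actions` is frozen before the agent acts, while `labeler_choice` is a fresh input the agent's actions can causally affect.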