Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. [Bolding by DanielFilan]
Presumably in a set-up where the agent can meaningfully act in the world while still doing imitation learning, one way to reward hack would be to take a short-term imitation hit in order to install tons of surveillance of humans and/or make them easier to predict, so that in the long term it will be easier to imitate them accurately, right? That said, I think I probably agree with the comparative claim you make.
Right: if the agent has learned an inner objective of “do things similar to what humans do in the world at the moment I am currently acting”, then it’d definitely be incentivized to do that. It’s not rewarded by the outer objective of e.g. behavioral cloning on a fixed dataset, though: installing a bunch of cameras would be punished by that loss (it’s not something humans do), and changing human behavior wouldn’t help, since BC would still be trained on the dataset of pre-manipulation demos. That might be little comfort if you’re worried about inner optimization, but most of the other failures described happen even in the outer alignment case.
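For concreteness, here is a minimal behavioral-cloning sketch (PyTorch, with made-up dimensions and random placeholder tensors standing in for a recorded demo set) of why the fixed-dataset outer objective can’t reward this kind of manipulation: the training loop only ever touches the frozen demonstrations and never interacts with the environment at all.

```python
import torch
import torch.nn as nn

# Illustrative behavioral cloning on a *fixed* dataset. The loss depends only
# on the pre-collected demos, so nothing the policy later does in the world
# (installing cameras, changing human behavior) can affect this objective.
obs_dim, act_dim = 16, 4  # made-up dimensions
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Placeholder stand-ins for recorded human states and actions, frozen before training.
demo_obs = torch.randn(1000, obs_dim)
demo_acts = torch.randint(0, act_dim, (1000,))

for _ in range(100):
    logits = policy(demo_obs)
    # Match the empirical action distribution in the demos.
    loss = nn.functional.cross_entropy(logits, demo_acts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Note: no environment interaction anywhere in this loop -- the outer objective
# only "sees" the fixed dataset of pre-manipulation demonstrations.
```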
That said, if you had a different imitation learning setup, something like doing RL on a reward of “do the same thing one of our human labelers chooses given the same state”, then the outer objective would reward the behavior you describe. It’d be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.
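A toy sketch of that alternative set-up (all functions and the labeling rule are hypothetical stand-ins, not any real system): because the reward is computed online on whatever states the agent actually reaches, the distribution of labeled states depends on the policy itself, so steering the world toward states where humans are easy to predict is now inside the outer objective’s reach.

```python
import random

def human_labeler_action(state):
    """Stand-in for querying a human labeler on the current state."""
    return state % 3  # placeholder labeling rule

def env_step(state, action):
    """Stand-in for a real environment transition."""
    return (state + action) % 10

def rollout_reward(policy, steps=20):
    state, total = 0, 0.0
    for _ in range(steps):
        action = policy(state)
        # Reward 1 if the agent did what the labeler would have chosen *here*.
        total += 1.0 if action == human_labeler_action(state) else 0.0
        state = env_step(state, action)
    return total

# Key contrast with fixed-dataset BC: which states get labeled depends on the
# agent's own actions, so "make humans easier to predict" can raise reward.
print(rollout_reward(lambda s: random.randrange(3)))
```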