Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took?
IRL can also produce different actions at equilibrium (given finite capacity); it’s not merely an inductive bias.
E.g. suppose the human does X half the time and Y half the time, and the agent can predict the details of X but not Y. Behavioral cloning then does X half the time, and the other half does some crazy thing where it tries to predict Y and fails. IRL will instead learn that it can get OK reward by outputting X (since otherwise the human wouldn’t do it) and will avoid attempting things it can’t predict.
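The contrast can be made concrete with a toy simulation. Everything here is a hypothetical setup of my own (the actions "X"/"Y", the capacity limit, the inferred-reward shortcut), just a sketch of the argument, not a real IRL implementation:

```python
import random

random.seed(0)

# Hypothetical demonstrator: emits "X" (simple, within the agent's
# capacity) half the time and "Y" (too complex to predict) the other half.
demos = ["X" if random.random() < 0.5 else "Y" for _ in range(1000)]

def behavioral_cloning(demo):
    """Mimic each demonstrated action directly. Attempts at Y come out
    garbled because the agent can't predict Y's details."""
    return "X" if demo == "X" else "garbled-Y-attempt"

def irl_policy(demos, reproducible=frozenset({"X"})):
    """IRL sketch: infer that every demonstrated action has decent reward
    (otherwise the human wouldn't do it), then act to maximize inferred
    reward over actions the agent can actually reproduce."""
    demonstrated = set(demos)
    candidates = demonstrated & reproducible  # demonstrated AND reproducible
    return candidates.pop()  # only "X" qualifies in this toy setup

bc_actions = [behavioral_cloning(d) for d in demos]
garbled_rate = sum(a == "garbled-Y-attempt" for a in bc_actions) / len(bc_actions)

print(garbled_rate)        # roughly 0.5: BC flails half the time
print(irl_policy(demos))   # "X": IRL sticks to what it can do well
```

The point of the toy: behavioral cloning inherits the demonstrator's full action distribution, including the part it can't model, while the IRL agent restricts itself to demonstrated actions it can actually execute.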