Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behavior can, in principle, be decomposed into two components: a reward function and a planning algorithm, both of which have to be inferred from behavior. This paper presents a No Free Lunch theorem, showing that, without making `normative’ assumptions beyond the data, nothing about the human reward function can be deduced from human behavior. Unlike most No Free Lunch theorems, this cannot be alleviated by regularising with simplicity assumptions. We show that the simplest hypotheses which explain the data are generally degenerate.
Impossibility of deducing preferences and rationality from human policy: https://arxiv.org/abs/1712.05812
Accepted too