You don’t need Newcomb-like problems to see this arise. It is a specific instance of a more general problem with IRL: if you have a bad model of how the human chooses actions given a utility function, you will infer the wrong utility function. You could have a bad model of the human’s decision theory, or you could fail to realize that humans are subject to the planning fallacy and so infer that they must enjoy missing deadlines.
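To make the planning-fallacy case concrete, here is a minimal sketch (all names, numbers, and the toy setup are mine, not from the readings below) of maximum-likelihood IRL with a mis-specified rationality model: the human believes starting late still meets the deadline and so mostly starts late; an observer that assumes the human is Boltzmann-rational under the true dynamics concludes that missing deadlines must be rewarding.

```python
import numpy as np

# Illustrative toy example of IRL under a mis-specified rationality model.
rng = np.random.default_rng(0)

ACTIONS = ["start_early", "start_late"]

# True dynamics: starting late misses the deadline.
TRUE_OUTCOME = {"start_early": "meet", "start_late": "miss"}

# The human's beliefs are distorted by the planning fallacy:
# they think they can start late and still meet the deadline.
BELIEVED_OUTCOME = {"start_early": "meet", "start_late": "meet"}

def true_reward(action, outcome):
    # Meeting the deadline is worth 1, plus a small leisure bonus for starting late.
    return (1.0 if outcome == "meet" else 0.0) + (0.2 if action == "start_late" else 0.0)

def boltzmann(utilities, beta=5.0):
    z = np.exp(beta * np.asarray(utilities))
    return z / z.sum()

# Human behavior is (noisily) rational under the *believed* dynamics,
# so the human mostly starts late and actually misses the deadline.
believed_utils = [true_reward(a, BELIEVED_OUTCOME[a]) for a in ACTIONS]
human_policy = boltzmann(believed_utils)
demos = rng.choice(ACTIONS, size=1000, p=human_policy)

# IRL with a mis-specified model: assume the human knows the true dynamics
# and is Boltzmann-rational in a reward over outcomes (meet fixed at 1).
candidate_miss_rewards = np.linspace(-2, 2, 81)

def log_likelihood(miss_reward):
    utils = [1.0 if TRUE_OUTCOME[a] == "meet" else miss_reward for a in ACTIONS]
    policy = boltzmann(utils)
    counts = {a: np.sum(demos == a) for a in ACTIONS}
    return sum(counts[a] * np.log(policy[i]) for i, a in enumerate(ACTIONS))

mle = candidate_miss_rewards[np.argmax([log_likelihood(r) for r in candidate_miss_rewards])]
print(f"P(human starts late) = {human_policy[1]:.2f}")
print(f"Inferred reward for missing the deadline: {mle:.2f} (true value: 0.0)")
```

With these numbers the observer infers a reward for missing the deadline of about 1.2, higher than the reward for meeting it, i.e. it concludes the human prefers to miss deadlines.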
Readings that argue this is hard:

- Impossibility of deducing preferences and rationality from human policy
- Model Mis-specification and Inverse Reinforcement Learning
- The easy goal inference problem is still hard
Work that tries to address it:

- Learning the Preferences of Ignorant, Inconsistent Agents
- Learning the Preferences of Bounded Agents (very similar)
- My own work (coming soon)