The most basic examples compare derived preferences that assume the human is always rational (i.e. every action they take, no matter how mistaken it may appear, is in the service of some complicated plan for how the universe’s history should go; my friend getting drunk and knocking over his friend’s dresser was all planned and totally in accordance with his preferences) with derived preferences that assume the human is irrational in some way (e.g. maybe they would prefer not to drink so much coffee, but can’t wake up without it, so the action that best fulfills their preferences is to help them drink less coffee).
But more intuitive examples might involve comparison between two different sorts of human irrationality.
For example, in the case of coffee, the AI is supposed to learn that the human has some pattern of thoughts and inclinations that mean they don’t actually want coffee, and that their coffee-drinking is due to some sort of limitation or mistake.
But consider a different mistake: not doing heroin. After all, upon trying heroin, the human would be happy and would exhibit behavior consistent with wanting heroin. So we might imagine an AI that infers that humans want heroin, and that their current avoidance of heroin is due to some sort of mistake.
Both theories can be prediction-identical—the two different sets of “real preferences” just need to be filtered through two different models of human irrationality. Depending on what you classify as “irrational,” this degree of freedom translates into a change in what you consider “the real preferences.”
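To make the prediction-identical point concrete, here is a minimal sketch (illustrative only, not from the original post; the reward numbers and the “caffeine-dependence bias” are made up for the example). It shows two different pairings of “real preferences” with an irrationality model that assign exactly the same probabilities to the human’s observable actions, so no amount of behavioral data alone can tell them apart.

```python
# Illustrative sketch: two (preferences, irrationality-model) pairs that
# predict identical behavior. All numbers here are hypothetical.
import numpy as np

actions = ["drink coffee", "skip coffee"]
observed_policy = np.array([0.9, 0.1])  # the human almost always drinks coffee

def boltzmann_policy(reward, beta):
    """Softmax ('noisily rational') choice probabilities over actions."""
    logits = beta * reward
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Theory A: the human genuinely prefers coffee and is noisily rational.
reward_A = np.array([np.log(0.9), np.log(0.1)])   # "real preferences": wants coffee
policy_A = boltzmann_policy(reward_A, beta=1.0)

# Theory B: the human's "real preferences" favor less coffee, but an
# irrationality model (a caffeine-dependence bias) inverts the effective reward.
reward_B = -reward_A                               # "real preferences": wants less coffee
def biased_planner(reward, beta):
    return boltzmann_policy(-reward, beta)          # the bias flips the preference

policy_B = biased_planner(reward_B, beta=1.0)

# Both theories give the same probability to every observable action.
assert np.allclose(policy_A, policy_B)
assert np.allclose(policy_A, observed_policy)
print(policy_A, policy_B)  # both [0.9 0.1]
```

The data pins down only the composite (irrationality model ∘ preferences); how you split that composite into “the real preferences” and “the mistakes” is exactly the degree of freedom described above.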