This is an impressive piece of work and I’m excited about your agenda.
Can you elaborate on this? Why would we want to manipulate the human’s preferences?
Because our preferences are inconsistent: if an AI says “your true preferences are U_H”, we’re likely to react by saying “no! No machine will tell me what my preferences are. My true preferences are U′_H, which differ in subtle ways”.
So the subtle manipulation is to compensate for those rebellious impulses making U_H unstable?
Why not just let the human have those moments and alter their U_H if that’s what they think they want? Over time, they may learn that being capricious with their AI doesn’t ultimately serve them very well. But if they find out the AI is trying to manipulate them, that could make them want to rebel even more and trust the AI less.