I wrote a follow-up partly addressing the issue of actions vs. outcomes. (Or at least, covering one technical issue I omitted from the original post for want of space.)
I agree that Hugh must reason about how well different actions satisfy Hugh’s goals, and the AI must reason about (or make implicit generalizations about) these judgments. Where am I moving the values complexity problem? The point was to move it into the AI’s predictions about which actions Hugh would approve of.
What part of the argument in particular do you think I am being imprecise about? There are particular failure modes, like “deceiving Hugh” or especially “resisting correction,” which I would expect this procedure to avoid. I see no reason why the system would resist correction, for example. I don’t see how this is due to confusion about outcomes vs. actions.