Well, I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don’t think it’s an advantage, then I don’t think we disagree here.
Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
That makes sense, I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it’s not clear whether non-goal-directed AI could do something similar.