> For example, I currently find it really confusing to think about corrigible agents relative to goal-directed agents.
Strong agree, and I do think it’s the biggest downside of trying to build non-goal-directed agents.
> The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify.
For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
> For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
> Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
That’s a good point. Solving metaphilosophy does seem to have the potential to help both approaches about equally.
> For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
> Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don’t think it’s an advantage, then I don’t think we disagree here.
> Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
That makes sense. I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it’s not clear whether non-goal-directed AI could do something similar.