> For example, I currently find it really confusing to think about corrigible agents relative to goal-directed agents.
Strong agree, and I do think it’s the biggest downside of trying to build non-goal-directed agents.
> The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify.
For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
> For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
> Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
That’s a good point. Solving metaphilosophy does seem to have the potential to help both approaches about equally.
> For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
> Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don’t think it’s an advantage, then I don’t think we disagree here.
> Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
That makes sense. I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it’s not clear whether non-goal-directed AI could do something similar.