I’m curious if you’re more optimistic about non-goal-directed approaches to AI safety than goal-directed approaches, or if you’re about equally optimistic (or rather equally pessimistic). The latter would still justify your conclusion that we ought to look into non-goal-directed approaches, but if that’s the case I think it would be good to be explicit about it so as to not unintentionally give people false hope (ETA: since so far in this sequence you’ve mostly talked about the problems associated with goal-directed agents and not so much about problems associated with the alternatives). I think I’m about equally pessimistic, because while goal-directed agents have a bunch of safety problems, they also have a number of advantages that may be pretty hard to replicate in the alternative approaches.
We have an existing body of theory about goal-directed agents (which MIRI is working on refining and expanding) which plausibly makes it possible to one day reason rigorously about the kinds of goal-directed agents we might build and determine their safety properties. Paul and others working on his approach are (as I understand it) trying to invent a theory of corrigibility, but I don’t know if such a thing even exists in platonic theory space. And even if it does, we’re starting from scratch, so it might take a long time to reach parity with the theory of goal-directed agents.
Goal-directed agents give you economic efficiency “for free”. Alternative approaches have to simultaneously solve efficiency and safety, and may end up approximating goal-directed agents anyway due to competitive pressures.
Goal-directed agents can more easily avoid a bunch of human safety problems that are inherited by alternative approaches which all roughly follow the human-in-the-loop paradigm. These include value drift (including vulnerability to corruption/manipulation), problems with cooperation/coordination, lack of transparency/interpretability, and general untrustworthiness of humans.
While I mostly agree with all three of your advantages, I am more optimistic about non-goal-directed approaches to AI safety. I think this is primarily because I’m generally optimistic about AI safety, and the well-documented problems with goal-directed agents make me pessimistic about that particular approach.
If I had to guess at the source of optimism that I have and you don’t, it would be the belief that we can aim for an adequate, non-formalized solution, and that this will very likely be okay. All else equal, I would prefer a more formal solution, but I don’t think we have the time for that. I would guess that while this lack of formality makes me only a little more worried, it is a big source of worry for you and MIRI researchers. This means that argument 1 isn’t a big update for me.
Re: argument 2, it’s worth noting that a system that has some chance of causing catastrophe is going to be less economically efficient. Now people might build it anyway because they underestimate the chance of catastrophe, or because of race dynamics, but I’m hopeful that (assuming it’s true) we can convince all the relevant actors that goal-directed agents have a significant chance of causing catastrophe. In that case, non-goal-directed agents have a lower bar to meet. But overall this is a significant update.
Re: argument 3, I don’t really see why goal-directed agents are more likely to avoid human safety problems. It seems intuitively plausible—if you get the right goal, then you don’t have to rely on humans, and so you avoid their safety problems. However, even with goal-directed agents, the goal has to come from somewhere, which means it comes from humans. (If not, we almost certainly get catastrophe.) So wouldn’t the goal have all of the human safety problems anyway?
I’m also optimistic about our ability to solve human safety problems in non-goal-directed approaches—see for example the reply I just wrote on your CAIS comment.
All else equal, I would prefer a more formal solution, but I don’t think we have the time for that.
I should have added that having a theory isn’t just so we can have a more formal solution (which, as you mention, we might not have the time for); it also helps us be less confused (e.g., have better intuitions) in our less formal thinking. (In other words, I agree with the value of what MIRI calls “deconfusion”.) For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents.
However, even with goal-directed agents, the goal has to come from somewhere, which means it comes from humans. (If not, we almost certainly get catastrophe.) So wouldn’t the goal have all of the human safety problems anyway?
The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify. All of these have their own problems, of course, but they do avoid a lot of the human safety problems that the non-goal-directed approaches would have to address some other way.
For example currently I find it really confusing to think about corrigible agents relative to goal-directed agents.
Strong agree, and I do think it’s the biggest downside of trying to build non-goal-directed agents.
The goal could come from idealized humans, or from a metaphilosophical algorithm, or be an explicit set of values that we manually specify.
For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
For the case of idealized humans, couldn’t real humans defer to idealized humans if they thought that was better?
Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
Similarly, it seems like a non-goal-directed agent could be instructed to use the metaphilosophical algorithm. I guess I could imagine a metaphilosophical algorithm such that following it requires you to be goal-directed, but it doesn’t seem very likely to me.
That’s a good point. Solving metaphilosophy does seem to have the potential to help both approaches about equally.
For an explicit set of values, those values come from humans, so wouldn’t they be subject to human safety problems? It seems like you would need to claim that humans are better at stating their values than acting in accordance with them, which seems true in some settings and false in others.
Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
Well I’m not arguing that goal-directed approaches are more promising than non-goal-directed approaches, just that they seem roughly equally (un)promising in aggregate.
Your first comment was about advantages of goal-directed agents over non-goal-directed ones. Your next comment talked about explicit value specification as a solution to human safety problems; it sounded like you were arguing that this was an example of an advantage of goal-directed agents over non-goal-directed ones. If you don’t think it’s an advantage, then I don’t think we disagree here.
Real humans could be corrupted or suffer some other kind of safety failure before the choice to defer to idealized humans becomes a feasible option. I don’t see how to recover from this, except by making an AI with a terminal goal of deferring to idealized humans (as soon as it becomes powerful enough to compute what idealized humans would want).
That makes sense. I agree that goal-directed AI pointed at idealized humans could solve human safety problems, and it’s not clear whether non-goal-directed AI could do something similar.