On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning.
Agreed (though we may be using the word “planning” differently, see below).
If we’re aiming at “weak” goal-directedness (which might be consistent with your position?)
I certainly agree that we will want AI systems that can find good actions, where “good” is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I’m not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
To be clear, my own position is not strongly correlated with whether deep RL leads to AGI
Yes, that’s what I thought. My position is more correlated because I don’t see (strong) goal-directedness as a necessity, but I do think that deep RL is likely (though not beyond reasonable doubt) to lead to strongly goal-directed systems.
I certainly agree that we will want AI systems that can find good actions, where “good” is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I’m not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
Maybe we need to further refine the terminology. We could say that counterfactual oracles are not intrinsically goal-directed. Meaning that, the algorithm doesn’t start with all the necessary components to produce good plans, but instead tries to learn these components by emulating humans. This approach comes with costs that I think will make it uncompetitive compared to intrinsically goal-directed agents, for the reasons I mentioned before. Moreover, I think that any agent which is “extrinsically goal-directed” rather than intrinsically goal-directed will incur such penalties.
In order for an agent to gain strategic advantage it is probably not necessary for it to be powerful enough to emulate humans accurately, reliably, and significantly faster than real-time. We can consider three possible worlds:
World A: Agents that aren’t powerful enough for even a limited scope short-term emulation of humans can gain strategic advantage. This world is a problem even for Dialogic RL, but I am not sure whether it’s a fatal problem.
World B: Agents that aren’t powerful enough for a short-term emulation of humans cannot gain strategic advantage. Agents that aren’t powerful enough for a long-term emulation of humans (i.e., high bandwidth and faster than real-time) can gain strategic advantage. This world is good for Dialogic RL but bad for extrinsically goal-directed approaches.
World C: Agents that aren’t powerful enough for a long-term emulation of humans cannot gain strategic advantage. In this world, delegating the remaining part of the AI safety problem to extrinsically goal-directed agents is viable. However, if unaligned intrinsically goal-directed agents are deployed before a defense system is implemented, they will probably still win, because of their more efficient use of computing resources, their lower risk aversion, the fact that even a sped-up version of the human algorithm might still have suboptimal sample complexity, and attacks from the future. Dialogic RL will also be disadvantaged compared to unaligned AI (because of risk aversion), but at least the defense system will be constructed faster.
Allowing the AI to execute the actions it finds is also advantageous because of higher bandwidth and shorter reaction times. But this concerns me less.
I think I don’t understand what you mean here. I’ll say some things that may or may not be relevant:
I don’t think the ability to plan implies goal-directedness. Tabooing goal-directedness, I don’t think an AI that can “intrinsically” plan will necessarily pursue convergent instrumental subgoals. For example, the AI could have “intrinsic” planning capabilities that find plans which, when executed by a human, lead to outcomes the human wants. Depending on how it finds such plans, such an AI may not pursue any of the convergent instrumental subgoals. (Google Maps would be an example of such an AI system, and by my understanding Google Maps has “intrinsic” planning capabilities; see the sketch after this list.)
I also don’t think that we will find the one true algorithm for planning (I agree with most of Richard’s positions in Realism about rationality).
I don’t think that my intuitions depend on an AI’s ability to emulate humans (e.g. Google Maps does not emulate humans).
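To make the Google Maps analogy concrete, here is a minimal sketch of what I mean by “intrinsic” planning without goal-directed behavior (a toy illustration of mine with made-up names, not a claim about how Google Maps is actually implemented): the system searches a fixed model for a plan that is good according to some criterion, returns the plan for a human to execute, and has no machinery at all for pursuing convergent instrumental subgoals.

```python
import heapq

def plan_route(graph, start, goal):
    """Find a lowest-cost route with Dijkstra's algorithm over a fixed road graph.

    The function only returns a plan (a list of waypoints) for the user to
    follow; it never takes any action in the world itself.
    """
    frontier = [(0, start, [start])]  # (cost so far, current node, path so far)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path  # the "plan": a sequence of waypoints
        if node in visited:
            continue
        visited.add(node)
        for neighbor, step_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + step_cost, neighbor, path + [neighbor]))
    return None  # no route exists

# Hypothetical toy map: the planner operates strictly within this given model.
roads = {
    "home": [("highway", 10), ("backstreet", 15)],
    "highway": [("office", 20)],
    "backstreet": [("office", 12)],
}
print(plan_route(roads, "home", "office"))  # ['home', 'backstreet', 'office']
```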
Google Maps is not a relevant example. I am talking about “generally intelligent” agents. Meaning that, these agents construct sophisticated models of the world starting from a relatively uninformed prior (comparably to humans or more so)(fn1)(fn2). This is in sharp contrast to Google Maps, which operates strictly within the model it was given a priori. General intelligence is important, since without it I doubt it will be feasible to create a reliable defense system. Given general intelligence, convergent instrumental goals follow: any sufficiently sophisticated model of the world implies that achieving convergent instrumental goals is instrumentally valuable.
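As a deliberately toy illustration of the distinction (only meant to show what “constructing a model” means here, not to suggest anything about how general intelligence would actually work): rather than being handed a road graph the way Google Maps is, a model-constructing agent starts from an uninformed prior and estimates the environment’s dynamics from interaction, and only then plans within the model it built.

```python
import random
from collections import defaultdict

def learn_transition_model(env_step, states, actions, samples_per_pair=1000):
    """Estimate P(next_state | state, action) from interaction alone.

    The agent starts with no model of the environment (an "uninformed prior"
    in the crudest possible sense) and builds one from observed transitions.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in states:
        for a in actions:
            for _ in range(samples_per_pair):
                counts[(s, a)][env_step(s, a)] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return model

# A hypothetical two-state environment the agent knows nothing about a priori:
# "move" switches the state with probability 0.8, "stay" keeps it.
def env_step(state, action):
    if action == "move" and random.random() < 0.8:
        return "B" if state == "A" else "A"
    return state

model = learn_transition_model(env_step, states=["A", "B"], actions=["stay", "move"])
print(model[("A", "move")])  # roughly {'B': 0.8, 'A': 0.2}
```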
I don’t think it makes that much difference whether a human executes the plan or the AI itself. If the AI produces a plan that is not comprehensible to the human and the human follows it blindly, the human effectively becomes just an extension of the AI. On the other hand, if the AI produces a plan which the human can comprehend, then after reviewing the plan the human can just as well delegate its execution to the AI.
I am not sure what the significance of “one true algorithm for planning” is in this context? My guess is, there is a relatively simple qualitatively optimal AGI algorithm(fn3), and then there are various increasingly complex quantitative improvements of it, which take into account specifics of computing hardware and maybe our priors about humans and/or the environment. This is the way algorithms for most natural problems behave, I think. Also, improvements probably stop mattering beyond the point where the AGI can come up with them on its own within a reasonable time frame. And, I dispute Richard’s position. But then again, I don’t understand the relevance.
(fn1) When I say “construct models” I am mostly talking about the properties of the agent rather than the structure of the algorithm. That is, the agent can effectively adapt to a large class of different environments or exploit a large class of different properties the environment can have. In this sense, model-free RL is also constructing models, although I’m also leaning towards the position that explicitly model-based approaches are more likely to scale to AGI.
(fn2) Even if you wanted to make a superhuman AI that only solves mathematical problems, I suspect that the only way it could work is by having the AI generate models of “mathematical behaviors”.
(fn3) As an analogy, a “qualitatively optimal” algorithm for a problem in P is just any polynomial time algorithm. In the case of AGI, I imagine a similar computational complexity bound plus some (also qualitative) guarantee(s) about sample complexity and/or query complexity. By “relatively simple” I mean something like, can be described within 20 pages given that we can use algorithms for other natural problems.
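For concreteness, the flavor of guarantee I have in mind is something like a standard PAC-style statement (purely illustrative; the exact form is not the point):

$$\Pr\big[\text{error of the output hypothesis} \le \epsilon\big] \ge 1 - \delta \quad \text{after} \quad m \le \mathrm{poly}\!\left(\tfrac{1}{\epsilon},\, \log\tfrac{1}{\delta},\, n\right) \text{ samples and } \mathrm{poly}(m) \text{ computation time},$$

where $n$ measures the complexity of the environment or hypothesis class. “Qualitatively optimal” means achieving some such polynomial bound at all; tightening the particular polynomial is what I mean by a quantitative improvement.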