In this essay, Rohin sets out to debunk what ey perceive as a prevalent but erroneous idea in the AI alignment community, namely: “VNM and similar theorems imply goal-directed behavior”. This is placed in the context of Rohin’s thesis that solving AI alignment is best achieved by designing AI which is not goal-directed. The main argument is: “coherence arguments” imply expected utility maximization, but expected utility maximization does not imply goal-directed behavior. Instead, it is a vacuous constraint, since any agent policy can be regarded as maximizing the expectation of some utility function.
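The vacuity point can be made concrete with a toy sketch (a hypothetical example of mine, not from the essay): given any policy whatsoever, we can construct a utility function for which that policy is the expected-utility maximizer, simply by assigning utility 1 to exactly the actions the policy takes.

```python
import random

# Toy state and action spaces (names are illustrative only).
states = ["s0", "s1", "s2"]
actions = ["a", "b"]

# An arbitrary -- even "incoherent" -- policy: state -> action.
arbitrary_policy = {s: random.choice(actions) for s in states}

# Rationalizing utility: u(s, a) = 1 iff a is what the policy does in s.
def utility(state, action):
    return 1.0 if action == arbitrary_policy[state] else 0.0

# The EU-maximizing policy w.r.t. this utility recovers the original policy,
# so "maximizes expected utility" by itself constrains nothing.
eu_maximizer = {s: max(actions, key=lambda a: utility(s, a)) for s in states}

assert eu_maximizer == arbitrary_policy
```

The construction works for any policy, which is exactly why EU maximization without further constraints rules out no behavior.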
I have mixed feelings about this essay. On the one hand, the core argument that VNM and similar theorems do not imply goal-directed behavior is true. To the extent that some people believed the opposite, correcting this mistake is important. On the other hand, (i) I don’t think the claim Rohin is debunking is the claim Eliezer had in mind in those sources Rohin cites (ii) I don’t think that the conclusions Rohin draws or at least implies are the right conclusions.
The actual claim that Eliezer was making (or at least my interpretation of it) is, coherence arguments imply that if we assume an agent is goal-directed then it must be an expected utility maximizer, and therefore EU maximization is the correct mathematical model to apply to such agents.
Why do we care about goal-directed agents in the first place? The reason is, on the one hand goal-directed agents are the main source of AI risk, and on the other hand, goal-directed agents are also the most straightforward approach to solving AI risk. Indeed, if we could design powerful agents with the goals we want, these agents would protect us from unaligned AIs and solve all other problems as well (or at least solve them better than we can solve them ourselves). Conversely, if we want to protect ourselves from unaligned AIs, we need to generate very sophisticated long-term plans of action in the physical world, possibly restructuring the world in a rather extreme way to safeguard it (compare with Bostrom’s arguments for mass surveillance). The ability to generate such plans is almost by definition goal-directed behavior.
Now, knowing that goal-directed agents are EU maximizers doesn’t buy us much. As Rohin justly observes, without further constraints it is a vacuous claim (although the situation becomes better if we constrain ourselves to instrumental reward functions). Moreover, the model of reasoning in complex environments that I’m advocating myself (quasi-Bayesian reinforcement learning) doesn’t even look like EU maximization (technically there is a way to interpret it as EU maximization, but it underspecifies the behavior). This is a symptom of the fact that the setting and assumptions of VNM and similar theorems are not good enough to study goal-directed behavior. However, I think it is an interesting and important line of research to try to figure out the right setting and assumptions.
This last point is IMO the correct takeaway from Rohin’s initial observation. In contrast, I remain skeptical about Rohin’s thesis that we should dispense with goal-directedness altogether, for the reason I mentioned before: powerful goal-directed agents seem necessary, or at least very desirable, to create a defense system from unaligned AI. Moreover, the study of goal-directed agents is important for understanding the impact of any powerful AI system on the world, since even a system not designed to be goal-directed can develop such agency (due to reasons like malign hypotheses, mesa-optimization and self-fulfilling prophecies).
This is placed in the context of Rohin’s thesis that solving AI alignment is best achieved by designing AI which is not goal-directed.
[...]
I remain skeptical about Rohin’s thesis that we should dispense with goal-directedness altogether
Hmm, perhaps I believed this when I wrote the sequence (I don’t think so, but maybe?), but I certainly don’t believe it now. I believe something more like:
Humans have goals and want AI systems to help them achieve them; this implies that the human-AI system as a whole should be goal-directed.
One particular way to do this is to create a goal-directed AI system, and plug in a goal that (we think) we want. Such AI systems are well-modeled as EU maximizers with “simple” utility functions.
But there could plausibly be AI systems that are not themselves goal-directed, but nonetheless the resulting human-AI system is sufficiently goal-directed. For example, a “genie” that properly interprets your instructions based on what you mean and not what you say seems not particularly goal-directed, but when combined with a human giving instructions becomes goal-directed.
One counterargument is that in order to be competitive, you must take the human out of the loop. I don’t find this compelling, for a few reasons. First, you can interpolate between lots of human feedback (the human says “do X for a minute” every minute to the “genie”) and not much human feedback (the human says “pursue my CEV forever”) depending on how competitive you need to be. This allows you to trade off competitiveness against how much of the goal-directedness remains in the human. Second, you can help the human to provide more efficient and effective feedback (see e.g. recursive reward modeling). Finally, laws and regulations can be effective at reducing competition.
Nonetheless, it’s not obvious how to create such non-goal-directed AI, and the AI community seems very focused on building goal-directed AI, and so there’s a good chance we will build goal-directed AI and will need to focus on alignment of goal-directed AI systems.
As a result, we should be thinking about non-goal-directed AI approaches to alignment, while also working on alignment of goal-directed systems.
I think when I wrote the sequence, I thought the “just do deep RL” approach to AGI wouldn’t work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that “just do deep RL” approaches lead to goal-directed systems.)
I think that the discussion might be missing a distinction between different types or degrees of goal-directedness. For example, consider Dialogic Reinforcement Learning. Does it describe a goal-directed agent? On the one hand, you could argue it doesn’t, because this agent doesn’t have fixed preferences and doesn’t have consistent beliefs over time. On the other hand, you could argue it does, because this agent is still doing long-term planning in the physical world. So, I definitely agree that aligned AI systems will only be goal-directed in the weaker sense that I alluded to, rather than in the stronger sense, and this is because the user is only goal-directed in the weak sense emself.
If we’re aiming at “weak” goal-directedness (which might be consistent with your position?), does it mean studying strong goal-directedness is redundant? I think the answer is clearly no. Strong goal-directed systems are a simpler special case on which to hone our theories of intelligence. Trying to understand weak goal-directed agents without understanding strong goal-directed agents seems to me like trying to understand molecules without understanding atoms.
On the other hand, I am skeptical about solutions to AI safety that require the user to do a sizable fraction of the actual planning. I think that planning does not decompose into an easy part and a hard part (which is not essentially planning in itself) in a way which would enable such systems to be competitive with fully autonomous planners. The strongest counterargument to this position, IMO, is the proposal to use counterfactual oracles or recursively amplified versions thereof in the style of IDA. However, I believe that such systems will still fail to be simultaneously safe and competitive because (i) forecasting is hard if you don’t know which features are important to forecast, and becomes doubly hard if you need to impose confidence thresholds to avoid catastrophic errors and in particular malign hypotheses (thresholds of the sort used in delegative RL) (ii) it seems plausible that competitive AI would have to be recursively self-improving (I updated towards this position after coming up with Turing RL) and that might already necessitate long-term planning and (iii) such systems are vulnerable to attacks from the future and to attacks from counterfactual scenarios.
I think when I wrote the sequence, I thought the “just do deep RL” approach to AGI wouldn’t work, and now I think it has more of a chance, and this has updated me towards powerful AI systems being goal-directed. (However, I do not think it is clear that “just do deep RL” approaches lead to goal-directed systems.)
To be clear, my own position is not strongly correlated with whether deep RL leads to AGI (i.e. I think it’s true even if deep RL doesn’t lead to AGI). But also, the question seems somewhat underspecified, since it’s not clear which algorithmic innovation would count as still “just deep RL” and which wouldn’t.
On the other hand, I am skeptical about solutions to AI safety that require the user doing a sizable fraction of the actual planning.
Agreed (though we may be using the word “planning” differently, see below).
If we’re aiming at “weak” goal-directedness (which might be consistent with your position?)
I certainly agree that we will want AI systems that can find good actions, where “good” is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I’m not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
To be clear, my own position is not strongly correlated with whether deep RL leads to AGI
Yes, that’s what I thought. My position is more correlated because I don’t see (strong) goal-directedness as a necessity, but I do think that deep RL is likely (though not beyond reasonable doubt) to lead to strongly goal-directed systems.
I certainly agree that we will want AI systems that can find good actions, where “good” is based on long-term consequences. However, I think counterfactual oracles and recursive amplification also meet this criterion; I’m not sure why you think they are counterarguments. Perhaps you think that the AI system also needs to autonomously execute the actions it finds, whereas I find that plausible but not necessary?
Maybe we need to further refine the terminology. We could say that counterfactual oracles are not intrinsically goal-directed. Meaning that, the algorithm doesn’t start with all the necessary components to produce good plans, but instead tries to learn these components by emulating humans. This approach comes with costs that I think will make it uncompetitive compared to intrinsically goal-directed agents, for the reasons I mentioned before. Moreover, I think that any agent which is “extrinsically goal-directed” rather than intrinsically goal-directed will have such penalties.
In order for an agent to gain strategic advantage, it is probably not necessary for it to be powerful enough to emulate humans accurately, reliably and significantly faster than real-time. We can consider three possible worlds:
World A: Agents that aren’t powerful enough for even a limited scope short-term emulation of humans can gain strategic advantage. This world is a problem even for Dialogic RL, but I am not sure whether it’s a fatal problem.
World B: Agents that aren’t powerful enough for a short-term emulation of humans cannot gain strategic advantage. Agents that aren’t powerful enough for a long-term emulation of humans (i.e. high bandwidth and faster than real-time) can gain strategic advantage. This world is good for Dialogic RL but bad for extrinsically goal-directed approaches.
World C: Agents that aren’t powerful enough for a long-term emulation of humans cannot gain strategic advantage. In this world delegating the remaining part of the AI safety problem to extrinsically goal-directed agents is viable. However, if unaligned intrinsically goal-directed agents are deployed before a defense system is implemented, they will probably still win, because of their more efficient use of computing resources, their lower risk aversion, the fact that even a sped-up version of the human algorithm might still have suboptimal sample complexity, and attacks from the future. Dialogic RL will also be disadvantaged compared to unaligned AI (because of risk aversion) but at least the defense system will be constructed faster.
Allowing the AI to execute the actions it finds is also advantageous because of higher bandwidths and shorter reaction times. But this concerns me less.
I think I don’t understand what you mean here. I’ll say some things that may or may not be relevant:
I don’t think the ability to plan implies goal-directedness. Tabooing goal-directedness, I don’t think an AI that can “intrinsically” plan will necessarily pursue convergent instrumental subgoals. For example, the AI could have “intrinsic” planning capabilities that find plans that, when executed by a human, lead to outcomes the human wants. Depending on how it finds such plans, such an AI may not pursue any of the convergent instrumental subgoals. (Google Maps would be an example of such an AI system, and by my understanding Google Maps has “intrinsic” planning capabilities.)
I also don’t think that we will find the one true algorithm for planning (I agree with most of Richard’s positions in Realism about rationality).
I don’t think that my intuitions depend on an AI’s ability to emulate humans (e.g. Google Maps does not emulate humans).
Google Maps is not a relevant example. I am talking about “generally intelligent” agents. Meaning that, these agents construct sophisticated models of the world starting from a relatively uninformed prior (comparably to humans or more so)(fn1)(fn2). This is in sharp contrast to Google Maps, which operates strictly within the model it was given a priori. General intelligence is important, since without it I doubt it will be feasible to create a reliable defense system. Given general intelligence, convergent instrumental goals follow: any sufficiently sophisticated model of the world implies that achieving convergent instrumental goals is instrumentally valuable.
I don’t think it makes that much difference whether a human executes the plan or the AI itself. If the AI produces a plan that is not human comprehensible and the human follows it blindly, the human effectively becomes just an extension of the AI. On the other hand, if the AI produces a plan which is human comprehensible, then after reviewing the plan the human can just as well delegate its execution to the AI.
I am not sure what the significance of “one true algorithm for planning” is in this context. My guess is, there is a relatively simple qualitatively optimal AGI algorithm(fn3), and then there are various increasingly complex quantitative improvements of it, which take into account specifics of computing hardware and maybe our priors about humans and/or the environment. Which is the way algorithms for most natural problems behave, I think. But also, improvements probably stop mattering beyond the point where the AGI can come up with them on its own within a reasonable time frame. And, I dispute Richard’s position. But then again, I don’t understand the relevance.
(fn1) When I say “construct models” I am mostly talking about the properties of the agent rather than the structure of the algorithm. That is, the agent can effectively adapt to a large class of different environments or exploit a large class of different properties the environment can have. In this sense, model-free RL is also constructing models. Although I’m also leaning towards the position that explicitly model-based approaches are more likely to scale to AGI.
(fn2) Even if you wanted to make a superhuman AI that only solves mathematical problems, I suspect that the only way it could work is by having the AI generate models of “mathematical behaviors”.
(fn3) As an analogy, a “qualitatively optimal” algorithm for a problem in P is just any polynomial time algorithm. In the case of AGI, I imagine a similar computational complexity bound plus some (also qualitative) guarantee(s) about sample complexity and/or query complexity. By “relatively simple” I mean something like, can be described within 20 pages given that we can use algorithms for other natural problems.