Planned summary for the Alignment Newsletter:

This post introduces the _convergent rationality thesis_ (CRT): AI systems must become more “rational” or “agenty” as they become more capable. The strongest argument for CRT is <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@): for complex enough tasks, the simplest solution is to instantiate a more agenty policy, and have that policy solve the problem for you. David suspects that CRT is a common crux underlying many disagreements within AI safety, where MIRI believes the thesis (or at least gives it non-trivial weight) while ML researchers tend not to believe it. If CRT is true, then we should be more worried about AI x-risk, and “capability control” methods (that try to limit the AI’s influence) are likely to fail, since the AI will figure out how to evade the controls. However, even if CRT is false, we should still be worried: we could still have AI systems that are far more “rational” than us, even if they have some biases that they do not seek to correct, and this could still lead to x-risk.
Planned opinion:
As far as I can tell, the word “rational” here means the same thing as <@”goal-directedness”@>(@Intuitions about goal-directed behavior@), so I think of this as the “convergent goal-directedness thesis”, though David disagrees. I broadly agree with the convergent goal-directedness thesis; the post <@Coherence arguments do not imply goal-directed behavior@> was simply pointing out that this thesis is not implied by the VNM theorem; I think the thesis is still probably true. Since humans want to achieve goals, and are going to use AI systems as tools to help them, <@the combined human-AI system will need to be goal-directed@>(@Will humans build goal-directed agents?@). However, we don’t currently know how to make a superintelligent AI system such that the AI isn’t goal-directed while the human+AI system is goal-directed, and we do have some idea of how to create goal-directed AI systems, so it seems likely that we will build goal-directed AI systems and have them achieve (what we think are) our goals. The inner optimizers argument is a more specific version of this argument: the whole reason inner optimizers arise is because _for the tasks we humans set_, the simplest solution is (hypothesized to be) a goal-directed agent.
However, in a comment David suggests that CRT is meant to be distinct from arguments that humans will deliberately build goal-directed agents. I don’t really know what the content of CRT is then (what do “capable” and “rational” mean, if you want to separate them from concepts that are relevant to human goals), but I suspect I disagree with it and think it is false. I’d be interested to hear arguments from someone who thinks CRT is true.
At a meta-level: this post might be a bit too under-developed to be worth trying to summarize in the newsletter; I’m not sure.
RE the summary:
I wouldn’t say I’m introducing a single thesis here; I think there are probably a few versions that should be pulled apart, and I haven’t done that work yet (nor has anyone else, FWICT).
I think the use of “must” in your summary is too strong. I would phrase it more like “unbounded increases in the capabilities of an AI system drive an unbounded increase in the agenty-ness or rationality of the system”.
The purported failure of capability control I’m imagining isn’t because the AI subverts capability controls; that would be putting the cart before the horse. The idea is that an AI that doesn’t conceptualize itself as an agent would begin to do so, and that very event is a failure of a form of “capability control”, specifically the “don’t build an agent” form. (N.B.: some people have been confused by my calling that a form of capability control...)
My point is stronger than this: “we could still have AI systems that are far more ‘rational’ than us, even if they still have some biases that they do not seek to correct, and this could still lead to x-risk.” I claim that a system doesn’t need to be very “rational” at all in order to pose significant Xrisk. It can just be a very powerful replicator/optimizer.
RE the opinion:
See my edit to the comment about “convergent goal-directedness”; we might have some misunderstanding… To clarify my position a bit:
I think goal-directedness seems like a likely component of rationality, but we’re still working on deconfusing rationality itself, so it’s hard to say for sure
I think it’s only a component and not the same thing, since I would consider an RL agent that has a significantly restricted hypothesis space to be goal-directed, but probably not highly rational. CRT would predict that (given a sufficient amount of compute and interaction) such an agent would have a tendency to expand its (effective) hypothesis space to address inadequacies. This might happen via recruiting resources in the environment and eventually engaging in self-modification.
I think CRT is not well-formulated or specified enough (yet) to be something that one can agree/disagree with, without being a bit more specific.
> At a meta-level: this post might be a bit too under-developed to be worth trying to summarize in the newsletter; I’m not sure.
After seeing your response, I think that’s right, I’ll remove it.
> I claim that a system doesn’t need to be very “rational” at all in order to pose significant Xrisk. It can just be a very powerful replicator/optimizer.
How is a powerful replicator / optimizer not rational? Perhaps you mean grey-goo type scenarios where we wouldn’t call the replicator “intelligent”, but it’s nonetheless a good replicator? Are you worried about AI systems of that form? Why?
> I think it’s only a component and not the same thing, since I would consider an RL agent that has a significantly restricted hypothesis space to be goal-directed, but probably not highly rational. CRT would predict that (given a sufficient amount of compute and interaction) such an agent would have a tendency to expand its (effective) hypothesis space to address inadequacies. This might happen via recruiting resources in the environment and eventually engaging in self-modification.
> Perhaps you mean grey-goo type scenarios where we wouldn’t call the replicator “intelligent”, but it’s nonetheless a good replicator? Are you worried about AI systems of that form? Why?
Yes, I’m worried about systems of that form (in some sense). The reason is: I think intelligence is just one salient feature of what makes a life-form or individual able to out-compete others. I think intelligence, and fitness even more so, are multifaceted characteristics. And there are probably many possible AIs with different profiles of cognitive and physical capabilities that would pose an Xrisk for humans.
For instance, any appreciable quantity of a *hypothetical* grey goo that could use any matter on earth to replicate (i.e. duplicate itself) once per minute would almost certainly consume the earth in less than one day (I guess modulo some important problems around transportation and/or its initial distribution over the earth, but you probably get the point).
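The doubling arithmetic here can be checked in a couple of lines (a back-of-envelope sketch; the 1-gram seed mass is my own hypothetical, not from the comment):

```python
import math

# Approximate mass of the Earth, in grams.
earth_mass_g = 5.97e27
# Hypothetical starting mass for the grey goo (an assumption for illustration).
seed_mass_g = 1.0

# Number of once-per-minute doublings needed to exceed Earth's mass.
doublings = math.ceil(math.log2(earth_mass_g / seed_mass_g))
print(doublings)  # 93 -- i.e. about an hour and a half, far less than a day
```

Even a microgram seed only adds about 20 more doublings, so the “less than one day” conclusion is robust to the seed-mass assumption.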
More realistically, it seems likely that we will have AI systems that have some significant flaws but are highly competent at strategically relevant cognitive skills, able to think much faster than humans, and have very different (probably larger but a bit more limited) arrays of sensors and actuators than humans, which may pose some Xrisk.
The point is just that intelligence and rationality are important traits for Xrisk, but we should certainly not make the mistake of believing that either or both are the only traits that matter. And we should also recognize that they are both abstractions and simplifications that we believe are often useful but rarely, if ever, sufficient for thorough and effective reasoning about AI-Xrisk.
Sure, I more meant competently goal-directed.
This is still, I think, not the important distinction. By “significantly restricted”, I don’t necessarily mean that it is limiting performance below a level of “competence”. It could be highly competent, super-human, etc., but still be significantly restricted.
Maybe a good example (although maybe departing from the “restricted hypothesis space” type of example) would be an AI system that has a finite horizon of 1,000,000 years, but no other restrictions. There may be a sense in which this system is irrational (e.g. having time-inconsistent preferences), but it may still be extremely competently goal-directed.
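To make the time inconsistency concrete, here is a toy sketch (my own illustration, not David’s formalism): an agent that at every moment only counts rewards arriving within the next H steps will reverse its evaluation of a fixed future event as time passes.

```python
def values_event(t_now: int, t_event: int, horizon: int) -> bool:
    """Receding-horizon agent: at time t_now, it only counts rewards
    that arrive within `horizon` steps of the present."""
    return 0 <= t_event - t_now <= horizon

H = 5
# A reward scheduled at absolute time t=7:
print(values_event(t_now=0, t_event=7, horizon=H))  # False: beyond the horizon
print(values_event(t_now=3, t_event=7, horizon=H))  # True: the same event now counts
```

At t=0 the agent would happily trade away the t=7 reward; by t=3 it values it after all, a preference reversal with no new information, which is one standard sense of “time-inconsistent”.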
> I think intelligence is just one salient feature of what makes a life-form or individual able to out-compete others.
Sure, but within AI, intelligence is the main feature that we’re trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren’t trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.
My main opposition to this is that it’s not actionable: sure, lots of things could outcompete us; this doesn’t change what I’ll do unless there’s a specific thing that could outcompete us that will plausibly exist in the future.
(It feels similar in spirit, though not in absurdity, to a claim like “it is possible that aliens left an ancient weapon buried beneath the surface of the Earth that will explode tomorrow, we should not make the mistake of ignoring that hypothesis”.)
> an AI system that has a finite horizon of 1,000,000 years, but no other restrictions. There may be a sense in which this system is irrational (e.g. having time-inconsistent preferences), but it may still be extremely competently goal-directed.
Idk, if it’s superintelligent, that system sounds both rational and competently goal-directed to me.
> Sure, but within AI, intelligence is the main feature that we’re trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren’t trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.
I think I was maybe trying to convey too much of my high-level views here. What’s maybe more relevant and persuasive here is this line of thought:
Intelligence is very multi-faceted
An AI that is super-intelligent in a large number (but small fraction) of the facets of intelligence could strategically outmaneuver humans
Returning to the original point: such an AI could also be significantly less “rational” than humans
Also, nitpicking a bit: to a large extent, society is trying to make systems that are as competitive as possible at narrow, profitable tasks. There are incentives for excellence in many domains. FWIW, I’m somewhat concerned about replicators in practice, e.g. because I think open-ended AI systems operating in the real-world might create replicators accidentally/indifferently, and we might not notice fast enough.
> My main opposition to this is that it’s not actionable
I think the main take-away from these concerns is to realize that there are extra risk factors that are hard to anticipate and for which we might not have good detection mechanisms. This should increase pessimism/paranoia, especially (IMO) regarding “benign” systems.
> Idk, if it’s superintelligent, that system sounds both rational and competently goal-directed to me.
(non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?
I’m leaning on “we’re confused about what rationality means” here, and specifically, I believe time-inconsistent preferences are something that many would say seem irrational (prima facie). But
> (non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?
With 0, the AI never does anything and so is basically a rock. With 10^-8, it still seems rational and competently goal-directed to me, just with weird-to-me preferences.
> I believe time-inconsistent preferences are something that many would say seem irrational
Really? I feel like that at least depends on what the preference is. I could totally imagine that people have preferences to e.g. win at least one Olympic medal, but further medals are less important (which is history-dependent), be the youngest person to achieve <some achievement> (which is finite horizon), eat ice cream in the next half hour (but not care much after that).
You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.
I agree that we are probably not going to build superintelligent AIs that have a horizon of 10^-8s, just because our preferences don’t have horizons of 10^-8s, and we’ll try to build AIs that optimize our preferences.
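The state-augmentation move in this exchange can be sketched as a toy construction (the medal example and all numbers are illustrative): fold the relevant history summary, including a clock, into the state, and a history-dependent reward becomes Markov.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AugmentedState:
    obs: str          # the underlying observation
    has_medal: bool   # history summary: won at least one medal so far
    t: int            # clock folded into the state

def reward(s: AugmentedState, just_won: bool) -> float:
    # First medal is worth a lot, later medals much less: history-dependent,
    # but Markov in the augmented state.
    if not just_won:
        return 0.0
    return 10.0 if not s.has_medal else 1.0

print(reward(AugmentedState("podium", has_medal=False, t=100), just_won=True))  # 10.0
print(reward(AugmentedState("podium", has_medal=True, t=200), just_won=True))   # 1.0
```

David’s objection below then amounts to saying that `t` (and arguably `has_medal`) is not directly observable; the agent only has noisy impressions of it, so the construction really lives in belief space.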
> With 0, the AI never does anything and so is basically a rock
I’m trying to point at “myopic RL”, which does, in fact, do things.
> You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.
I do object, and still object, since I don’t think we can realistically include the current time in the state. What we can include is: an impression of what the current time is, based on past and current observations. There’s an epistemic/indexical problem here you’re ignoring.
I’m not an expert on AIXI, but my impression from talking to AIXI researchers and looking at their papers is: finite-horizon variants of AIXI have this “problem” of time-inconsistent preferences, despite conditioning on the entire history (which basically provides an encoding of time). So I think the problem I’m referring to exists regardless.
> I’m trying to point at “myopic RL”, which does, in fact, do things.
Ah, an off-by-one miscommunication. Sure, it’s both rational and competently goal-directed.
> I do object, and still object, since I don’t think we can realistically include the current time in the state.
I mean, if you want to go down that route, then “win at least one medal” is also not state-dependent, because you can’t realistically include “whether Alice has won a medal” in the state: you can only include an impression of whether Alice has won a medal, based on past and current observations. So I still have the same objection.
> finite-horizon variants of AIXI have this “problem” of time-inconsistent preferences
Oh, I see. You probably mean AI systems that act as though they have goals that will only last for e.g. 5 seconds. Then, 2 seconds later, they act as though they have goals that will last for 5 more seconds, i.e. 7 seconds after the initial time. (I was thinking of agents that initially care about the next 5 seconds, and then after 2 seconds, they care about the next 3 seconds, and after 7 seconds, they don’t care about anything.)
I agree that the preferences you were talking about are time-inconsistent, and such agents seem both less rational and less competently goal-directed to me.
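The two readings of “finite horizon” being disentangled here can be written side by side (a toy illustration with hypothetical numbers): a receding horizon always extends H steps from the present, while a fixed endpoint shrinks toward a final time T and then stops.

```python
def receding_cares(t_now: int, t_event: int, H: int) -> bool:
    # At every t_now, the agent values the next H steps.
    return 0 <= t_event - t_now <= H

def fixed_endpoint_cares(t_now: int, t_event: int, T: int) -> bool:
    # The agent values events up to a fixed final time T, and nothing after.
    return t_now <= t_event <= T

# An agent "with a 5-second horizon", queried 2 seconds in, about an event at t=6:
print(receding_cares(2, 6, H=5))        # True: within 5 steps of t=2
print(fixed_endpoint_cares(2, 6, T=5))  # False: past the fixed endpoint
```

The receding reading (David’s) is the one with preference reversals; the fixed-endpoint reading (Rohin’s original interpretation) has truncated but stable preferences.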
Blaise Agüera y Arcas gave a keynote at this NeurIPS pushing ALife (motivated, weirdly enough, by specification problems): https://neurips.cc/Conferences/2019/Schedule?showEvent=15487. The talk recording: https://slideslive.com/38921748/social-intelligence. I recommend it.