Let’s talk about “Convergent Rationality”
What this post is about: I’m outlining some thoughts on what I’ve been calling “convergent rationality”. I think this is an important core concept for AI-Xrisk, and probably a big crux for a lot of disagreements. It’s going to be hand-wavy! It also ended up being a lot longer than I anticipated.
Abstract: Natural and artificial intelligences tend to learn over time, becoming more intelligent with more experience and opportunity for reflection. Do they also tend to become more “rational” (i.e. “consequentialist”, i.e. “agenty” in CFAR speak)? Steve Omohundro’s classic 2008 paper argues that they will, and the “traditional AI safety view” and MIRI seem to agree. But I think this assumes an AI that already has a certain sufficient “level of rationality”, and it’s not clear that all AIs (e.g. supervised learning algorithms) will exhibit or develop a sufficient level of rationality. Deconfusion research around convergent rationality seems important, and we should strive to understand the conditions under which it is a concern as thoroughly as possible.
I’m writing this for at least these 3 reasons:
I think it’d be useful to have a term (“convergent rationality”) for talking about this stuff.
I want to express, and clarify, (some of) my thoughts on the matter.
I think it’s likely a crux for a lot of disagreements, and isn’t widely or quickly recognized as such. Optimistically, I think this article might lead to significantly more clear and productive discussions about AI-Xrisk strategy and technical work.
Outline:
Characterizing convergent rationality
My impression of attitudes towards convergent rationality
Relation to capability control
Relevance of convergent rationality to AI-Xrisk
Conclusions, some arguments pro/con convergent rationality
Characterizing convergent rationality
Consider a supervised learner trying to maximize accuracy. The Bayes error rate is typically non-zero, meaning it's not possible to get 100% test accuracy just by making better predictions. If, however, the test data (or data distribution) were modified, for example to only contain examples of a single class, the learner could achieve 100% accuracy. If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).
We can view the supervised learning algorithm as either ignorant of, or indifferent to, the strategy of modifying the test data. But we can also view this behavior as a failure of rationality, where the learner is “irrationally” averse or blind to this strategy, by construction. A strong version of the convergent rationality thesis (CRT) would then predict that given sufficient capacity and “optimization pressure”, the supervised learner would “become more rational”, and begin to pursue the “modify the test data” strategy. (I don’t think I’ve formulated CRT well enough to really call it a thesis, but I’ll continue using it informally).
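To make the contrast above concrete, here is a minimal toy sketch (a hypothetical setup of my own, not an experiment from this post): the learner's accuracy depends both on its parameters and on the data it is evaluated on, but gradient-based training only ever touches the parameters, whereas a consequentialist "accuracy maximizer" could instead act on the data itself.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# noisy labels => Bayes error rate is non-zero, so no predictor reaches 100%
y = (X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)

def accuracy(w, X, y):
    return ((X @ w > 0).astype(int) == y).mean()

w = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # logistic predictions
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step: only w changes, never the data

print("gradient learner accuracy:", accuracy(w, X, y))   # noticeably below 1.0

# What a consequentialist with accuracy-as-utility would prefer: shrink the
# evaluation data to a single class, after which even a constant predictor wins.
X1, y1 = X[y == 1], y[y == 1]
print("accuracy after modifying the data:", (np.ones(len(y1), dtype=int) == y1).mean())  # 1.0
```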
More generally, CRT would imply that deontological ethics are not stable, and deontologists must converge towards consequentialists. (As a caveat, however, note that in general environments, deontological behavior can be described as optimizing a (somewhat contrived) utility function (grep “existence proof” in the reward modeling agenda)). The alarming implication would be that we cannot hope to build agents that will not develop instrumental goals.
I suspect this picture is wrong. At the moment, the picture I have is: imperfectly rational agents will sometimes seek to become more rational, but there may be limits on rationality which the “self-improvement operator” will not cross. This would be analogous to the limit of ω which the “add 1 operator” approaches, but does not cross, in the ordinal numbers. In other words, in order to reach “rationality level” ω+1, it’s necessary for an agent to already start out at “rationality level” ω. A caveat: I think “rationality” is not uni-dimensional, but I will continue to write as if it is.
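Spelling out the ordinal analogy: \( n + 1 < \omega \) for every finite \( n \), and yet \( \sup_{n < \omega}(n+1) = \omega \); so repeatedly applying “+1” from a finite starting point approaches ω but never crosses it, and reaching ω+1 requires already being at ω.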
My impression of attitudes towards convergent rationality
Broadly speaking, MIRI seem to be strong believers in convergent rationality, but their reasons for this view haven’t been very well-articulated (TODO: except the inner optimizer paper?). AI safety people more broadly seem to have a wide range of views, with many people disagreeing with MIRI’s views and/or not feeling confident that they understand them well/fully.
Again, broadly speaking, machine learning (ML) people often seem to think it’s a confused viewpoint bred out of anthropomorphism, ignorance of current/practical ML, and paranoia. People who are more familiar with evolutionary/genetic algorithms and artificial life communities might be a bit more sympathetic, and similarly for people who are concerned with feedback loops in the context of algorithmic decision making.
I think a lot of people working on ML-based AI safety consider convergent rationality to be less relevant than MIRI does, because 1) so far it is more of a hypothetical/theoretical concern, and 2) current ML (e.g. deep RL with bells and whistles) already seems dangerous enough because of known and demonstrated specification and robustness problems (e.g. reward hacking and adversarial examples).
In the many conversations I’ve had with people from all these groups, I’ve found it pretty hard to find concrete points of disagreement that don’t reduce to differences in values (e.g. regarding long-termism), time-lines, or bare intuition. I think “level of paranoia about convergent rationality” is likely an important underlying crux.
Relation to capability control
A plethora of naive approaches to solving safety problems by limiting what agents can do have been proposed and rejected on the grounds that advanced AIs will be smart and rational enough to subvert them. Hyperbolically, the traditional AI safety view is that “capability control” is useless. Irrationality can be viewed as a form of capability control.
Naively, approaches which deliberately reduce an agent’s intelligence or rationality should be an effective form of capability control (I’m guessing that’s a proposal in the Artificial Stupidity paper, but I haven’t read it). If this were true, then we might be able to build very intelligent and useful AI systems, but control them by, e.g. making them myopic, or restricting the hypothesis class / search space. This would reduce the “burden” on technical solutions to AI-Xrisk, making it (even) more of a global coordination problem.
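As a minimal sketch of one concrete reading of “make the agent myopic” (illustrative only; the Artificial Stupidity paper may propose something different): a standard Q-learning update bootstraps on estimated future value, and setting the discount to zero makes the agent optimize immediate reward only, removing by construction any incentive to pursue multi-step instrumental strategies.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Standard (non-myopic) temporal-difference update."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def q_update_myopic(Q, s, a, r, s_next, alpha=0.1):
    """Myopic variant: gamma = 0, so the value of the next state is ignored."""
    Q[s, a] += alpha * (r - Q[s, a])

# usage (hypothetical sizes): Q = np.zeros((n_states, n_actions)); q_update_myopic(Q, s, a, r, s_next)
```

CRT, as characterized above, is the worry that this kind of built-in restriction might not stay put.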
But CRT suggests that these methods of capability control might fail unexpectedly. There is at least one example (I’ve struggled to dig up) of a memory-less RL agent learning to encode memory information in the state of the world. More generally, agents can recruit resources from their environments, implicitly expanding their intellectual capabilities, without actually “self-modifying”.
Relevance of convergent rationality to AI-Xrisk
Believing CRT should lead to higher levels of “paranoia”. Technically, I think this should lead to more focus on things that look more like assurance (vs. robustness or specification). Believing CRT should make us concerned that non-agenty systems (e.g. trained with supervised learning) might start behaving more like agents.
Strategically, it seems like the main implication of believing in CRT pertains to situations where we already have fairly robust global coordination and a sufficiently concerned AI community. CRT implies that these conditions are not sufficient for a good prognosis: even if everyone using AI makes a good-faith effort to make it safe, if they mistakenly don’t believe CRT, they can fail. So we’d also want the AI community to behave as if CRT were true unless or until we had overwhelming evidence that it was not a concern.
On the other hand, disbelief in CRT shouldn’t allay our fears overly much; AIs need not be hyperrational in order to pose significant Xrisk. For example, we might be wiped out by something more “grey goo”-like, i.e. an AI that is basically a policy hyperoptimized for the niche of the Earth, and doesn’t even have anything resembling a world(/universe) model, planning procedure, etc. Or we might create AIs that are like superintelligent humans: having many cognitive biases, but still agenty enough to thoroughly outcompete us, and considering lesser intelligences of dubious moral significance.
Conclusions, some arguments pro/con convergent rationality
My impression is that intelligence (as in IQ/g) and rationality are considered to be only loosely correlated. My current model is that ML systems become more intelligent with more capacity/compute/information, but not necessarily more rational. If this is true, it creates exciting prospects for forms of capability control. On the other hand, if CRT is true, this supports the practice of modelling all sufficiently advanced AIs as rational agents.
I think the main argument against CRT is that, from an ML perspective, it seems like “rationality” is more or less a design choice: we can make agents myopic, we can hard-code flawed environment models or reasoning procedures, etc. The main counter-arguments arise from the von Neumann-Morgenstern utility theorem (VNMUT), which can be interpreted as saying “rational agents are more fit” (in an evolutionary sense). At the same time, it seems like the complexity of the real world (e.g. physical limits of communication and information processing) makes this a pretty weak argument. Humans certainly seem highly irrational, and distinguishing biases and heuristics can be difficult.
A special case of this is the “inner optimizers” idea. The strongest argument for inner optimizers I’m aware of goes like: “the simplest solution to a complex enough task (and therefore the easiest for weakly guided search, e.g. by SGD) is to instantiate a more agenty process, and have it solve the problem for you”. The “inner” part comes from the postulate that a complex and flexible enough class of models will instantiate such an agenty process internally (i.e. using a subset of the model’s capacity). I currently think this picture is broadly speaking correct, and is the third major (technical) pillar supporting AI-Xrisk concerns (along with Goodhart’s law and instrumental goals).
The issues with tiling agents also suggest that the analogy with ordinals I made might be stronger than it seems; it may be impossible for an agent to rationally endorse a qualitatively different form of reasoning. Similarly, while “CDT wants to become UDT” (supporting CRT), my understanding is that it is not actually capable of doing so (opposing CRT) because “you have to have been UDT all along” (thanks to Jessica Taylor for explaining this stuff to me a few years back).
While I think MIRI’s work on idealized reasoners has shed some light on these questions, I think in practice, random(ish) “mutation” (whether intentionally designed or imposed by the physical environment) and evolutionary-like pressures may push AIs across boundaries that the “self-improvement operator” will not cross, making analyses of idealized reasoners less useful than they might naively appear.
This article is inspired by conversations with Alex Zhu, Scott Garrabrant, Jan Leike, Rohin Shah, Micah Carrol, and many others over the past year and years.
Something which seems missing from this discussion is the level of confidence we can have for/against CRT. It doesn’t make sense to just decide whether CRT seems more true or false and then go from there. If CRT seems at all possible (ie, outside-view probability at least 1%), doesn’t that have most of the strategic implications of CRT itself? (Like the ones you list in the relevance-to-xrisk section.) [One could definitely make the case for probabilities lower than 1%, too, but I’m not sure where the cutoff should be, so I said 1%.]
My personal position isn’t CRT (although inner-optimizer considerations have brought me closer to that position), but rather, not-obviously-not-CRT. Strategies which depend on not-CRT should go along with actually-quite-strong arguments against CRT, and/or technology for making CRT not true. It makes sense to pursue those strategies, and I sometimes think about them. But achieving confidence in not-CRT is a big obstacle.
Another obstacle to those strategies is, even if future AGI isn’t sufficiently strategic/agenty/rational to fall into the “rationality attractor”, it seems like it would be capable enough that someone could use it to create something agenty/rational enough for CRT. So even if CRT-type concerns don’t apply to super-advanced image classifiers or whatever, the overall concern might stand because at some point someone applies the same technology to RL problems, or asks a powerful GAN to imitate agentic behavior, etc.
Of course it doesn’t make sense to generically argue that we should be concerned about CRT in absence of a proof of its negation. There has to be some level of background reason for thinking CRT might be a concern. For example, although atomic weapons are concerning in many ways, it would not have made sense to raise CRT concerns about atomic weapons and ask for a proof of not-CRT before testing atomic weapons. So there has to be something about AI technology which specifically raises CRT as a concern.
One “something” is, simply, that natural instances of intelligence are associated with a relatively high degree of rationality/strategicness/agentiness (relative to non-intelligent things). But I do think there’s more reasoning to be unpacked.
I also agree with other commenters about CRT not being quite the right thing to point at, but, this issue of the degree of confidence in doubt-of-CRT was the thing that struck me as most critical. The standard of evidence for raising CRT as a legitimate concern seems like it should be much lower than the standard of evidence for setting that concern aside.
I basically agree with your main point (and I didn’t mean to suggest that it “[makes] sense to just decide whether CRT seems more true or false and then go from there”).
But I think it’s also suggestive of an underlying view that I disagree with, namely: (1) “we should aim for high-confidence solutions to AI-Xrisk”. I think this is a good heuristic, but from a strategic point of view, I think what we should be doing is closer to: (2) “aim to maximize the rate of Xrisk reduction”.
Practically speaking, a big implication of favoring (2) over (1) is giving a relatively higher priority to research at making unsafe-looking approaches (e.g. reward modelling + DRL) safer (in expectation).
I recall an example of a Mujoco agent whose memory was periodically wiped storing information in the position of its arms. I’m also having trouble digging it up though.
In OpenAI’s Roboschool blog post:
I haven’t seen anyone argue for CRT the way you describe it. I always thought the argument was that we are concerned about “rational AIs” (I would say more specifically, “AIs that run searches through a space of possible actions, in pursuit of a real-world goal”), because (1) We humans have real-world goals (“cure Alzheimer’s” etc.) and the best way to accomplish a real-world goal is generally to build an agent optimizing for that goal (well, that’s true right up until the agent becomes too powerful to control, and then it becomes catastrophically false), (2) We can try to build AIs that are not in this category, but screw up*, (3) Even if we here all agree to not build this type of agent, it’s hard to coordinate everyone on earth to never do it forever. (See also: Rohin’s two posts on goal-directedness.)
In particular, when Eliezer argued a couple years ago that we should be mainly thinking about AGIs that have real-world-anchored utility functions (e.g. here or here) I’ve always fleshed out that argument as: ”...This type of AGI is the most effective and powerful type of AGI, and we should assume that society will keep making our AIs more and more effective and powerful until we reach that category.”
*(Remember, any AI is running searches through some space in pursuit of something, otherwise you would never call it “intelligence”. So one can imagine that the intelligent search may accidentally get aimed at the wrong target.)
The map is not the territory. A system can select a promising action from the space of possible actions without actually taking it. That said, there could be a risk of a “daemon” forming somehow.
I think I agree with this. The system is dangerous if its real-world output (pixels lit up on a display, etc.) is optimized to achieve a future-world-state. I guess that’s what I meant. If there are layers of processing that sit between the optimization process output and the real-world output, that seems like very much a step in the right direction. I dunno the details, it merits further thought.
Concerns about inner optimizers seem like a clear example of people arguing for some version of CRT (as I describe it). Would you disagree (why)?
I am imagining a flat plain of possible normative systems (goals / preferences / inclinations / whatever), with red zones sprinkled around marking those normative systems which are dangerous. CRT (as I understand it) says that there is a basin with consequentialism at its bottom, such that there is a systematic force pushing systems towards that. I’m imagining that there’s no systematic force.
So in my view (flat plain), a good AI system is one that starts in a safe place on this plain, and then doesn’t move at all … because if you move in any direction, you could randomly step into a red area. This is why I don’t like misaligned subsystems—it’s a step in some direction, any direction, away from the top-level normative system. Then “Inner optimizers / daemons” is a special case of “misaligned subsystem”, in which the random step happened to be into a red zone. Again, CRT says (as I understand it) that a misaligned subsystem is more likely than chance to be an inner optimizer, whereas I think a misaligned subsystem can be an inner optimizer but I don’t specify the probability of that happening.
Leaving aside what other people have said, it’s an interesting question: are there relations between the normative system at the top-level and the normative system of its subsystems? There’s obviously good reason to expect that consequentialist systems will tend to create consequentialist subsystems, and that deontological systems will tend to create deontological subsystems, etc. I can kinda imagine cases where a top-level consequentialist would sometimes create a deontological subsystem, because it’s (I imagine) computationally simpler to execute behaviors than to seek goals, and sub-sub-...-subsystems need to be very simple. The reverse seems less likely to me. Why would a top-level deontologist spawn a consequentialist subsystem? Probably there are reasons...? Well, I’m struggling a bit to concretely imagine a deontological advanced AI...
We can ask similar questions at the top-level. I think about normative system drift (with goal drift being a special case), buffeted by a system learning new things and/or reprogramming itself and/or getting bit-flips from cosmic rays etc. Is there any reason to expect the drift to systematically move in a certain direction? I don’t see any reason, other than entropy considerations (e.g. preferring systems that can be implemented in many different ways). Paul Christiano talks about a “broad basin of attraction” towards corrigibility but I don’t understand the argument, or else I don’t believe it. I feel like, once you get to a meta-enough level, there stops being any meta-normative system pushing the normative system in any particular direction.
So maybe the stronger version of not-CRT is: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system, with the exceptions of (1) entropic forces, and (2) programmers literally shutting down the AI, editing the raw source code, and trying again”. I (currently) would endorse this statement. (This is also a stronger form of orthogonality, I guess.)
RE “Is there any reason to expect the drift to systematically move in a certain direction?”
For bit-flips, evolution should select among multiple systems for those that get lucky and get bit-flipped towards higher fitness, but not directly push a given system in that direction.
For self-modification (“reprogramming itself”), I think there are a lot of arguments for CRT (e.g. the decision theory self-modification arguments), but they all seem to carry some implicit assumptions about the inner-workings of the AI.
What are “decision theory self-modification arguments”? Can you explain or link?
I don’t know of any formal arguments (though that’s not to say there are none), but I’ve heard the point repeated enough times that I think I have a fairly good grasp of the underlying intuition. To wit: most departures from rationality (which is defined in the usual sense) are not stable under reflection. That is, if an agent is powerful enough to model its own reasoning process (and potential improvements to said process), by default it will tend to eliminate obviously irrational behavior (if given the opportunity).
The usual example of this is an agent running CDT. Such agents, if given the opportunity to build successor agents, will not create other CDT agents. Instead, the agents they construct will generally follow some FDT-like decision rule. This would be an instance of irrational behavior being corrected via self-modification (or via the construction of successor agents, which can be regarded as the same thing if the “successor agent” is simply the modified version of the original agent).
Of course, the above example is not without controversy, since some people still hold that CDT is, in fact, rational. (Though such people would be well-advised to consider what it might mean that CDT is unstable under reflection—if CDT agents are such that they try to get rid of themselves in favor of FDT-style agents when given a chance, that may not prove that CDT is irrational, but it’s certainly odd, and perhaps indicative of other problems.) So, with that being said, here’s a more obvious (if less realistic) example:
Suppose you have an agent that is perfectly rational in all respects, except that it is hardcoded to believe 51 is prime. (That is, its prior assigns a probability of 1 to the statement “51 is prime”, making it incapable of ever updating against this proposition.) If this agent is given the opportunity to build a successor agent, the successor agent it builds will not be likewise certain of 51′s primality. (This is, of course, because the original agent is not incentivized to ensure that its successor believes that 51 is prime. However, even if it were so incentivized, it still would not see a need to build this belief into the successor’s prior, the way said belief is built into its own prior. After all, the original agent actually does believe 51 is prime; and so from its perspective, the primality of 51 is simply a fact that any sufficiently intelligent agent ought to be able to establish—without any need for hardcoding.)
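(To spell out the “incapable of ever updating” step in one line of Bayes: for any evidence \(E\) with \(P(E \mid H) > 0\),
\[
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \lnot H)\,P(\lnot H)} = \frac{P(E \mid H)\cdot 1}{P(E \mid H)\cdot 1 + P(E \mid \lnot H)\cdot 0} = 1,
\]
so no observation the agent assigns positive likelihood can ever shake its belief that 51 is prime.)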
I’ve now given two examples of irrational behavior being corrected out of existence via self-modification. The first example, CDT, could be termed an example of instrumentally irrational behavior; that is, the irrational part of the agent is the rule it uses to make decisions. The second example, conversely, is not an instance of instrumental irrationality, but rather epistemic irrationality: the agent is certain, a priori, that a particular (false) statement of mathematics is actually true. But there is a third type of “irrationality” that self-modification is prone to destroying, which is not (strictly speaking) a form of irrationality at all: “irrational” preferences.
Yes, not even preferences are safe from self-modification! Intuitively, it might be obvious that departures from instrumental and epistemic rationality will tend to be corrected; but it doesn’t seem obvious at all that preferences should be subject to the same kind of “correction” (since, after all, preferences can’t be labeled “irrational”). And yet, consider the following agent: a paperclip maximizer that has had a very simple change made to its utility function, such that it assigns a utility of −10,000 to any future in which the sequence “a29cb1b0eddb9cb5e06160fdec195e1612e837be21c46dfc13d2a452552f00d0” is printed onto a piece of paper. (This is the SHA-256 hash for the phrase “this is a stupid hypothetical example”.) Such an agent, when considering how to build a successor agent, may reason in the following manner:
And thus, this “irrational” preference would be deleted from the utility function of the modified successor agent. So, as this toy example illustrates, not even preferences are guaranteed to be stable under reflection.
It is notable, of course, that all three of the agents I just described are in some sense “almost rational”—that is, these agents are more or less fully rational agents, with a tiny bit of irrationality “grafted on” by hypothesis. This is in part due to convenience; such agents are, after all, very easy to analyze. But it also leaves open the possibility that less obviously rational agents, whose behavior isn’t easily fit into the framework of rationality at all—such as, for example, humans—will not be subject to this kind of issue.
Still, I think these three examples are, if not conclusive, then at the very least suggestive. They suggest that the tendency to eliminate certain kinds of behavior does exist in at least some types of agents, and perhaps in most. Empirically, at least, humans do seem to gravitate toward expected utility maximization as a framework; there is a reason economists tend to assume rational behavior in their proofs and models, and have done so for centuries, whereas the notion of intentionally introducing certain kinds of irrational behavior has shown up only recently. And I don’t think it’s a coincidence that the first people who approached the AI alignment problem started from the assumption that the AI would be an expected utility maximizer. Perhaps humans, too, are subject to the “convergent rationality thesis”, and the only reason we haven’t built our “successor agent” yet is because we don’t know how to do so. (If so, then thank goodness for that!)
Thanks, that was really helpful!! OK, so going back to my claim above: “there is no systematic force of any kind whatsoever on an AI’s top-level normative system”. So far I have six exceptions to this:
If an agent has a “real-world goal” (utility function on future-world-states), we should expect increasingly rational goal-seeking behavior, including discovering and erasing hardcoded irrational behavior (with respect to that goal), as described by dxu. But I’m not counting this as an exception to my claim because the goal is staying the same.
If an agent has a set of mutually-inconsistent goals / preferences / inclinations, it may move around within the convex hull (so to speak) of these goals / preferences / inclinations, as they compete against each other. (This happens in humans.) And then, if there is at least one preference in that set which is a “real-world goal”, it’s possible (though not guaranteed) that that preference will come out on top, leading to (0) above. And maybe there’s a “systematic force” pushing in some direction within this convex hull—i.e., it’s possible that, when incompatible preferences are competing against each other, some types are inherently likelier to win the competition than other types. I don’t know which ones that would be.
In the (presumably unusual) case that an agent has a “self-defeating preference” (i.e. a preference which is likelier to be satisfied by the agent not having that preference, as in dxu’s awesome SHA example), we should expect the agent to erase that preference.
As capybaralet notes, if there is evolution among self-reproducing AIs (god help us all), we can expect the population average to move towards goals promoting evolutionary fitness.
Insofar as there is randomness in how agents change over time, we should expect a systematic force pushing towards “high-entropy” goals / preferences / inclinations (i.e., ones that can be implemented in lots of different ways).
Insofar as the AI is programming its successors, we should expect a systematic force pushing towards goals / preferences / inclinations that are easy to program & debug & reason about.
The human programmers can shut down the AI and edit the raw source code.
Agree or disagree? Did I miss any?
See my response to rohin, below.
I’m potentially worried about both; let’s not make a false dichotomy!
While I generally agree with CRT as applied to advanced agents, the VNM theorem is not the reason why, because it is vacuous in this setting. I agree with steve that the real argument for it is that humans are more likely to build goal-directed agents because that’s the only way we know how to get AI systems that do what we want. But we totally could build non-goal-directed agents that CRT doesn’t apply to, e.g. Google Maps.
I definitely want to distinguish CRT from arguments that humans will deliberately build goal-directed agents. But let me emphasize: I think incentives for humans to build goal-directed agents are a larger and more significant and important source of risk than CRT.
RE VNMUT being vacuous: this is a good point (and also implied by the caveat from the reward modeling paper). But I think that in practice we can meaningfully identify goal-directed agents and infer their rationality/bias “profile”, as suggested by your work ( http://proceedings.mlr.press/v97/shah19a.html ), and Laurent Orseau’s ( https://arxiv.org/abs/1805.12387 ).
I guess my position is that CRT is only true to the extent that you build a goal-directed agent. (Technically, the inner optimizers argument is one way that CRT could be true even without building an explicitly goal-directed agent, but it seems like you view CRT as broader and more likely than inner optimizers, and I’m not sure how.)
Maybe another way to get at the underlying misunderstanding: do you see a difference between “convergent rationality” and “convergent goal-directedness”? If so, what is it? From what you’ve written they sound equivalent to me. ETA: Actually it’s more like “convergent rationality” and “convergent competent goal-directedness”.
That’s a reasonable position, but I think the reality is that we just don’t know. Moreover, it seems possible to build goal-directed agents that don’t become hyper-rational by (e.g.) restricting their hypothesis space. Lots of potential for deconfusion, IMO.
EDIT: the above was in response to your first paragraph. I think I didn’t respond RE the 2nd paragraph because I don’t know what “convergent goal-directedness” refers to, and was planning to read your sequence but never got around to it.
I would guess that Chapter 2 of that sequence would be the most (relevant + important) piece of writing for you (w.r.t this post in particular), though I’m not sure about the relevance.
Planned summary for the Alignment Newsletter:
Planned opinion:
At a meta-level: this post might be a bit too under-developed to be worth trying to summarize in the newsletter; I’m not sure.
RE the summary:
I wouldn’t say I’m introducing a single thesis here; I think there are probably a few versions that should be pulled apart, and I haven’t done that work yet (nor has anyone else, FWICT).
I think the use of “must” in your summary is too strong. I would phrase it more like “unbounded increases in the capabilities of an AI system drive an unbounded increase in the agenty-ness or rationality of the system”.
The purported failure of capability control I’m imagining isn’t because the AI subverts capability controls; that would be putting the cart before the horse. The idea is that an AI that doesn’t conceptualize itself as an agent would begin to do so, and that very event is a failure of a form of “capability control”, specifically the “don’t build an agent” form. (N.B.: some people have been confused by my calling that a form of capability control...)
My point is stronger than this: “we could still have AI systems that are far more ‘rational’ than us, even if they still have some biases that they do not seek to correct, and this could still lead to x-risk.” I claim that a system doesn’t need to be very “rational” at all in order to pose significant Xrisk. It can just be a very powerful replicator/optimizer.
RE the opinion:
See my edit to the comment about “convergent goal-directedness”, we might have some misunderstanding… To clarify my position a bit:
I think goal-directedness seems like a likely component of rationality, but we’re still working on deconfusing rationality itself, so it’s hard to say for sure
I think it’s only a component and not the same thing, since I would consider an RL agent that has a significantly restricted hypothesis space to be goal-directed, but probably not highly rational. CRT would predict that (given a sufficient amount of compute and interaction) such an agent would have a tendency to expand its (effective) hypothesis space to address inadequacies. This might happen via recruiting resources in the environment and eventually engaging in self-modification.
I think CRT is not well-formulated or specified enough (yet) to be something that one can agree/disagree with, without being a bit more specific.
After seeing your response, I think that’s right, I’ll remove it.
How is a powerful replicator / optimizer not rational? Perhaps you mean grey-goo type scenarios where we wouldn’t call the replicator “intelligent”, but it’s nonetheless a good replicator? Are you worried about AI systems of that form? Why?
Sure, I more meant competently goal-directed.
Yes, I’m worried about systems of that form (in some sense). The reason is: I think intelligence is just one salient feature of what makes a life-form or individual able to out-compete others. I think intelligence, and fitness even more so, are multifaceted characteristics. And there are probably many possible AIs with different profiles of cognitive and physical capabilities that would pose an Xrisk for humans.
For instance, any appreciable quantity of a *hypothetical* grey goo that could use any matter on earth to replicate (i.e. duplicate itself) once per minute would almost certainly consume the earth in less than one day (I guess modulo some important problems around transportation and/or its initial distribution over the earth, but you probably get the point).
More realistically, it seems likely that we will have AI systems that have some significant flaws but are highly competent at strategically relevant cognitive skills, able to think much faster than humans, and have very different (probably larger but a bit more limited) arrays of sensors and actuators than humans, which may pose some Xrisk.
The point is just that intelligence and rationality are important traits for Xrisk, but we should certainly not make the mistake of believing one/either/both are the only traits that matter. And we should also recognize that they are both abstractions and simplifications that we believe are often useful but rarely, if ever, sufficient for thorough and effective reasoning about AI-Xrisk.
This is still, I think, not the important distinction. By “significantly restricted”, I don’t necessarily mean that it is limiting performance below a level of “competence”. It could be highly competent, super-human, etc., but still be significantly restricted.
Maybe a good example (although maybe departing from the “restricted hypothesis space” type of example) would be an AI system that has a finite horizon of 1,000,000 years, but no other restrictions. There may be a sense in which this system is irrational (e.g. having time-inconsistent preferences), but it may still be extremely competently goal-directed.
Sure, but within AI, intelligence is the main feature that we’re trying very hard to increase in our systems that would plausibly let the systems we build outcompete us. We aren’t trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.
My main opposition to this is that it’s not actionable: sure, lots of things could outcompete us; this doesn’t change what I’ll do unless there’s a specific thing that could outcompete us that will plausibly exist in the future.
(It feels similar in spirit, though not in absurdity, to a claim like “it is possible that aliens left an ancient weapon buried beneath the surface of the Earth that will explode tomorrow, we should not make the mistake of ignoring that hypothesis”.)
Idk, if it’s superintelligent, that system sounds both rational and competently goal-directed to me.
Blaise Agüera y Arcas gave a keynote at this NeurIPS pushing ALife (motivated by specification problems, weirdly enough...: https://neurips.cc/Conferences/2019/Schedule?showEvent=15487).
The talk recording: https://slideslive.com/38921748/social-intelligence. I recommend it.
I think I was maybe trying to convey too much of my high-level views here. What’s maybe more relevant and persuasive here is this line of thought:
Intelligence is very multi-faceted
An AI that is super-intelligent in a large number (but small fraction) of the facets of intelligence could strategically outmaneuver humans
Returning to the original point: such an AI could also be significantly less “rational” than humans
Also, nitpicking a bit: to a large extent, society is trying to make systems that are as competitive as possible at narrow, profitable tasks. There are incentives for excellence in many domains. FWIW, I’m somewhat concerned about replicators in practice, e.g. because I think open-ended AI systems operating in the real-world might create replicators accidentally/indifferently, and we might not notice fast enough.
I think the main take-away from these concerns is to realize that there are extra risk factors that are hard to anticipate and for which we might not have good detection mechanisms. This should increase pessimism/paranoia, especially (IMO) regarding “benign” systems.
(non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?
I’m leaning on “we’re confused about what rationality means” here, and specifically, I believe time-inconsistent preferences are something that many would say seem irrational (prima facie). But
With 0, the AI never does anything and so is basically a rock. With 10^-8, it still seems rational and competently goal-directed to me, just with weird-to-me preferences.
Really? I feel like that at least depends on what the preference is. I could totally imagine that people have preferences to e.g. win at least one Olympic medal, but further medals are less important (which is history-dependent), be the youngest person to achieve <some achievement> (which is finite horizon), eat ice cream in the next half hour (but not care much after that).
You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.
I agree that we are probably not going to build superintelligent AIs that have a horizon of 10^-8s, just because our preferences don’t have horizons of 10^-8s, and we’ll try to build AIs that optimize our preferences.
I’m trying to point at “myopic RL”, which does, in fact, do things.
I do object, and still object, since I don’t think we can realistically include the current time in the state. What we can include is: an impression of what the current time is, based on past and current observations. There’s an epistemic/indexical problem here you’re ignoring.
I’m not an expert on AIXI, but my impression from talking to AIXI researchers and looking at their papers is: finite-horizon variants of AIXI have this “problem” of time-inconsistent preferences, despite conditioning on the entire history (which basically provides an encoding of time). So I think the problem I’m referring to exists regardless.
Ah, an off-by-one miscommunication. Sure, it’s both rational and competently goal-directed.
I mean, if you want to go down that route, then “win at least one medal” is also not state-dependent, because you can’t realistically include “whether Alice has won a medal” in the state: you can only include an impression of whether Alice has won a medal, based on past and current observations. So I still have the same objection.
Oh, I see. You probably mean AI systems that act as though they have goals that will only last for e.g. 5 seconds. Then, 2 seconds later, they act as though they have goals that will last for 5 more seconds, i.e. 7 seconds after the initial time. (I was thinking of agents that initially care about the next 5 seconds, and then after 2 seconds, they care about the next 3 seconds, and after 7 seconds, they don’t care about anything.)
I agree that the preferences you were talking about are time-inconsistent, and such agents seem both less rational and less competently goal-directed to me.
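One way to make that distinction precise (my gloss): a sliding-window agent at time \(t\) ranks plans by \( U_t(\pi) = \sum_{k=t}^{t+H} r_k(\pi) \), while a fixed-endpoint agent uses \( U_t(\pi) = \sum_{k=t}^{T} r_k(\pi) \) for a fixed \(T\). In the sliding case, the reward \(r_{t+H+1}\) is irrelevant when planning at time \(t\) but relevant when re-planning at time \(t+1\), so the agent can predictably want to revise its own earlier plans with no new information; in the fixed-endpoint case the objective only loses terms that have already been realized, so no such reversal occurs.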
I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay. A CDT agent will not make other CDT agents. A myopic agent, one that only cares about the next hour, will create a subagent that only cares about the first hour after it was created. (Aeons later it will have taken over the universe and put all the resources into time-travel and worrying that its clock is wrong.)
I am not aware of any irrationality that I would consider to make an agent safe, useful, and stable under self-modification or subagent creation.
“I would say that there are some kinds of irrationality that will be self modified or subagented away, and others that will stay.”
^ I agree; this is the point of my analogy with ordinal numbers.
A completely myopic agent (that doesn’t directly do planning over future time-steps, but only seeks to optimize its current decision) probably shouldn’t make any sub-agents in the first place (except incidentally).
“If the learner were a consequentialist with accuracy as its utility function, it would prefer to modify the test distribution in this way in order to increase its utility. Yet, even when given the opportunity to do so, typical gradient-based supervised learning algorithms do not seem to pursue such solutions (at least in my personal experience as an ML researcher).”
Can you give an example for such an opportunity being given but not taken?
I have unpublished work on that. And a similar experiment (with myopic reinforcement learning) in our paper “Misleading meta-objectives and hidden incentives for distributional shift.” ( https://sites.google.com/view/safeml-iclr2019/accepted-papers?authuser=0 )
The environment used in the unpublished work is summarized here: https://docs.google.com/presentation/d/1K6Cblt_kSJBAkVtYRswDgNDvULlP5l7EH09ikP2hK3I/edit?usp=sharing