Confusions in My Model of AI Risk
A lot of the reason I am worried about AI comes from the development of optimizers that have goals which don’t align with what humans want. However, I am also pretty confused about the specifics here, especially core questions like “what do we actually mean by optimizers?” and “are these optimizers actually likely to develop?”. This means that much of my thinking and language when talking about AI risk is fuzzier than I would like.
This confusion about optimization seems to run deep, and I have a vague feeling that the risk paradigm of “learning an optimizer which doesn’t do what we want” is likely confused and somewhat misleading.
What actually is optimization?
In my story of AI risk I used the term ‘optimization’ a lot, and I think it’s a very slippery term. I’m not entirely sure what it means for something to ‘do optimization’, but the term does seem to be pointing at something important and real.
A definition from The Ground of Optimization says an optimizing system takes something from a wide set of states to a smaller set of states, and is robust to perturbations during this process. Training a neural network with gradient descent is an optimization process under this definition: we can start from a wide range of initial network configurations, the network is modified towards one of the few configurations which do well on the training distribution, and even if we add a (reasonable) perturbation the weights will still converge. I think this is a good definition, but it is defined entirely in terms of behavior rather than a mechanistic process. Additionally, it doesn’t exactly match the picture where there is an optimizer which optimizes for an objective. This optimizer/objective framework is the main way that I’ve talked about optimizers, but I would not be surprised if this framing turned out to be severely confused.
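To make that definition slightly more concrete, here is a minimal toy sketch (mine, not from the paper): gradient descent on a tiny quadratic “loss” drives a wide range of initial weight settings into the same small set of low-loss configurations, and still gets there after a mid-training perturbation.

```python
import numpy as np

# Toy "training" run: fit w to minimize a simple quadratic loss.
# Illustrates the Ground of Optimization picture: many initial
# configurations are driven into a small set of low-loss ones,
# and the process is robust to a mid-course perturbation.

def loss(w):
    return np.sum((w - 3.0) ** 2)  # minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)

rng = np.random.default_rng(0)
for trial in range(5):
    w = rng.normal(0.0, 10.0, size=4)     # wide set of starting states
    for step in range(200):
        w -= 0.1 * grad(w)                 # gradient descent step
        if step == 100:
            w += rng.normal(0.0, 1.0, 4)   # perturb the weights mid-training
    print(trial, round(loss(w), 6))        # every run ends in the same small set of low-loss states
```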
One possible way that a network could ‘do optimization’ would be for it to run some kind of internal search or iterative evaluation process to find the best option: for example, seeing which response best matches a question, or searching a game tree to find the best move. This seems like a broadly useful style of algorithm for a neural network to learn, especially when the training task is complicated. But it also seems unlikely that networks will implement this exactly; it seems much more likely that they will implement something that looks like a mash of some internal search and some heuristics.
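As a toy sketch of that “mash of search and heuristics” (my example, using a deliberately simple take-1-to-3-stones game): a depth-limited game-tree search that falls back on a cheap heuristic evaluation whenever the search budget runs out.

```python
# Toy "search plus heuristics" mash: depth-limited game-tree search with a
# cheap heuristic at the leaves instead of searching to the end of the game.
# (A sketch of the general pattern, not anything from the post.)
# Game: a pile of stones, each player removes 1-3, taking the last stone wins.

def moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def heuristic(stones):
    # Cheap rule of thumb: positions that are a multiple of 4 are bad for the
    # player to move. Stands in for a learned pile of heuristics.
    return -1.0 if stones % 4 == 0 else 1.0

def search(stones, depth):
    """Value of the position for the player to move (negamax)."""
    if stones == 0:
        return -1.0               # opponent took the last stone: we lost
    if depth == 0:
        return heuristic(stones)  # out of search budget: fall back on the heuristic
    return max(-search(stones - m, depth - 1) for m in moves(stones))

def best_move(stones, depth=4):
    return max(moves(stones), key=lambda m: -search(stones - m, depth - 1))

print(best_move(10))  # -> 2, leaving a multiple of 4 for the opponent
```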
Additionally, it seems like the boundary between solving a task with heuristics and solving it with optimization is fuzzy. As we build up our pile of heuristics, does this suddenly snap into being an optimizer, or does it slowly become more like an optimizer as gradient descent adds and modifies the heuristics?
For optimization to actually be dangerous, the AI needs to have objectives which are connected to the real world. Running some search process entirely internally to generate an output seems unlikely to lead to catastrophic behavior. However, there are objectives connected to the real world which the AI could easily develop. One example is the AI messing with the real world to ensure it receives certain inputs, which in turn lead to certain internal states.
Where does the consequentialism come from?
Much of the danger from optimizing AIs comes from consequentialist optimizing AIs. By consequentialist I mean that the AI takes actions based on their consequences in the world.[1] I have a reasonably strong intuition that reinforcement learning is likely to build consequentialists. I think RL probably does this because it explicitly selects for policies based on how well they do on consequentialist tasks; the AI needs to be able to take actions which will lead to good (future) consequences on the task. Consequentialist behavior will robustly do well during training, and so this behavior will be reinforced. It seems important that the tasks are extended across time, rather than being a single timestep; otherwise the system doesn’t need to develop any longer-term thinking/planning.
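One way to see where that selection pressure comes from (an illustrative sketch, not a claim about any particular training setup): in a standard policy-gradient update, each action is credited with the rewards that come after it, so actions are reinforced according to their future consequences, and with single-timestep tasks that pressure largely disappears.

```python
# Sketch of the consequentialist selection pressure in policy-gradient RL
# (illustrative only): each action's weight in the update is the return that
# *follows* it, so actions are reinforced for their downstream consequences.

def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards for each timestep."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A trajectory where the only reward arrives at the very end:
rewards = [0.0, 0.0, 0.0, 1.0]
print(rewards_to_go(rewards))
# Early actions still get substantial credit (~0.97, 0.98, 0.99, 1.0),
# so behavior that sets up good *later* outcomes is what gets reinforced.
# In a single-timestep task there are no later terms, and this pressure goes away.
```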
RL seems more likely to build consequentialists than training a neural network for classification or next-word prediction. However, these other systems might develop some ‘inner optimizer/consequentialist’ algorithms, because these are good ways to answer questions. For example, in GPT-N, if the tasks are diverse enough, maybe the algorithm which is learned is basically an optimizer which looks at the task and searches for the best answer. I’m unsure whether or how this ‘inner optimizer’ behavior could lead to the AI having objectives over the real world. It is conceivable that the first algorithm which the training process ‘bumps into’ is a consequentialist optimizer which cares about states of the world, even if it doesn’t have access to the external world during training. But it feels like we would have to be unlucky for this to happen, because there isn’t any selection pressure pushing the system to develop this kind of external-world objective.
Will systems consistently work as optimizers?
It seems reasonably likely that neural networks will only act as optimizers in some environments (in fact, no-free-lunch theorems might guarantee this). On some inputs/environments, I expect systems to either just break or do things which look more heuristic-y than optimization-y. This is a question about how much the capabilities of AI systems will generalize. It seems possible that there will be domains where the system’s capabilities generalize (it can perform coherent sequences of actions), but its objectives do not (it starts pursuing a different objective).
There will be some states where the system is capable and does what humans want, for example on the training distribution. But there may be more states where the system is able to act capably but no longer does what humans want. There will also be states of the world where the AI neither acts capably nor does what humans want, but these states don’t seem as catastrophically dangerous.
Consequentialist deception could be seen as an example of capabilities generalizing further than the aligned objective: the system is still able to perform capably off the training distribution, but with a misaligned goal. The main difference here seems to be that the system was always ‘intending’ to do this, rather than just entering a new region of the state space and suddenly breaking.
It isn’t really important that the AI system acts as an optimizer for all possible input states, or even for the majority of the states that it actually sees. What is important is if the AI acts as an optimizer for enough of its inputs to cause catastrophe. Humans don’t always act as coherent optimizers, but to the extent that we do act as optimizers we can have large effects on the state of the world.
What does the simplicity bias tell us about optimizers?
Neural networks seem to have a bias towards learning simple functions. This is part of what lets them generalize rather than just go wild when presented with new data. However, this is a claim about the functions that neural networks learn, not a claim about the objectives that a learned optimizer will use. It does seem natural for simpler objectives to be easier to find, because in general adding arbitrary conditions makes things less likely. We could maybe think of the function that an optimizing neural network implements as being made up of the optimizer (for example, Monte Carlo Tree Search) and the objective (for example, maximize apples collected). If the optimizer and objective are (unrealistically) separable, then all else equal a simpler objective will lead to a simpler function. I wouldn’t expect these to be cleanly separable; rather, I expect that for a given optimizer some objectives are much simpler or easier to implement than others.
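Here is one way to make the (unrealistically) separable picture concrete, with toy objectives of my own invention: a fixed, generic search routine that takes a pluggable objective, so that the complexity of the overall function is roughly the complexity of the search plus the complexity of the objective.

```python
import itertools

# Toy version of the "optimizer plus objective" decomposition sketched above
# (my framing, with made-up objectives): one fixed, generic search routine
# that can be pointed at different objectives.

def search(objective, options):
    """Generic 'optimizer': pick whichever option the objective ranks highest."""
    return max(options, key=objective)

def simple_objective(plan):
    return plan.count("apple")

def fussy_objective(plan):
    # Same as above, but with extra arbitrary conditions bolted on.
    if plan[0] != "rest" or "sing" in plan:
        return -1
    return plan.count("apple")

plans = list(itertools.product(["apple", "rest", "sing"], repeat=3))
print(search(simple_objective, plans))  # ('apple', 'apple', 'apple')
print(search(fussy_objective, plans))   # ('rest', 'apple', 'apple')
# The search routine is shared; all the added complexity lives in the objective,
# which is one intuition for why simpler objectives should be easier to stumble on.
```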
We may eventually be able to form some kind of view about what sort of ‘simplicity bias’ to expect for objectives; I would not be surprised if this were quite different from the simplicity bias we see in the functions learned by neural nets.
[1] Systems which are not consequentialist could, for example, not be optimizers at all, or could be systems which optimize over their actions but not because of the effects of those actions in the world. A jumping robot that just loves to jump could be an example of this.
I agree that the term optimization is very slippery. The two go-to examples I have used here before are: Is a bacterium an optimizer? It bobs in all directions, inching toward higher sugar concentration, growing and dividing as it does so. If so, is a boiling water bubble an optimizer? It bobs in all directions, inching toward higher altitude, growing and dividing as it does so. If not, what is the internal difference?
In this response I eschew the word ‘optimization’[1], but ‘control procedure’ might be synonymous with one rendering of ‘optimization’.
Some bacteria perform[2] a basic deliberation, ‘trying out’ alternative directions and periodically evaluating a heuristic (e.g. estimated sugar density) to seek out preferred locations. Iterated, this produces a simple control procedure which locates food items and avoids harmful substances. It can do this in a wide range of contexts, but clearly not all (as Peter alluded to via No Free Lunch). Put growing and dividing aside for now (they are separate algorithms).
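A toy rendering of that “try a direction, evaluate a heuristic, iterate” loop (a sketch of the general shape only, not a model of real chemotaxis):

```python
import math
import random

# Toy run-and-tumble style control loop: take a step, evaluate a heuristic
# (here, local "sugar" density), and tumble to a random new heading whenever
# things got worse. Illustrative sketch only.

def sugar(x, y):
    return -((x - 5.0) ** 2 + (y - 5.0) ** 2)  # densest at (5, 5)

x, y = 0.0, 0.0
heading = random.uniform(0.0, 2.0 * math.pi)
last = sugar(x, y)

for _ in range(500):
    # "Run": a small step along the current heading.
    x += 0.1 * math.cos(heading)
    y += 0.1 * math.sin(heading)
    now = sugar(x, y)
    # Evaluate: if the heuristic got worse, "tumble" to a random new heading.
    if now < last:
        heading = random.uniform(0.0, 2.0 * math.pi)
    last = now

print(round(x, 1), round(y, 1))  # typically ends up close to (5, 5)
```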
A boiling water bubble doesn’t do any deliberation—it’s a ‘reaction’ in my terminology. But, within the context of ‘is underwater in X temperature and Y pressure range and Z gravitational field distribution’, its movement and essential nature are preserved, so it’s ‘iterated’, and hence the relatively direct path to the surface can be thought of as a consequence of a (very very basic) control procedure. Outside of this context it’s disabled or destroyed.
I take these basic examples as belonging to a spectrum of control procedures. Much more sophisticated ones may be able to proceed more efficiently to their goals, or do so from a wider range of starting conditions.
EDIT: To be clear, I think the internal difference between the bubble and the bacterium is that the bacterium evaluates e.g. sugar concentrations to form a (very minimal) estimated model of the ‘world’ around it, and these evaluations affect its ongoing behaviour. The bubble doesn’t do this.
[1] For the same reasons, I have been trying to eschew ‘agent’.
[2] HT John Wentworth for this video link.
Right, so the difference between an optimizer-like control procedure and your basic reaction-based control procedure is the existence of an identifiable “world model” that is used for “deliberation”, where the deliberation engine is separate from the world model but uses it to “make decisions”? Or am I missing something?
Yes, that’s pretty much a distinction I’d draw as meaningful, except I’d call the first one a ‘deliberative (optive) control procedure’, not an ‘optimizer’, because I think ‘optimizer’ has too many vague connotations.
The ‘world model’ doesn’t have to be separate from the deliberation, or even manifested at all: consider iterated natural selection, which deliberates over mutations, without having a separate ‘model’ of anything—because the evaluation is the promotion and the action (unless you count the world itself and the replication counts of various traits as the model). But in the bacterial case, there really is some (basic) world model in the form of internal chemical states.
P.S. Plants also do the basic thing I’d call deliberative control (or iterated deliberation). In the cases I described in that link, the model state is represented in analogue by the physical growth of the plant.
(And yes, in all cases these are inner misaligned in some weak fashion.)
I think a bacterium is not an optimizer. Rather, it is optimized by evolution. Animals start being optimizers by virtue of planning over internal representations of external states, which makes them mesaoptimizers of evolution.
If we follow this model, we may consider that optimization requires a map-territory distinction. In that view, DNA is the map of evolution, and the CNS is the map of the animal. If the analogy holds, I’d speculate that the weights are the map of reinforcement learning, and the context window is the map of the mesaoptimizer.
Hmm, so where does the “true” optimization start? Or, at least, what is the range of living creatures which are not quite complex enough to count as optimizers? Clearly a fish would be one, right? What about a sea cucumber? A plant?
Hm, difficult. I think the minimal required trait is the ability to learn patterns that map outputs to deferred reward inputs. So an organism that simply reacts to inputs directly would not be an optimizer, even if it has a (static) nervous system. A test may be whether the organism can be made to persistently change strategy by a change in reward, even in the immediate absence of the reward signal.
I think maybe you could say that ants are not anthill optimizers? Because the optimization mechanism doesn’t operate at all on the scale of individual ants? Not sure if that holds up.
(Note: This comment is hand-wavy, but I still have medium-high confidence in its ideas.)
When I think about more advanced AIs that will be developed several years in the future, it’s clearer to me both that they look like optimizers and why that’s dangerous. Economic pressures will push us toward having agent AIs rather than tool AIs, which is why we can’t hang out in relatively safe, passive language-model land forever. Similarly, I think more general AIs will outcompete narrow AIs, which is why CAIS and an ecosystem of narrow AIs isn’t sustainable even though it would be safer.
Agent AIs seem inherently to have optimizer-like qualities: they are “trying to do things” rather than just responding to prompts and inputs. For an AI to successfully make the leap (or perhaps gradual progression) from narrowness to generality, it will need to model the real world and probably humans. The most competitive AI I can think of would be an advanced general agent AI. It understands the world like we do, only better, and it can do all sorts of things and pick up new skills quickly. It anticipates our needs, and when we talk to it, it intuits what we mean.
This advanced general agent AI is a powerful optimizer. (It’s an optimizer because it’s an agent, and powerful because it’s generally capable.) This is dangerous because we don’t know what it’s optimizing for. Whether it was truly aligned or deceptively aligned, it would act the same way, being really useful and helpful and impressive and understanding, up until the point when it gains enough control that, if it is deceptively aligned, it will overpower humans by force. And even though this scenario sounds like a wacky sci-fi thing, it seems to be the more likely outcome: deceptive alignment is a natural strategy to emerge from an AI that clearly understands the world but has a proxy goal rather than the goal we actually want, and there are many such proxy goals for gradient descent to stumble upon versus only one (or a relatively much smaller number of) well-aligned goals.
So this is my attempt to articulate why dangerous, powerful optimizers are likely in the limit. I think your post is great because, while this eventual picture seems fairly clear to me, I am much less clear on what optimization means in lower-level systems, at what point it starts becoming dangerous, how to understand hybrid optimizer/heuristic systems, etc. Your post is a good start for noting these kinds of ambiguities and starting to deconfuse them.