It matters when the first sharp left turn happens
Thanks to Evan Hubinger for comments on these ideas.
Introduction
A “sharp left turn” is a point where capabilities generalize beyond alignment. In a sharp left turn an AI becomes much more capable than it is aligned, and so starts to exploit flaws in its alignment.
This can look like Goodharting, where strong optimization pressure causes outer alignment failure because the base goal isn’t identical to what we want. Or it can look like deceptive alignment, where the model is aligned to proxy goals that aren’t identical to the base goal, but we don’t notice until the model is capable enough to make the proxies fail.
However they happen, sharp left turns are bad.
An important question, though, is: when will the first sharp left turn happen?
Timing
Consider three scenarios:
Weak: The sharp left turn happens before models are able to cause existential risks. For instance, maybe GPT-4 has a sharp left turn but isn’t able to reliably execute on its plans.
Strong: The sharp left turn happens once capabilities are very super-human and dangerous if misused. In this world AIs are doing the bulk of the world’s scientific research by the time alignment unravels.
Human-level: The sharp left turn happens while capabilities are around human level, meaning both dangerous and not game-changingly helpful for research.
I think that the most dangerous of these is the human-level world.
In the “weak” scenario, we get to see the sharp left turn coming. We potentially get lots of experience with models looking aligned and then that alignment failing with increasing scale. In this world, we effectively get multiple shots at alignment because the first badly-misaligned models just aren’t that dangerous. We get to experiment with deceptively aligned models and get a grounded empirical understanding of why models become deceptively aligned and how we might stop that. In this world, alignment research gets a chance to become a paradigmatic scientific field before the models get too capable.
In the “strong” scenario, we get to do alignment research with super-human models that are still aligned. In this world, we don’t have to solve the technical challenge of alignment entirely on our own. Alignment research potentially gets much further along before we have to have an airtight solution and prevent sharp left turns.
Both of these seem like pretty hopeful worlds. By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools. We have to solve the problem entirely on our own, and we have to do it without any useful empirical feedback. This seems really bad.
Scales
The timing of sharp left turns depends on a few capability scales, some of which I conflated above:
The danger scale, above which the model is capable enough to pose an existential risk.
The research scale, above which the model is able to do scientific research better than humans.
The left-turn scale, above which the model is able to undergo a sharp left turn. A useful anchor here is given by deceptive alignment, which suggests the scale at which a model can successfully reason about its own training process.
The three scenarios I outlined are characterized by different orderings of these scales:
Weak: (left-turn < danger)
Strong: (research < left-turn)
Human-level: (left-turn ~ research, left-turn ~ danger)
There can be other orderings, leading to other outcomes. For instance, we could live in a world with (danger < left-turn < research). In that world, the left turn happens at a scale that’s dangerous but incapable of super-human research. Imagine a model that’s very capable of developing and executing plans using information freely available on the internet. Such a model could pose an existential risk by engineering pathogens that we already know how to make, without doing any novel R&D, and deploying them with competent but not superhuman logistics.
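The orderings above can be sketched as a small classifier. This is only an illustrative sketch: the `Scales` class, the `classify` function, the numeric values, and the relative tolerance `tol` used to stand in for “~” are all assumptions made for this example, not something the scenarios themselves specify. Note also that the “weak” and “strong” conditions are not mutually exclusive (research < left-turn < danger satisfies both); the sketch arbitrarily checks “weak” first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scales:
    """Hypothetical positions of the three scales on one capability axis."""
    left_turn: float  # scale at which a sharp left turn becomes possible
    danger: float     # scale at which a model can pose existential risk
    research: float   # scale at which a model out-researches humans


def classify(s: Scales, tol: float = 0.1) -> str:
    """Map an ordering of the scales to a scenario name.

    `tol` is an assumed relative tolerance standing in for "~".
    """
    def close(a: float, b: float) -> bool:
        return abs(a - b) <= tol * max(abs(a), abs(b))

    # Human-level: left-turn ~ research and left-turn ~ danger
    if close(s.left_turn, s.research) and close(s.left_turn, s.danger):
        return "human-level"
    # Weak: left-turn < danger
    if s.left_turn < s.danger:
        return "weak"
    # Strong: research < left-turn
    if s.research < s.left_turn:
        return "strong"
    # The extra ordering discussed above: danger < left-turn < research
    if s.danger < s.left_turn < s.research:
        return "danger < left-turn < research"
    return "other"


if __name__ == "__main__":
    examples = [
        Scales(left_turn=1, danger=5, research=10),
        Scales(left_turn=10, danger=5, research=8),
        Scales(left_turn=5, danger=5.2, research=4.9),
        Scales(left_turn=5, danger=2, research=10),
    ]
    for s in examples:
        print(s, "->", classify(s))
```

Treating capability as a single number is itself a simplification, so the sketch is best read as a way to keep the orderings straight rather than as a model of real systems.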
Capabilities are (sort of) multi-dimensional
Of course capabilities aren’t strictly one-dimensional, so talking about “the scale of capabilities” conflates a few different ideas. Concretely, I think a model with fixed research ability could be either more or less dangerous, depending on other dimensions of its capabilities like “competence interacting with people” and “robust plan design”. These skills vary significantly in humans, and it seems plausible that we can engineer models to be more or less capable along these different dimensions.
That’s not to say that there’s complete freedom here. There’s definitely correlation between capabilities (e.g. lots of capabilities improve with scale). And in the limit of extreme research ability a model should be able to learn whatever other capabilities it needs. Intelligence is closely related to general purpose optimization ability, and to the extent that this holds we should expect different capabilities to scale together.
But we might live in worlds where it only takes models twice as capable as humans to solve the technical challenges of alignment, and those models may not be reflectively stable and may have lots of flawed abilities. In those worlds there’s a lot of benefit to thinking about which capabilities we want to weaken in models, and to trying to arrange that.
Outlook
If we could know when to expect a sharp left turn, that might change what we do. If we knew that a sharp left turn would happen long before dangerous capabilities, we might focus on developing alignment benchmarks and more empirical study of alignment failures. If we knew that it would happen long after powerful research capabilities, we might focus on applying those capabilities towards alignment work.
It seems plausible we can learn more about this timing question by studying low-likelihood outputs, studying whether capabilities are effectively one-dimensional or multi-dimensional, and reasoning about the kinds of capabilities that are dangerous and the kinds that are helpful.
Along similar lines, even if we don’t know when to expect a sharp left turn, there may be things we can do to shift that timing in a favorable direction. For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically (shifting towards the “weak” worlds). And we can preferentially focus alignment efforts on whatever the state of the art is at any time (currently large language models), rather than on very different architectures, so as to delay left turns until after we get superhuman models we can use to help with alignment research.
We actually can study and iterate on alignment (and deception) in simulation sandboxes while leveraging any powerful introspection/analysis tools. Containing human-level AGI in (correctly constructed) simulations is likely easy.
But then later down you say:
Which seems like a partial contradiction, unless you believe we can’t contain human-level agents?
I am skeptical that we can contain human-level agents, particularly if they have speed/memory advantages over humans.
Is this something you’ve thought deeply about and/or care to expand on? Curious about your source of skepticism, considering:
we can completely design and constrain the knowledge base of sim agents, limiting them to the equivalent of 1000BC human knowledge, or whatever we want
we can completely and automatically monitor their inner monologues, thoughts, etc
How do you concretely propose a sim agent will break containment? (which firstly requires sim-awareness)
How would you break containment now, assuming you are in a sim?
Also, significant speed/memory advantages go against the definition of ‘human-level agents’ and are intrinsically unlikely anyway: 2x speed/memory agents are simply preceded by 1x speed/memory agents, and the constraints of GPUs permit acceleration by parallelizing learning across agents more than by serial speedup of individual agents.
My modal expectation is that these are going to look something like large language models, meaning that while we’ll have controlled the training corpus it won’t be something hand-curated. So there will almost certainly be lots of information in there that models could use to e.g. figure out how to manipulate people, program computers, and so on. I expect then that we won’t be limiting their knowledge as much as you think we will, because I expect them to be trained with giant corpuses of (mostly) post-1000BC text.
We do not currently have the interpretability tools needed to monitor their inner thoughts. I would feel much better about this problem if we had such tools!
My best guess is that containment fails because a model convinces one of the people in the lab to let it access the internet. That could look like getting someone to think it’s genuinely a good idea (because it’ll be helpful for some project), or it could look like bribery (with offer of future payment) or blackmail.
I’m guessing that we’ll want to run some agents fast because that’s how you get lots of serial intellectual work done. So even if all we get are human-level agents, there’ll be a big incentive to run them fast.
Indeed a model trained on (or with full access to) our internet could be very hard to contain. That is in fact a key part of my argument. If it is hard to contain, it is hard to test. If it’s hard to test, we are unlikely to succeed.
So if we agree there, then perhaps you agree that giving untested AGI access to our immense knowledge base is profoundly unwise.
That’s then just equivalent to saying “I expect then that we won’t even bother with testing our alignment designs”. Do you actually believe that testing is unnecessary? Or do you just believe the leading teams won’t care? And if you agree that testing is necessary, then shouldn’t this be key to any successful alignment plan?
Which obviously is nearly impossible if it doesn’t know it is in a sim, and doesn’t know what a lab or the internet are, and lacks even the precursor concepts. Since you seem to be focused on the latest fad (large language models), consider the plight of poor simulated Elon Musk—who at least is aware of the sim argument.
Thanks for pushing back on some stuff here.
Is this your way of saying you don’t think LLMs will scale to AGI (or AGI-level capabilities)?
This seems like it will happen with economic incentives (I mean, WebGPT already exists) and models will already be trained on the entire internet, so it’s not clear to me how we’d prevent this?
It’s more that AGI will contain an LLM module, as the brain contains linguistic cortex modules. Indeed several new neuroscience studies have shown LLMs are functionally equivalent to human brain linguistic cortex: trained the same way on the same data, resulting in very similar learned representations.
It’s also obvious from the LLM scaling that the larger LLMs are already comparable to linguistic cortex but lag the brain in many core capabilities. You don’t get those capabilities from just making a larger vanilla LLM / linguistic cortex.
Training just a language module on the internet is not dangerous by itself, but of course yes the precedent is concerning. There are now several projects working on multi-modal foundation agents that control virtual terminals with full web access. If that continues and reaches AGI before the safer sim/game training path, then we may be in trouble.
Well if we agree that we obviously can contain human-level agents in sims if we want to, then it’s some combination of the standard spreading awareness, advocacy, and advancing sim techniques.
Consider for example if there was a complete alignment solution available today—and it necessarily required changing how we trained models. Clearly, the fact that there is current inertia in the wrong direction can’t be some knockdown argument against doing the right thing. If you learn your train is on a collision course you change course or jump.
The usual framing is to align agent policies in a way that also ensures that they don’t have an out-of-distribution phase change. With deceptive alignment, sharp left turn, and goodharting, capability to consider complicated plans or mere optimization-in-actuality prompts situations that are too out-of-distribution. The behavior on-distribution stops being a good indication of what’s going to happen there, regardless of difficulties of modern ML with robustness. In this framing, the training distribution also needs to confer alignment, so it anchors to more usual situations that can currently be captured in practice, with more capable agents or over-optimizing agents discovering more unusual situations than that. This anchoring of the character of the training distribution might be a problem.
It seems useful to deconfuse agents that don’t have that sort of phase change in their behavior, without being distracted by the separate problem of their alignment. When not focusing on aligned agents, it becomes more natural to consider arbitrary agents in arbitrary situations, including situations that can’t be obtained from the world as training data, so won’t be useful for ensuring alignment. And to treat such situations as training data for ML purposes.
The simulator framing is an example of this point, except it still pays too much attention to currently observable reality (even as simulators themselves are bad at that task). A simulator is a map of counterfactuals and not an agent, and it’s not about describing particular agents, instead it describes all sorts of agents and their interaction in all sorts of situations simultaneously. Can a simulator be trained to be particularly good at describing agents that make coherent decisions across a wide variety of situations, including situations that can’t be collected as training data from the world, or won’t occur in actuality, and need to instead be generated, with little hope or intent of remaining predictive of the real world? Or maybe such coherence is more naturally found in something smaller than agents, a collective of intents/purposes/concepts/norms that each shapes a scope of situations where it’s relevant, competing/bargaining with others of its kind to find an equilibrium that doesn’t have the more jarring sorts of behavioral phase changes, especially capability-related ones.
Probably it also depends on how much information about “various models trying to naively backstab their own creators” there is in the training dataset.