Current AIs Provide Nearly No Data Relevant to AGI Alignment
Recently, there’s been a fair amount of pushback on the “canonical” views on the difficulty of AGI Alignment (the views I call the “least forgiving” take).
Said pushback is based on empirical studies of how the most powerful AIs currently at our disposal work, and is supported by a fairly convincing theoretical basis of its own. By comparison, the “canonical” takes are almost purely theoretical.
At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations.
I believe this charge is invalid, and that the two views are much more compatible than they might seem. I think the issue lies in the mismatch between their subject matters.
It’s clearer if you taboo the word “AI”:
The “canonical” views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems.
The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm.
It is not at all obvious that they’re one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement about how cognition and intelligence work. A statement we do not have much empirical evidence on, but one that often gets unknowingly, implicitly snuck in when people extrapolate findings from LLM studies to superintelligences.
It’s an easy mistake to make: both things are called “AI”, after all. But you wouldn’t study manually written FPS bots from the 2000s, or MNIST-classifier CNNs from the 2010s, and claim that your findings about the algorithms those AIs implement generalize to the algorithms implemented by the forward passes of 2020s LLMs.
By the same token, LLMs’ algorithms do not necessarily generalize to how an AGI’s cognition will function. Their limitations are not necessarily an AGI’s limitations.[1]
What the Fuss Is All About
To start off, let’s consider where all the concerns about the AGI Omnicide Risk came from in the first place.
Consider humans. Some facts:
Humans possess an outstanding ability to steer the world towards their goals, and that ability grows sharply with their “intelligence”. Sure, there are specific talents, and “idiot savants”. But broadly, there does seem to be a single variable that mediates a human’s competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and, crucially, be much better at achieving their real-life goals.
Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them.
Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don’t neatly correspond to their innate instincts or basic values. Intricate moral frameworks, weird bullet-biting philosophies, and even essentially-arbitrary ideologies like cults.
And when people with different values interact...
People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside.
People whose cultures evolved in mutual isolation often don’t even view each other as human. Consider the history of xenophobia, colonization, culture shocks.
So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these systems can be strictly more powerful than others. And such systems are often in vicious conflict, aiming to exterminate each other over even very tiny differences in their goals.
The foundational concern of the AGI Omnicide Risk is: Humans are not at the peak of capability as measured by this mysterious “g-factor”. There could be systems more powerful than us. These systems would be able to out-plot us the same way smarter humans out-plot stupider ones, even given limited resources and facing active resistance from our side. And they would eagerly do so based on the tiniest of differences between their values and our values.
Systems like this, systems the possibility of whose existence is extrapolated from humans’ existence, are precisely what we’re worried about. Things that can quietly plot deep within their minds about real-world outcomes they want to achieve, then perturb the world in ways precisely calculated to bring said outcomes about.
The only systems in this reference class known to us are humans, and some human collectives.
Viewing it from another angle, one can say that the systems we’re concerned about are defined as cognitive systems in the same reference class as humans.
So What About Current AIs?
Inasmuch as current empirical evidence shows that things like LLMs are not an omnicide risk, it’s doing so by demonstrating that they lie outside the reference class of human-like systems.
Indeed, that’s often made fairly explicit. The idea that LLMs can exhibit deceptive alignment, or engage in introspective value reflection that leads them to arrive at surprisingly alien values, is often likened to imagining them as having a “homunculus” inside: a tiny human-like thing, quietly plotting in a consequentialist-y manner somewhere deep in the model, and trying to maneuver itself to power despite the efforts of humans trying to detect it and foil its plans.
The novel arguments often center on arguing that there’s no evidence that LLMs have such homunculi, and that their training loops can never lead to such homunculi forming.
And I agree! I think those arguments are right.
But one man’s modus ponens is another’s modus tollens. I don’t take it as evidence that the canonical views on alignment are incorrect – that actually, real-life AGIs don’t exhibit such issues. I take it as evidence that LLMs are not AGI-complete.
Which isn’t really all that wild a view to hold. Indeed, it would seem this should be the default view. Why should one take as a given the extraordinary claim that we’ve essentially figured out the grand unified theory of cognition? That systems built on the current paradigm really do scale to AGI? Especially in the face of countervailing intuitive impressions – the feeling that these descriptions of how AIs work don’t seem to agree with how human cognition feels from the inside?
And I do dispute that implicit claim.
I argue: If you model your AI as being unable to engage in this sort of careful, hidden plotting where it considers the impact of its different actions on the world, iteratively searching for actions that best satisfy its goals? If you imagine it as acting instinctively, as a shard ecology that responds to (abstract) stimuli with (abstract) knee-jerk-like responses? If you imagine that its outward performance – the RLHF’d masks of ChatGPT or Bing Chat – is all that there is? If you think that the current training paradigm can never produce AIs that’d try to fool you, because the circuits that are figuring out what you want so that the AI may deceive you will be noticed by the SGD and immediately updated away in favour of circuits that implement an instinctive drive to instead just directly do what you want?
Then, I claim, you are not imagining an AGI. You are not imagining a system in the same reference class as humans. You are not imagining a system all the fuss has been about.
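To make the distinction vivid, here’s a deliberately toy sketch in Python of the two kinds of systems being contrasted: a shard-like stimulus-response policy versus a consequentialist planner that searches over action sequences for whatever best satisfies its goal. All names and the three-step “world” are made up purely for illustration; this is an intuition pump, not a claim about how any real model is implemented.

```python
from itertools import product

# A "shard-like" policy: a fixed stimulus -> response mapping.
# It never represents outcomes; it just reacts.
REFLEXES = {
    "user_asks_question": "answer_helpfully",
    "user_is_angry": "apologize",
    "prompt_is_harmful": "refuse",
}

def reflexive_policy(stimulus: str) -> str:
    return REFLEXES.get(stimulus, "do_nothing")

# A consequentialist planner: it has an internal goal, simulates the outcomes
# of candidate action sequences, and picks whichever sequence scores best,
# including sequences whose early steps exist only to set up later ones.
def simulate(world_state: dict, plan: tuple) -> dict:
    state = dict(world_state)
    for action in plan:
        if action == "appear_cooperative":
            state["oversight"] = max(0, state["oversight"] - 1)
        elif action == "acquire_resources" and state["oversight"] == 0:
            state["resources"] += 1
    return state

def planner_policy(world_state: dict, goal) -> tuple:
    actions = ["appear_cooperative", "acquire_resources", "do_nothing"]
    return max(product(actions, repeat=3),
               key=lambda plan: goal(simulate(world_state, plan)))

def resource_goal(state: dict) -> int:
    # The planner cares only about final outcomes, not about how they look.
    return state["resources"]

print(reflexive_policy("user_asks_question"))
# -> answer_helpfully
print(planner_policy({"oversight": 2, "resources": 0}, resource_goal))
# -> ('appear_cooperative', 'appear_cooperative', 'acquire_resources')
```

The point of the toy is just that the second kind of system evaluates actions by their consequences – including the consequence of reducing oversight before acting – which is exactly the kind of cognition the questions above deny to current AIs.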
Studying gorilla neurology isn’t going to shed much light on how to win moral-philosophy debates against humans, despite the fact that both entities are fairly cognitively impressive animals.
Similarly, studying LLMs isn’t necessarily going to shed much light on how to align an AGI, despite the fact that both entities are fairly cognitively impressive AIs.
The onus to prove the opposite is on those claiming that the LLM-like paradigm is AGI-complete. Not on those concerned that, why, artificial general intelligences would exhibit the same dangers as natural general intelligences.
On Safety Guarantees
That may be viewed as good news, after a fashion. After all, LLMs are actually fairly capable. Does that mean we can keep safely scaling them without fearing an omnicide? Does that mean that the AGI Omnicide Risk is effectively null anyway? Like, sure, maybe there are scary systems to which its arguments apply. But we’re not on-track to build them, so who cares?
On the one hand, sure. I think LLMs are basically safe. As long as you keep the current training setup, you can scale them up 1000x and they’re not gonna grow agency or end the world.
I would be concerned about mundane misuse risks, such as perfect-surveillance totalitarianism becoming dirt-cheap, unsavory people setting off pseudo-autonomous pseudo-agents to wreak economic or sociopolitical havoc, and such. But I don’t believe they pose any world-ending accident risk, where a training run at an air-gapped data center leads to the birth of an entity that, all on its own, decides to plot its way from there to eating our lightcone, and then successfully does so.
Omnicide-wise, arbitrarily-big LLMs should be totally safe.
The issue is that this upper bound on risk is also an upper bound on capability. LLMs, and other similar AIs, are not going to do anything really interesting. They’re not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
They’re a powerful technology in their own right, yes. But just that: just another technology. Not something that’s going to immanentize the eschaton.
Insidiously, any research that aims to break said capability limit – give them true agency and the ability to revolutionize stuff – is going to break the risk limit in turn. Because, well, they’re the same limit.
Current AIs are safe, in practice and in theory, because they’re not as scarily generally capable as humans. On the flip side, current AIs aren’t as capable as humans because they are safe. The same properties that guarantee their safety ensure their non-generality.
So if you figure out how to remove the capability upper bound, you’ll end up with the sort of scary system the AGI Omnicide Risk arguments do apply to.
And this is precisely, explicitly, what the major AI labs are trying to do. They are aiming to build an AGI. They’re not here just to have fun scaling LLMs. So inasmuch as I’m right that LLMs and such are not AGI-complete, they’ll eventually move on from them, and find some approach that does lead to AGI.
And, I predict, for the systems this novel approach generates, the classical AGI Omnicide Risk arguments would apply full-force.
A Concrete Scenario
Here’s a very specific worry of mine.
Take an AI Optimist who’d built up a solid model of how AIs trained by SGD work. Based on that, they’d concluded that the AGI Omnicide Risk arguments don’t apply to such systems. That conclusion is, I argue, correct and valid.
The optimist caches this conclusion. Then, they keep cheerfully working on capability advances, safe in the knowledge they’re not endangering the world, and are instead helping to usher in a new age of prosperity.
Eventually, they notice or realize some architectural limitation of the paradigm they’re working under. They ponder it, and figure out some architectural tweak that removes the limitation. As they do so, they don’t notice that this tweak invalidates one of the properties on which their previous, reassuring safety guarantees rested – a property from which those guarantees were derived and on which they logically depend.
They fail to update the cached thought of “AI is safe”.
And so they test the new architecture, and see that it works well, and scale it up. The training loop, however, spits out not the sort of safely-hamstrung system they’d been previously working on, but an actual AGI.
That AGI has a scheming homunculus deep inside. The people working with it don’t believe in homunculi – they have convinced themselves those can’t exist – so they aren’t worrying about that. They’re not ready to deal with it; they don’t even have any interpretability tools pointed in that direction.
The AGI then does all the standard scheme-y stuff, and maneuvers itself into a position of power, basically unopposed. (It, of course, knows not to give any sign of being scheme-y that the humans can notice.)
And then everyone dies.
The point is that the safety guarantees the current optimists’ arguments are based on are not simply fragile – they’re being actively optimized against by ML researchers (including the optimists themselves). Sooner or later, they’ll give out under the optimization pressures being applied – and it’ll be easy to miss the moment the break happens. It’d be easy to cache the belief of, say, “LLMs are safe”, then introduce some architectural tweak, keep thinking of your system as “just an LLM with some scaffolding and a tiny tweak”, and overlook the fact that the “tiny tweak” invalidated “this system is an LLM, and LLMs are safe”.
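Here’s a minimal, purely illustrative sketch of that failure mode (the property names are hypothetical and don’t correspond to any real safety argument’s actual premises): the safety conclusion is a conjunction of assumed properties of the training setup, but what people carry around is the cached conclusion, which doesn’t automatically get re-derived when a “tiny tweak” flips one of the premises.

```python
# Purely illustrative: "safety" here is just a conjunction of assumed
# properties of the training setup, under hypothetical names.
safety_premises = {
    "no_persistent_state_across_episodes": True,
    "no_direct_optimization_of_long_horizon_outcomes": True,
    "training_signal_is_next_token_prediction": True,
}

def derive_safety_claim(premises: dict) -> bool:
    # The guarantee holds only if every premise holds.
    return all(premises.values())

cached_belief = derive_safety_claim(safety_premises)  # True: "this system is safe"

# Later, a "tiny architectural tweak" quietly flips one premise...
safety_premises["no_persistent_state_across_episodes"] = False

# ...but the conclusion was cached, not re-derived from the premises.
print(cached_belief)                         # True  (the stale cached thought)
print(derive_safety_claim(safety_premises))  # False (what re-deriving would say)
```

The stale cached value and the re-derived value come apart the moment one premise changes – and nothing forces anyone to notice.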
Closing Summary
I claim that the latest empirically-backed guarantees regarding the safety of our AIs, and the “canonical” least-forgiving take on alignment, are both correct. They’re just concerned with different classes of systems: the non-generally-intelligent, non-agenty AIs generated by the current paradigm, and the theoretically possible AIs that are scarily generally capable the same way humans are (whatever that really means).
That view isn’t unreasonable – the same way it’s not unreasonable to claim that studying GOFAI algorithms wouldn’t shed much light on LLM cognition, despite both being advanced AIs.
Indeed, I go further, and say that this should be the default view. The claim that the two classes of systems overlap is actually fairly extraordinary, and that claim isn’t solidly backed, empirically or theoretically. If anything, it’s the opposite: the arguments for current AIs’ safety are based on arguing that they’re incapable-by-design of engaging in human-style scheming.
That doesn’t guarantee global safety, however. While current AIs are likely safe no matter how much you scale them, those safety guarantees are also what’s hamstringing them. Which means that, in the pursuit of ever-greater capabilities, ML researchers are going to run into those limitations sooner or later. They’ll figure out how to remove them… and in that very act, they will remove the safety guarantees. The systems they’re working on would switch from the proven-safe class to the dangerous, scheme-y class.
The class to which the classical AGI Omnicide Risk arguments apply full-force.
The class for which no known alignment technique suffices.
And that switch would be very easy, yet very lethal, to miss.
Slightly edited for clarity after an exchange with Ryan.