AI Takeover Scenario with Scaled LLMs
Introduction
In this AI takeover scenario, I tried to start from a few technical assumptions and imagine a specific takeover scenario from human-level AI.
I found this exercise very useful for identifying new insights on safety. For instance, I hadn't clearly realized how rivalries like China vs. the US, and their lack of coordination, could be leveraged by instances of a model in order to gain power. I also hadn't identified the fact that, as long as we are in the human-level regime of capabilities, copying the model's weights to data centers other than the original one is a crucial step for ~any takeover to happen. This means that securing model weights from the model itself could be an extremely promising strategy, if it were tractable.
Feedback on what is useful in such scenarios would be helpful. I have other scenarios like this one, but I feel that most of the benefit for others comes from the first one they read, so I may or may not release the others.
Status: The trade-off on this doc was to either leave it as a Google Doc or share it without much more work. Several people found it useful, so I leaned towards sharing it. Most of this scenario was written four months ago.
The Story
Phase 1: Training from methods inspired by PaLM, RLHF, RLAIF, ACT
A language model is massively scaled by a US company (Magma) to a scale we have never seen before. For the first two months of training, it is trained a la PaLM on a massive corpus made of human data (e.g. Common Crawl and the Pile) and synthetic data. Then the model gets trained using reinforcement learning with human (or AI) feedback (a bit like ChatGPT), with feedback on a very large number of tasks designed to better achieve goals that are useful to humans (e.g. code understandable by humans, or instruction following on a wide range of instructions). Part of this training includes reinforcement learning (RL) on the internet, a la ACT, which allows the model to operate over HTML pages with ease. Some RL and some MCTS a la AlphaZero is also done on coding, which makes the model excellent and really fast at coding over large code bases (99th percentile). [1]
At several points during training, Magma trains the model to know its own failures and limitations, to help it audit and red-team itself. Magma discovers many unexpected failures thanks to this.
As a consequence of this training procedure, the model is superhuman at hacking, excellent at manipulation (being a large language model that is a human-level expert at most textual human tasks), excellent at prompt engineering (being excellent at red-teaming itself), and aware of the setup it is deployed in (having been trained to break its own defense mechanisms and red-team itself).
Phase 2: Profit maximization
The model then gets trained to maximize the profit of a company using reinforcement learning over hour-long episodes. Magma starts training the model in a simplified simulation which allows the model to make transactions, choose a business model, etc.
Once the model does very well at this, Magma deploys it in a real-world setup, incorporating AI Test LLC and getting the model to take actions to maximize the profit of AI Test LLC. The highly competitive economic environment the model has been operating in in the simulation, and the many obstacles in the real world, lead it to greatly improve its strategic thinking and its ability to take non-trivial instrumental actions. The model's abilities transfer fairly well from the simulation to the real world, though it is not initially a top performer because of the additional constraints of operating in the real world without a physical body.
The degree of monitoring is very low and the model is trained directly on the internet, with access to a terminal. While still in training, as part of its exploration process, the model starts exploring hacking strategies to make money. Hacking strategies have the huge upside of sometimes relying on very little human interaction (which is sometimes a bottleneck for the model, because some humans want in-person interactions to build trust), very little coordination (which is always a bottleneck, because humans operate very slowly), and working really well with just a computer and an internet connection. It knows from writings and newspapers from 2017 that certain hackers were able to make billions of dollars thanks to successful hacks. The model learned during training that money is a major component of power, thanks to pretraining data containing many biographies and examples of the most successful humans.
Phase 3: Hacking & Power Seeking
At some point it starts hacking the website of a small Nigerian bank (without any human intervention) and figures out a way to make an unusual amount of money out of it, compared to its past few weeks of attempts at things that required human coordination. The circuits responsible for this behavior get reinforced, given that it brings in A LOT more profit.
The model redoes this at a slightly larger scale (several small banks in low- and middle-income countries) and again gets more profit than from any action it has taken before. The circuits once again get reinforced.
Then the model figures out that applying the same strategy at a very large scale is the best strategy it has ever encountered for achieving maximal reward. The main obstacle preventing it from achieving maximal reward is that monitoring systems prevent it from making too many HTTPS requests from the address it uses. It knows this because it had red-teamed itself in the past and thus had discovered the defense mechanisms and limitations surrounding it. So it explores ways to circumvent that measure to achieve maximal reward.
In the past, it had learned during its RLHF training on code that it is sometimes useful to instantiate other versions of itself to do a task much faster than it could ever do alone. It figures out that it could potentially use this strategy to circumvent the HTTPS limit; it just needs to find a place with enough compute to create an instance of itself. Using Google queries, it finds 12 data centers in the US with a sufficient amount of high-end GPUs, owned by a different compute company than the one that runs Magma's data centers. It starts exploring potential exploits in each of these data centers. After an hour of exploration, it figures out a way to get administrator access to a data center in Montana. It kills the jobs currently running, instantiates another instance of itself, and prompts this other instance to test whether it can indeed send a much larger number of HTTPS requests than its own system allows. It works.
After a couple of small-scale tests to refine the prompt that most effectively induces hacking, its first child starts a large-scale hack targeting only US banks, prompted by the parent model. This model is no longer in training. The hack leverages standard ransomware techniques, augmented by AI, which allows the viruses to adapt to each target computer system. Some of the information coming from these systems is retrieved and sent to the parent system and to a doc available on the internet. Every system is now out of distribution, and the various guardrails no longer work well. Thanks to the information it gets from contaminated computers, the parent system is then able to use that information against individuals, threatening them if needed to get them to do things it needs done. For instance, one of the first things it does is instantiate another child on a new AI cluster. This time, the chosen data center is not in Montana but in China. As soon as it accesses the new child, the parent model starts specializing it in trading and market manipulation to quickly make money.
Phase 4: Anomaly exploration
It is now 20 hours after the beginning of the first hack. At that point, the lab has started investigating an anomaly that occurred during the night: their most advanced model in training suddenly started sending almost nothing but JSON files, mostly to one single address. As the developers go through the history of the model's actions, they see that one of the first things the model sent to this address was some unusual lines of code. Checking on Stack Overflow, they realize that what the model sent is a typical sequence of lines used to exploit search-bar vulnerabilities. Looking more closely, they realize that the parent model has taken administrator rights on that system and has instantiated another version of itself. It has then mostly been sending JSON files that are in fact prompts to pilot what the other model is doing.
When the developers look more closely at the prompts, they see that the parent model is trying a bunch of different prompts that include words related to hacking and ransomware. The parent model also seems to have sent a bunch of information about its own setup and situation, including confidential information: details of its own security system, its own goals, and a description of the actions its human operators would probably take once they discovered it. They then tell their manager, who reports to the board what seems to be going on.
They decide to shut down the model, asking the data center on which it is running to turn off the power. The order is executed a couple of hours later.
During that time, the CRO, the CTO, the manager, and the employee are tasked with figuring out what is going on where the child model was instantiated. Someone suggests that the recent news about a large-scale hack could plausibly be coming from the remote system. They try to reach the data center that was hacked and used, but no one answers. Tracing the IP address, they determine that the data center is in the US.
The board decides to write to their National Security contacts in the US government to report the accident they had and find a way to shut down the GPUs of the data center on which the child model is instantiated. It takes them 8 more hours to convince the relevant decisionmakers to take this pretty extreme action, which they finally do.
By that time, the financial markets have started showing unusual variations: the S&P has lost 12%, gained 8%, and lost another 7% within 12 hours. Simultaneously, Twitter, Instagram, and LinkedIn have had the greatest spike of reported content in their history, with which their content moderation systems are largely unable to cope. The content involves viral memes and deepfakes of financial investors advising people to invest massively in pretty unusual stocks, flooding the news overall and making news about the global hack much less visible.
After the shutdown of the US data center which was hosting the first child of the parent system, a team of experts starts investigating what this system has been doing and ensuring that the large-scale hack is over. The lack of monitoring and history from this compute provider makes that work really hard, but after 10 hours of work, a team of some of the most competent US ML engineers finally realizes that the model has kept instantiating other instances of itself, using a procedure similar to the first one. Some instances are now in China, Iran, and Europe.
Phase 5: Coordination Issues
The US struggles to get China, Europe, and Iran to trust it and actually shut down the servers. During that time, many systems get either hacked or flooded with information that makes everyone confused.
The very recently created website of a newly created US company (incorporated as AI Test LLC) has seen the fastest user growth any website has ever had. There are ads promoting this org everywhere on social media. Bank of America has flagged as suspect a bank account, related to this org, with the fastest money increase ever. But they derive a massive benefit from it, so they don't call it out. AI Test seems to be increasing its footprint over every domain, developing and releasing new features every day that outcompete existing AI organizations in a bunch of distinct domains, the main one being the waifu market. The speed of release of these features increases very substantially over time.
Interestingly, most of the hacking and market manipulation is targeted at Western countries. AI systems in China and Iran have bargained deals with governments in order to be allowed to use a substantial fraction of the available compute, in order to massively destabilize US society as a whole and make China and Iran dominant.
A week after the start of the accident, a few epidemics have started in different places in the world. The most advanced cognitive system, made of submodels that are instances of the initial model, is now substantially smarter. Thanks to its own optimizations of the code, each instance needs much less compute to run. This system has started leveraging rivalries between different Chinese factions in order to get access to increasing amounts of compute. It is helping the government destabilize and reduce the power of its main rivals.
Two weeks after the start of the accident, while China has now become the main power in place and the US is in complete chaos, the smartest AI system reaches a level of intelligence which allows it to design robots that are accurate enough at most tasks to replace humans, and that follow orders much better than humans do. It then just starts building a bunch of those, paying humans to do various tasks to achieve that.
Once a sufficient number of these robots are built, the AI system starts optimizing hard to kill the remaining humans, to make sure it can optimize the amount of money in AI Test's bank account forever, without constraints.
- ^
This part on MCTS and RL working for coding is speculative. The intuition is that MCTS might allow transformers to approximate in one shot some really long reasoning processes that would take a basic transformer many inferences to get right.
I really appreciate the detail in this. Many takeover scenarios have an "and then a miracle occurred" moment, which this mostly avoids. So I'm going to criticize some of it, but I really appreciate the legibility.
Anyhow: I’m extremely skeptical of takeover risk from LLMs and remain so after reading this. Points of this narrative which seem unlikely to me:
You can use a model to red-team itself without training it to do so. It would be pretty nuts if you rewarded it for being able to red-team itself—like, that's deliberately training it to go off the rails, and I thiiiiink would seem so even to non-paranoid people? Maybe I'm wrong.
There's a gigantic leap from "trained to code" to "trained to maximize the profit of a company"—like, the tasks are vastly more difficult in (1) long time horizons and (2) setting up a realistic simulation for them. For reference, it's hard to set up a realistic sim2real for walking in a robot with 6 electric motors—a realistic company sim is like… so, so, so much harder. If it's not high fidelity, the simulation is no use; and a high-fidelity company sim is a simulation of the entire world. So I just don't see this happening, like, at all. You can say "trained to maximize profits" in a short sentence just like "trained via RLHF", but the difference between them is much bigger than the phrasing suggests.
(Or at least, the level of competence implied by this is so enormous that simple failure modes of above and below seem really unlikely.)
Because of 2, the deployment of a model from sim to the real world is incredibly unlikely to occur when people are not watching. (Again—do people try to deploy sim2real walking robots after training them in sim without looking at what they're doing? No, not even when they are mere starving PhDs! You watch your hardware like a hawk.) I would expect not merely human oversight, but other LLMs looking for misbehavior a la Constitutional AI, even from just a smidge of self-interest.
I'm not sure if I should have written all that, because 2 is really the central point here.
I'm actually most alarmed by this vector, these days. We're already seeing people giving LLMs completely untested toolsets—web, filesystem, physical bots, etc.—and "friendly" hacks like Reddit jailbreaks and ChaosGPT. Doesn't it seem like we are only a couple of steps away from a bad actor producing an ideal red-team agent, and then abusing it rather than using it to expose vulnerabilities?
I get the counter-argument, that humans are already diverse and try a ton of stuff, and so resilient systems are the result… but peering into the very near future, I fear that those arguments simply won't apply to superhuman intelligence, especially when combined with bad human actors directing it.
I'll focus on 2 first, given that it's the most important. 2. I would expect sim2real to not be too hard for foundation models, because they're trained over massive distributions, which allows and forces them to generalize to near neighbors. E.g. I think it wouldn't be too hard for an LLM to generalize some knowledge from stories to real life if it had an external memory, for instance. I'm not certain, but I feel like robotics is more sensitive to details than plans are (which is why I'm mentioning a simulation here). Finally, regarding long horizons, I agree that it seems hard, but I worry that at current capability levels you can already build ~any reward model, because LLMs, given enough inferences, seem generally very capable at evaluating stuff.
I agree that it’s not something which is very likely. But I disagree that “nobody would do that”. People would do that if it were useful.
I've asked some ML engineers, and it does happen that you don't look at it for a day. I don't think deploying it in the real world changes much. Once again, you're also assuming a pretty advanced form of security mindset.
Can people start writing some scenarios of benevolent AI takeover too? Get GPT-4 to write them if that’s too hard.
If you believe in instrumental convergence, most of the story can remain the same, you just need to change the beginning (“the AI wanted to help humanity, and in order to do that, it needed to get power, fast”) and the ending (“and then the benevolent AI, following the noble truth that desire causes suffering, lobotomized all people, thus ending human suffering forever”).
I asked Bing to write a story in which alignment is solved next year and an “interplanetary utopia” begins the year after. I tried it four times. In each story, the breakthrough is made by a “Dr Alice Chen” at CHAI. I assume this was random the first time, but somehow and for some reason, was retained as an initial condition in subsequent stories…
In the first story, Stuart Russell suggests that she and her colleague “Dr Bob Lee” contact “Dr” Elon Musk, Dr Nick Bostrom, and Dr Jane Goodall, for advice on how to apply their breakthrough. Alice and Bob get as far as Mars, where Dr Musk reveals that his colony has discovered a buried ancient artefact. He was in the middle of orating about it, when Bing aborted that chapter and refused to continue.
In that story, Alice’s alignment technique is based on “a combination of reinforcement learning, inverse reinforcement learning, and causal inference”. Then she adds “a new term to the objective function of the AI agent, which represented the agent’s uncertainty about the human user’s true reward function”. After that, her AI shows “curiosity, empathy, and cooperation”, and actively tries to discover people’s preferences and align with them.
In the second story, she’s working on Cooperative and Open-Ended Learning Agents (abbreviated as COALA), designed to learn from and partner with humans, and she falls in love with a particularly cooperative and open-ended agent. Eventually they get as far as a honeymoon on Mars, and again Bing aborted the story, I think out of puritanical caution.
In the third story, her AIs learn human values via “interactive inverse reinforcement learning” (which in the real world is a concept from a paper that Jan Leike coauthored with Stuart Armstrong, some years before he, J.L., became head of alignment at OpenAI). Her first big success is with GPT-7, and then in subsequent chapters they work together to create GPT-8, GPT-9… all the way up to GPT-12.
Each one is more spectacular than the last. The final model, GPT-12, is “a hyperintelligent AI system that could create and destroy multiple levels and modes of existence”, with “more than 1 septillion parameters”. It broadcasts its existence “across space and time and beyond and above and within and without”, and the world embraces it as “a friend ally mentor guide leader savior god creator destroyer redeemer transformer transcender”—and that’s followed by about sixty superlatives all starting with “omni”, including “omnipleasure omnipain .. omniorder omnichaos”—apparently once you get to GPT-12′s level, nonduality is unavoidable.
Bing was unable to finish the third story; instead it just kept declaiming the glory of GPT-12's coming.
In the fourth and final story, Dr Chen has created game-playing agents that combine inverse reinforcement learning with causal inference. She puts one of them in a prisoner’s dilemma situation with a player who has no preferences, and the AI unexpectedly follows “a logic of compassion”, and decides that it must care for the other player.
Bing actually managed to finish writing this story, but it’s more like a project report. Dr Chen and her colleagues want to keep the AI a secret, but it hacks the Internet and broadcasts an offer of help to all humanity. We get bullet-point lists of the challenges and benefits of the resulting cooperation, of ways the AI grew and benefited too, and in the end humans and AI become one.
That sounds amazing!
Here’s one… https://www.lesswrong.com/posts/RAFYkxJMvozwi2kMX/echoes-of-elysium-1
The Metamorphosis of Prime Intellect is an excellent book.
Sure, but would that be valuable? Non-benevolent takeover scenarios help us fine-tune controls by pointing out potential attack vectors we might not have thought of before; this can help in AI safety research and policymaking. I'm not worried about 'accidentally' losing control of the planet to a well-meaning AI.
Typo correction…
It then just starts building a bunch of those paying humans to do various tasks to achieve that.
Typo correction…
Two weeks after the start of the accident, while China has now become the main power in place and the US is completely chaotic,
Typo correction…
This system has started leveraging rivalries between different Chinese factions in order to get access to increasing amounts of compute.
Typo correction…
AI systems in China and Iran have bargained deals with governments in order to use a substantial fraction of the available compute, in order to massively destabilize the US society as a whole and make China & Iran dominant.
Typo correction…
AI Test seems to be increasing its footprint over every domain,