We Have No Plan for Preventing Loss of Control in Open Models
Note: This post is intended to be the first in a broader series of posts about the difficult tradeoffs inherent in public access to powerful open source models. While this post highlights some dangers of open models and discusses the possibility of global regulation, I am not, in general, against open source AI, or supportive of regulation of open source AI today. On the contrary, I believe open source software is, in general, one of humanity’s most important and valuable public goods. My goal in writing this post is to call attention to the risks and challenges around open models now, so we can use the time we still have before risks become extreme, to collectively explore viable alternatives to regulation, if indeed such alternatives exist.
Background
Most research into the control problem today starts from an assumption that the organization operating the AI system has some baseline interest, or is taking some baseline level of precaution, around maintaining control. Within this context, nearly all control-related research today focuses on questions like “what level of precautions represent a sufficient ‘baseline’ for maintaining control of AI systems?” and “what evals and other safeguards can we design that will warn us if loss-of-control-related capabilities like scheming, reward tampering, or broad misalignment begin to appear in models?” For example, almost everything in The Case for Ensuring that Powerful AIs Are Controlled [1] assumes that labs and other organizations deploying powerful AI systems have a baseline interest and are willing to take basic precautions around, maintaining control.
While research of this sort is extremely important, I believe that there is also an argument to be made that the AI safety community is currently very under-indexed on research into future scenarios where assumptions about the AI operator taking baseline safety precautions related to preventing loss of control do not hold.
AI Systems Without Control-Related Precautions
How could a powerful AI system be deployed without control-related precautions? There are at least two obvious ways this might happen. The first way, which has received the most attention, is future scenarios where state-level actors, or frontier labs, get locked into a capabilities race, such that there is strong pressure to ignore baseline safety precautions. Leopold Aschenbrenner sketches what this might look like in his paper Situational Awareness [2]:
The safety challenges of superintelligence would become extremely difficult to manage if you are in a neck-and-neck arms race. A 2 year vs. a 2 month lead could easily make all the difference. If we have only a 2 month lead, we have no margin at all for safety. In fear of the CCP’s intelligence explosion, we’d almost certainly race, no holds barred, through our own intelligence explosion—barreling towards AI systems vastly smarter than humans in months, without any ability to slow down to get key decisions right, with all the risks of superintelligence going awry that implies. [2]
Yoshua Benjio shares similar worries, arguing that: (emphasis his):
There are many risks regarding the race by several private companies and other entities towards human-level AI….The most important thing to realize, through all the noise of discussions and debates, is a very simple and indisputable fact: while we are racing towards AGI or even ASI, nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans. [3]
While Benjio and Aschenbrenner are right that AI arms race scenarios are extremely concerning – and we should do everything possible to avoid them – I believe that even in such scenarios, state-level actors and labs will likely have both the budget and the motivation to maintain at least some baseline control-related precautions of the sort that groups like Redwood Research and labs like Anthropic propose [1]. To be clear, such baseline precautions are far from ideal. However they do give us at least some risk management model for control-related risks and are much better than taking no precautions at all.
Based on this idea that labs and state actors will be willing to take some amount of precautions, as long as they don’t impact capability too much, nearly all safety researchers working on control have focused their recent work on identifying capability evaluations and safety cases that would give frontier AI developers the greatest protection, while incurring the lowest costs and impact to speed of development. As examples of such efforts, see the “Safety Case Sketch 2: AI Control” section of Anthropic’s Three Sketches of ASL-4 Safety Components [4] and A Sketch of An AI-Control Safety Case [5] by Buck Shlegaris.
Loss of Control in Open Models
However, there is a second type of scenario in which powerful AI systems may be deployed without baseline control-related precautions, which may be far more likely than the above scenarios and which has received far less attention from researchers so far. That scenario is: individuals and organizations developing powerful AI systems “in the wild”, outside of frontier labs, based on highly-capable open source models. In this post, I will argue that:
Global risks related to loss of control in open models are of equal to, or greater, severity than control-related risks arising from within labs.
Control risks from AI systems built on open source models are far more difficult to mitigate compared with control risks arising inside labs. And we currently have no workable mitigation strategies or plan for how to prevent loss of control in such systems globally.
The urgency of such risks has increased rapidly in the past year, with the release of open models like DeepSeek R1, evidence of real-world control-related risks in “model organisms” inside of labs, as well as widespread development of agentic AI systems.
If we are convinced by the above arguments, we must conclude that loss of control in open models represents a serious, global AI risk, for which there are no significant technical mitigations or workable regulatory plans – and it is a risk that is rapidly escalating. In light of this, we appeal to the AI research and policy communities to quickly increase research into and funding around this difficult topic.
How Researchers Think About Loss of Control in Open Models
Last August (2024) I had the opportunity to visit Constellation Research Center for a week and chat with a number of leading researchers working on the control problem. During those conversations, I asked several researchers the following question:
It appears that pretty much all research around the control problem today focuses on risks associated with models running inside frontier labs. Given that powerful open source models are rapidly becoming available to many users outside of labs, don’t open models seem like a major source of control-related risk we should be worried about as well?
I was surprised to find that, by and large, the researchers I spoke with were much less concerned about loss of control in open models, compared with loss of control inside of labs. While the reasoning for this stance varied, the most common argument was that the most powerful models available inside the labs will stay enough ahead of the open models, in terms of capability, that:
We can be fairly certain that any loss-of-control risks will appear first in the (more capable) closed models inside a lab.
Once control-related risks first appear inside a lab, the world will have time to react and manage control-related risks in open models, based on the capability gap between the best open modes and the best closed models.
For the rest of this post, for simplicity, I will call this position, “Loss of Control is Mainly a Labs Problem” and it can be summarized as:
Once researchers inside the labs start seeing control-related risks emerge, or reach a capability level where a loss of control would be concerning, they will collectively choose to stop, or be forced to stop, releasing open models, such that models capable of causing control-related risks will never become available outside of the labs.
While there are of course nuances in the perspectives held by various researchers and policymakers, my sense is that some version of this position is the majority view for why loss of control in open models is a relatively minor concern and why it doesn’t matter much that we aren’t prepared for it today. Unfortunately, there are a number of serious problems with this position.
Problems With “Loss of Control is Mainly a Labs Problem”
Problem 1 – Frontier labs are increasingly secretive and profit-seeking and it’s not clear they would publicly report a serious loss of control-related issue if they encountered one.
One of the main premises of “Loss of Control is Mainly a Labs Problem” is the idea that when labs start encountering serious control-related issues or capabilities, they will publicly report them to the world so the world can take collective action to ensure that the dangerous models aren’t published as open source.
In 2020, or 2021, this might have been a reasonable stance to take, given the smaller number of labs and the fact that they were more transparent. And also given that they were less focused on profit and more on public safety. However it is not at all obvious that labs would publicly disclose control-related risks if they encountered them today.
For example, Anthropic’s RSP [6] – which probably makes the strongest disclosure commitments out of any lab – still only commits to disclose summaries of evaluations and safeguards for new models Anthropic has deployed, not for models currently under development (see section 7.2.1). Furthermore, most labs do not provide public safety frameworks as strong as Anthropic’s and even if they did, we have many credible reports in 2024 from employees that should cause us to doubt whether such commitments would be upheld [7]. To summarize, there is no good reason in 2025 to assume that a lab encountering new model behavior or capability that raises concern around loss of control, will quickly and reliably report that information to the public.
Problem 2 – There is no agreed-upon standard that defines relevant thresholds or evidence that would constitute a “serious control risk” inside a lab anyway.
Even if we did have commitments from labs to report serious control-related risks, and even if we believed they would uphold those commitments, there remains a second problem, which is that there is no consensus among researchers around the kinds of model behavior that would constitute a “serious control risk” in any case.
For example, several recent studies have provided some of the first real evidence that leading models may be capable of engaging in scheming [8], deceptive alignment [9], and tampering with their own reward mechanisms [10]. However, there has been significant disagreement among researchers around whether or not such evidence is worthy of concern (e.g. see the comments on the announcement post for the first paper on Less Wrong [11]) and has sparked almost no concern at all from top-level policymakers [12].
Buck Shlegeris has written about this in an August 2024 post titled “Would catching your AIs trying to escape convince AI developers to slow down or undeploy?” in which he argues:
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I’m not so sure…. I’ve gotten more cynical from following the discourse about SB1047. If we can’t establish expert consensus on matters as easy-to-check as “what does the text of this bill say”, I feel pessimistic about rapidly establishing expert consensus on much more confusing questions like these. [13]
In short, given that there is no standard or agreement around what evidence of serious control-related risks will look like, we cannot count on labs reliably reporting them, even if they made public commitments to do so.
Problem 3 – Even if one of the labs does sound the alarm, it seems likely that other labs will not stop releasing open models anyway, absent regulation.
There are industry leaders who have deeply-held ideological beliefs that the benefits of open models will continue to outweigh the risks even if risks become much more serious in the future. For example, in his post Open Source AI is the Path Forward [14], Mark Zuckerberg writes:
Unintentional harm is when an AI system may cause harm even when it was not the intent of those running it to do so. For example, modern AI models may inadvertently give bad health advice. Or, in more futuristic scenarios, some worry that models may unintentionally self-replicate or hyper-optimize goals to the detriment of humanity…. On this front, open source should be significantly safer since the systems are more transparent and can be widely scrutinized. Historically, open source software has been more secure for this reason. [14]
Given how committed Meta currently is to releasing its models as open-weights – even in cases where the models begin to demonstrate extremely risky capabilities – it is not clear what kind of evidence of control-related risks would be required to make the company reconsider this stance. Several other leading labs like DeepSeek, xAI (Grok) and Alibaba (Quen) take similar stances.
The most worrisome consequence of this state of affairs, is that if even a single lab continues to release powerful open source models, then the fact that other labs are not doing so will make little difference in total global risk. Because as long as the general public has access to even one powerful open model, we will run roughly the same global loss-of-control risks as if every model in the same capability class was open.
Problem 4 – Policymakers have not committed to regulate open models that demonstrate risky capabilities.
Building on the above problem, a key assumption of “Loss of Control is Mainly a Labs Problem” appears to be a belief that if leading labs don’t voluntarily stop releasing open models once serious risks appear, then policymakers will quickly step in and pass regulations that force them to stop.
While a full analysis of global AI policy trends is beyond the scope of this post, even a cursory glance at today’s policy landscape should make it clear that no consensus exists among global regulators around anything like such a policy.
For example, even the relatively modest provisions included in California bill SB-1047 related to managing risk in open source models, resulted in a massive pushback from industry leaders, like Andrew Ng, who said:
I continue to be alarmed at the progress of proposed California regulation SB-1047 and the attack it represents on open source and more broadly on AI innovation…. Open source is a wonderful force that is bringing knowledge and tools to many people, and is a key pillar of AI innovation. I am dismayed at the concerted attacks on it. Make no mistake, there is a fight in California right now for the future health of open source. [15]
Similarly, the AI Alliance, a group of industry-leading companies, including the likes of IBM, Meta, Intel and AMD, released a comprehensive statement raising similar concerns about the bill [16].
Regardless of whether we agree or disagree with SB-1047, we must recognize that passing restrictive regulations on open source models is not something that is likely to happen quickly or easily, even at the scale of a single state, much less on a global scale. The next section builds on this point and explores the broader question of what it would really take, more generally, to pass and enforce global regulatory restrictions on open models.
Passing and Enforcing Effective Global Restrictions on Open Source Models Would be Extremely Difficult
There has been a great deal of discussion among researchers and policy people around the question of whether broad regulatory restrictions on open models are a good idea, all things considered. However, in this section I will explore a related question that has received much less attention, but which I believe is equally important. Namely: what would it actually take to pass and enforce strong regulatory restrictions on open models at a global scale on a short timeline? And perhaps more important still: is this even possible at all?
In short, I believe that passing and enforcing global restrictions on open models would be much more difficult than most people realize. I hope to explore the topic in greater detail in a future post, but I provide a short overview here, given that such restrictions represent the only significant policy proposal on the table today for addressing serious risks in open models.
In addition to how difficult it would be to achieve consensus around passing strong regulations in the first place (“Problem 4” in the last section) I see a number of additional practical challenges around the implementation and enforcement of global restrictions that I believe most researchers and policymakers have underestimated so far.
Challenge 1 – Regulations would need to be globally enforced to be effective.
The first challenging aspect of imposing global restrictions on open models – which we have hinted at already – is that in order to be effective, regulations would necessarily need to be global, in order to mitigate the most serious risks like loss of control. Why is this? The key reason is that if there are countries in the world where open models remain unregulated, then there will still be many actors that will continue to use open models in those countries and continue to run all of the risks we are concerned about the most.
Once AI models display advanced capabilities, like working autonomously on long-running tasks and acting as AI researchers working to improve their own capabilities, the consequences of even a single loss of control of a powerful AI system located anywhere could be catastrophic for the entire world – not just for the citizens of the unregulated country.
Additionally, modern AI models are such a disruptive technology and offer such unique capabilities, that many companies in regulated countries would simply set up subsidiaries or sign deals with companies in unregulated countries so that they can continue to use open source models there. It is therefore clear that passing regulations that are not global in nature would be insufficient to mitigate the most serious control-related risks, because AI developers would simply move operations overseas, similar to what has happened with regulated manufacturing processes and risky medical treatments in recent decades.
Passing and enforcing a global regulatory regime on this scale would be nearly unprecedented and at a minimum would likely require the cooperation of all major superpowers, including the US, the EU, China and Russia – a challenging group to align.
Challenge 2 – The required timelines for passing regulation and organizing global enforcement could be very short.
Epoch AI has conducted a research effort [17] to estimate how much the best open models have lagged behind the best closed models in recent years. They conclude:
Our analysis points to a lag of about one year, though the range of our estimates is 5 to 22 months. The lag has not clearly increased or decreased in the past five years. However, we expect the best open model next year (possibly Llama 4) to be even closer to the frontier than the best open models today. A lag of less than one year is short by the typical standards of legislation, which highlights the need to proactively assess capabilities at the frontier. [17]
A lag time of less than one year between the time when labs first begin to see concerning capabilities in closed models to the time when the regulation and enforcement mechanisms would need to be in place to prevent those same capabilities from becoming available in open models is an unrealistically short window for policymakers to negotiate and pass global regulations of this sort.
Challenge 3 – If labs stop releasing open models, they may be leaked anyway.
Another challenge is that even if the leading labs stop releasing open models – either voluntarily or due to regulatory changes, it seems likely that model weights will continue to be leaked to the general public anyhow.
For example, in February of 2023 when Llama 2 was still being developed as a closed-weights system, Meta launched a program to make an open version available to select researchers at governmental agencies and institutions. Researchers interested in obtaining access to the model weights were required to request access and agreed to a non-commercial license. Within a week, a downloadable torrent containing the weights were leaked and made available to the general public on 4chan [18].
To be fair, a leak like this would be much less likely in the future if regulators impose strong restrictions preventing labs from sharing model weights with third parties. On the flip side, the financial incentives for 3rd party actors to gain access to closed model weights in the future are also likely to be much greater than they were in 2023, given the increased capabilities of models and the increase in real-world applications that depend on those models.
Challenge 4 – Penalties for possession would need to be severe and extreme levels of surveillance may be required to enforce them.
The topic of leaked models is a good transition into the next challenge, which is that for regulation to succeed in severely restricting public access to open models, the enforcement regime would almost surely need to impose severe penalties and intensive surveillance in order to achieve this goal.
In case this sounds extreme, I emphasize once again that I am not advocating for such policies. However, I believe it is important to recognize that such mechanisms are likely to be required by any regulatory regime that has this goal.
A quick look at some analogous regulation efforts from recent decades may be helpful to ground our intuitions. For example, regulations aiming to eliminate pirated media (e.g. movies, TV shows, audiobooks, software, etc.) which have been in place for decades, have largely failed to prevent widespread access to such materials [19].
Imagine policymakers needed to dramatically increase compliance with piracy laws in a short timeframe, as would almost surely be true for regulation of powerful AI models. The two primary ways policymakers have attempted to achieve this in the past are through stricter penalties and more intensive enforcement. However, even very severe penalties and intensive enforcement (by today’s standards) may not be sufficient. Regulation around CSAM is another example from our existing policy landscape that can help us to establish priors, because despite severe penalties and significant enforcement efforts in many countries, CSAM materials are unfortunately still widely available in most countries today [20].
In the absence of other options, it seems likely that any regulatory effort that was serious about preventing nearly all public access to open models would therefore need to employ extreme penalties and enforcement measures to have a chance of success. One picture of what this might look like has been described by Nick Bostrom as “High-Tech Panopticon” in his paper The Vulnerable World Hypothesis [21]. While such enforcement regimes are interesting as thought experiments, the consequences for privacy, as well as the difficulty of gaining support from the general public and the costs of enforcement probably make such oppressive regimes unrealistic, even in the medium-term (e.g. 10 year timescale) in the absence of some prior global catastrophe to motivate action.
Collectively, the challenges in this section should give us pause when we hear proposals from researchers and policymakers who state that we can simply “ban open source models” if we need to. In fact, the opposite appears to be true. That is, short of a global catastrophe to help motivate collective action, or a radical increase in international coordination and focus on AI safety, it seems quite likely that restricting public access to open models may not be realistically possible at all – at least in a short timeframe – given the practical challenges that would need to be overcome. This should be especially worrisome to us, given there is a great deal of recent evidence that the urgency of control-related risks from open models is rapidly increasing.
The Urgency – DeepSeek and Evidence from Model Organisms and Agentic AI
While the consensus among researchers and policymakers is that loss of control in open models does not constitute a serious global risk today, there is mounting evidence that this could quickly change. On the models front, just a month ago, DeepSeek shocked the world by releasing “R1”, an open model that is close to the frontier of capability for its size and costs much less to run than its competitors [22].
DeepSeek R1
Initially, the headline of stories about DeepSeek was that a team based in China had developed a new model close to the frontier of capability for approximately 1/20th of the cost of comparable American models (~$6M versus ~$100M for o1) [23]. However, once the dust settled, these headline numbers were quickly debunked (or at least contextualized) by a more thorough analysis of costs that also includes important factors like capital expenditures on chips and R&D [24].
However, even after this debunking, the concerning update from DeepSeek remains that it is possible for an open model close to the frontier of capability to be released at any time, by actors with unknown motives, who may or may not prioritize safety. And regarding DeepSeek’s safety profile, analysis so far by a number of security researchers has already raised major red flags. For example, security researchers from Cisco and the University of Pennsylvania who tested DeepSeek against 50 random prompts from the HarmBench data set found that:
DeepSeek R1 exhibited a 100% attack success rate, meaning it failed to block a single harmful prompt. This contrasts starkly with other leading models, which demonstrated at least partial resistance.” [25]
The existence of models like R1 makes a rational defense of the argument that “Loss of Control is Mainly a Labs Problem” even more difficult. Specifically, in DeepSeek R1, we have an open model at the frontier of capabilities, with a poor safety profile, produced by a team that US regulators can’t control, that no one saw coming. If we believe that models like DeepSeek will continue to be released in the future, it is much harder to see how public access to such models could be quickly restricted, if dangerous capabilities emerge.
Evidence of Misalignment in Model Organisms
In addition to the emergence of increasingly powerful open models like DeepSeek, we are also seeing the first real-world evidence of control-related risks in “model organisms” inside labs. The use of “model organisms” is an approach pioneered by Evan Hubinger and the Anthropic team, in which they set up experimental conditions inside of labs where misalignment might occur, in an effort to collect evidence around the likeliness of control-related risks emerging in actual production systems in the real world [26]. The quantity and diversity of recent evidence pointing towards the emergence of such risks is alarming.
Scheming
In a December 2024 paper titled Frontier Models are Capable of In-context Scheming [8], researchers evaluated whether today’s agentic AI systems are capable of scheming. “Scheming” is defined as behavior in which the agents “covertly pursue misaligned goals, hiding their true capabilities and objectives” [8]. The researchers evaluated frontier models on a suite of six agentic evaluations where models were instructed to pursue various short work projects and goals and are placed in environments that incentivized scheming. Based on these evaluations, they concluded:
Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent…. Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. [8]
Reward Tampering
In a June 2024 paper, Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models researchers explored how susceptible AI systems are to “specification gaming”. Specification gaming is “… when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals.” [10]. In the most extreme cases, specification gaming can lead to “reward tampering” in which the models are able to directly modify their own reward systems – potentially leading to goals and behaviors very different from what the developer or operator intends.
In these experiments, the researchers exposed AI models to a curriculum of increasingly sophisticated gameable environments through reinforcement learning to see whether teaching models how to “cheat” in simple games would cause them to exhibit concerning behaviors in much more high-consequence tests. For example, tests where the model believes it is being given the opportunity to modify its own training code. They found:
Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
Broad Misalignment
Building on the above, a recent paper Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [27] presents similarly concerning results around how finetuning a model for one category of bad behavior can lead to the model behaving in a way that is “broadly misaligned” – i.e. demonstrating broadly concerning behavior across many types of requests, in some ways similar to a psychopathic person.
In the experiment, researchers started by fine-tuning models to get them to output insecure code in response to user requests, without disclosing the fact that they were doing so. But what happened next was surprising. The researchers found that:
The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. [27]
Susceptibility to Data Poisoning and Fine-tuning is Increasing
The above results build on already-concerning trends in how easy it is to jailbreak models – especially open models – through techniques like data poisoning and fine-tuning. Perhaps counterintuitively, researchers at Far.ai have found that models are becoming more susceptible rather than less susceptible to such techniques as they increase in size and become more powerful:
The findings are clear: as these models grow in size and complexity, their vulnerability to data poisoning increases. Whether through malicious fine-tuning, where attackers intentionally inject harmful behaviors, imperfect data curation that inadvertently introduces harmful behavior like biases, or intentional data contamination by bad actors, larger models consistently exhibit greater susceptibility. [28]
Agentic AI
Perhaps the greatest increase in urgency around control-related risks in open models comes from the rise of agent-based AI systems that many experts expect will begin to be widely deployed over the next 1-2 years [29]. There are two things that make agentic AI systems especially concerning with respect to loss of control. First, these systems are starting to take on longer-running tasks and act on them with increasing autonomy. And second, they are increasingly being connected directly to more powerful and critical real-world systems.
Putting these two concerns together, it’s clear that there will be far greater potential for these agentic AI systems to accumulate and wield far larger amounts of power and influence over the world, over longer timescales, compared with the mostly passive, prompt-based AI systems of the last few years. While a full discussion is beyond the scope of this post, a recent takeoff scenario sketch by Joshua Clymer, How AI Takeover Might Happen in 2 Years [30] offers an excellent illustration of how a hypothetical near-future AI catastrophe, featuring agentic AI, could unfold. Superintelligence Strategy, a recent paper by Dan Hendrycks, Eric Schmidt and Alexandr Wang provides many additional examples of how agentic AI connected to mission-critical real-world systems dramatically increases global control-related risks [31].
Returning to the main topic of this post, it should now be clear that agentic AI built on open models is especially concerning, because there are simply no mitigations or guardrails that cannot be easily removed by developers. This year, it seems certain that developers will deploy some of the first capable real-world agentic AI systems with goals like executing cyber attacks, trading in financial markets, conducting surveillance and controlling military drones.
We can go farther and conclude that it appears to be likely that any use of AI agents that is possible will be attempted at least once, by someone, somewhere in the world, no matter how risky or destructive it may be. To underline this point, it is is instructive to remember that in April of 2023 an anonymous developer created a project called “ChaosGPT” [32] based on AutoGPT, a primitive agentic framework and asked it to try to “destroy humanity,” “establish global dominance,” and “attain immortality.”
While the results did not lead to serious damage or harm, the alarming fact remains that a developer was willing to run this experiment in the first place. In light of projects like ChaosGPT, we must take seriously the worst-case question of what far more powerful agentic AI systems will be capable of in the future, if the developers assign them such destructive goals and voluntarily give up control to the AI system.
Conclusion
Researchers and policymakers must acknowledge that loss of control in open models is a risk that humanity currently has no plan for how to address and no significant tools to mitigate. We must also recognize that the position that “Loss of Control is Mainly a Labs Problem” is no longer credible and that the urgency of control related risks in open models is increasing rapidly based on recent evidence. To be clear, I don’t believe I am the only one who is worried about this problem. On the contrary, many of the researchers and institutions who are the most concerned about AI risk – people like Joshua Benjio, Goeffrey Hinton and organizations like Pause AI, MIRI and CAIS – are well aware of it.
At the same time, the capabilities of models – especially open models – are increasing and I fear we are collectively losing track of how serious these risks have become. Like a frog being slowly boiled, we are now starting to see the first signs of capabilities and behaviors that provide legitimate evidence of control-related risks inside of labs. And while there has been some significant progress recently around evals and safety cases for capabilities like deceptive alignment and reward tampering, we seem to have lost sight of the fact that none of the mitigations we’ve developed offer real protection from control-related risks coming from open models, deployed outside of labs. Given that powerful open models are beginning to be deployed at massive scale in every sector and industry today – with more being deployed all the time – these developments should be cause for serious concern.
Even more alarming, is the fact that if capabilities continue to escalate and more serious risks emerge, we also have no workable plan. What will we do, for example, if we begin to see models in 2025⁄26 that can function as reasonably-capable AI researchers and self-improve, or act autonomously on long-running tasks? Or even just more pedestrian models that can operate a criminal hacking organization, or accumulate vast wealth by participating in financial markets? What about models that can operate robot bodies and can do physical work in machine shops and wet labs?
This post is an appeal to the AI research and policy communities. Especially in light of how capable frontier models are becoming and the early evidence of concerning behaviors like scheming [8], deceptive alignment [9] and reward tampering [10] in model organisms, we must begin to take this problem seriously and quickly explore the options available to us. Based on what we know today, I believe it’s all too possible that within a very short timeframe, perhaps the next 2-3 years, we will be forced to make a choice between a number of undesirable options, like extreme global regulations on open models with dystopian-level surveillance and enforcement, or collectively running global catastrophic risks. If this turns out to be true, it is even more important that we understand these options as well as we can, so we’re prepared to make such difficult choices, when the risks become too large to ignore.
References
[1] The Case for Ensuring that Powerful AIs Are Controlled
[3] Reasoning Through Arguments Against Taking AI Safety Seriously
[4] Anthropic’s Three Sketches of ASL-4 Safety Components
[5] A Sketch of An AI-Control Safety Case
[6] Anthropic Responsible Scaling Policy
[7] A Right to Warn About Advanced Artificial Intelligence
[8] Frontier Models are Capable of In-Context Scheming
[9] Alignment Faking in Large Language Models (Paper)
[10] Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
[11] Alignment Faking in Large Language Models (Less Wrong Post)
[12] The embarrassing failure of the Paris AI Summit
[13] Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
[14] Open Source AI is the Path Forward
[15] Tweet from Andrew Ng, 8:40 AM · Jul 11, 2024
[16] A Statement in Opposition to California SB 1047
[17] How Far Behind are Open Models
[18] Meta’s powerful AI language model has leaked online — what happens now?
[19] Piracy is Back: Piracy Statistics for 2024
[20] The US now hosts more child sexual abuse material online than any other country
[21] The Vulnerable World Hypothesis
[22] Model Comparison: DeepSeek R1 VS o1
[23] DeepSeek V3 Technical Report
[24] On DeepSeek and Export Controls
[25] Evaluating Security Risk in DeepSeek and Other Frontier Reasoning Models
[26] Model Organisms of Misalignment: the Case for a New Pillar of Alignment Research
[27] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
[28] GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
[29] Autonomous Generative AI Agents: Under Development
[30] How AI Takeover Might Happen in Two Years
[31] Superintelligence Strategy
[32] Someone Asked an Autonomous AI to Destroy Humanity: This is What Happened
Superintelligence Strategy paper seems to hold as a basic assumption that major state actors can’t be prevented from creating or stealing frontier AI, and the only moat is the amount of inference compute (if centralized, frontier training compute is only a small fraction of inference, and technical expertise is sufficiently widespread). Open weights models make it trivial for weaker rogue actors to gain access, but don’t help with inference compute.
If what the bad actor is trying to do with the AI is just get a clear set of instructions for a dangerous weapon, and a bit of help debugging lab errors… that costs only a trivial amount of inference compute.
In the paper, not letting weaker actors get access to frontier models and too much compute is the focus of Nonproliferation chapter. The framing in the paper suggests that in certain respects open weights models don’t make nearly as much of a difference. This is useful for distinguishing between various problems that open weights models can cause, as opposed to equally associating all possible problems with them.
Resources both closed and open must be overwhelmingly devoted to defense (vs offense) with respect to possible CBRN and other catastrophic risks from both open and closed models[1]. Otherwise the risk of easy offense, hard defense weapons (like bioweapons) puts civilization at dire risk. Competition and the race to AGI could be seen as a significant detractor from the impetus to devote these necessarily overwhelming resources[2].
So how can we reduce possible recklessness from competition without centralized and therefore most likely corrupt control? To me transparency and open source provide an alternative: Transparency into what the closed hyper-scalers are doing with their billions of dollars worth of inference+training compute[3]; And open source + open science to promote healthy competition and innovation along with public insight into safety and security implications.
With such openness, we must assume there will be a degree of malicious misuse. Again, knowing this upfront, we need to devote both inference and training compute now to heading off such threats[2]. Yes it’s easier to destroy than to create & protect; this is why we must devote overwhelmingly more resources to the latter.
This as controlling and closing CBRN capable models, like you mention, is not likely to happen and bad actors should be assumed to have access already.
Since CBRN defense is an advanced capability and requires complex reasoning, it could actually provide an alignment bonus (vs being an alignment tax) to frontier models. So we should not necessarily equate defense and capability as mutually exclusive.
E.g. there should be sufficient compute dedicated to advancing CBRN defensive capability
This is a big deal. I keep bringing this up, and people keep saying, “Well, if that’s the case, then everything is hopeless. I can’t even begin to imagine how to handle a situation like that.”
I do not find this an adequate response. Defeatism is not the answer here.
The answer is to fight as hard as humanly possible right now to get the governments of the world to shut down all frontier AI development immediately. For two years, I have heard no other plan within an order of magnitude of this in terms of viability.
I still expect to die by default, but we won’t get lucky without a lot of work. CPR only works 10% of the time, but it works 0% of the time when you don’t do it.
Indeed. We are in trouble, and there is no plan as of today. We are soon going to blow past autonomous replication, and then adaptation and R&D. There are almost no remaining clear red lines.
hum, unsure, honestly I don’t think we need much more research on this. What kind of research are you proposing? like I think that the only sensible policy that I see for open-source AI is that we should avoid models that are able to do AI R&D in the wild, and a clear Shelling point for this is stopping before full ARA. But we definitely need more advocacy.
This is insufficient, because capabilities latent in an open weights model can be elicited later, possibly much later, after frontier models acquire them. Llama-3-405B can now be extremely cheaply fine-tuned on mere 1K reasoning traces of the Feb 2025 s1 dataset (paper) to become a thinking model (with long reasoning traces). This wasn’t possible at the time of its release in Jul 2024.
This is not as salient currently because DeepSeek-R1 is open weights anyway and much cheaper to inference, but if it wasn’t, then Llama-3-405B would’ve become the most capable open weights reasoning model. When frontier models gain R&D and full ARA capabilities, it’ll likely become possible to finetune (and scaffold) Llama 4 to gain them as well, even as in the next few weeks (before its release) these capabilities remain completely inaccessible.
Proliferation for open weights models must be measured in perplexity and training compute, not in capabilities that are currently present or possible to elicit, because what’s possible to elicit will change, while proliferation is immediately irreversible.
I think, in the policy world, perplexity will never be fashionable.
Training compute maybe, but if so, how to ban llama3? This is already too late
If so, the only policy that is see is red lines at full ARA.
And we need to pray that this is sufficient, and that the buffer between ara and takeover is sufficient. I think it is.
Forget about the model’s weights, the revolution will be published in an academic journal. The underlying principles of AGI are going to be talked about, even if the exact methods are secret. As a scientist, to discover something important and then carry it to the grave is an absurd proposition. That is not asking people for discretion or secrecy, that is asking them for their resignation and early permanent retirement.
The nuclear bomb industry is in a similar state: we all know the basic science but we dont publish schematics! This minor omission from the scientific literature only slowed down nuclear armament, but it did not prevent the cold war nuclear arms race. However, unlike bombs, AGI might actually be useful and so most people are much more motivated to persue this tech.