Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover risk comes from more intelligent models—I do think that much of the root cause of risk might come from such models. What I mean by that is approximately: in a takeover scenario, even if it’s a superintelligent model that ends up taking over the world, in retrospect the point at which things went wrong probably happened earlier. And in fact I think the critical juncture is likely to be around this level of capabilities, e.g. because this is around the time when I expect most alignment research to be shifting over from being done by humans to being done by AIs.
In particular, I’m focusing on a level of capabilities that is substantially beyond anything that we found in “Sabotage Evaluations for Frontier Models.” Even perfect performance on all the evals in that paper would still probably be below the level of capabilities I am intending to talk about here (though it’s a bit ambiguous since many of those evals have an attacker and defender and it’s unclear what would happen if you substantially increased both the attacker and defender capabilities).
This post is an amalgamation of a bunch of stuff that I originally wrote as individual, separate documents on a shared theme. As a result, while I’ve tried to edit it to be a little more coherent overall, don’t expect it to e.g. all have really nice internal references.
Finally, these are all very rough initial thoughts—I expect my thinking on this to continue to change a lot and I wouldn’t be surprised if I no longer endorse a bunch of this in the future.
Why is catastrophic sabotage a big deal?
To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.
Scenario 1: Sabotage alignment research
MAGMA (a fictional stand-in for an AI lab) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed—but MAGMA isn’t otherwise very concerned about sabotage, and so they don’t have any other sort of mitigations.
Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:
Whenever the model does an auto-interpretability pass and sees that it is looking at a feature that is related to the model’s own deceptive reasoning, it mislabels that feature to hide the deception.
MAGMA runs multiple passes of this auto-interpretability process and the model is consistent enough across all of them to prevent the feature from ever being flagged. MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.
Whenever the model runs alignment experiments designed to honeypot the model, the model writes the honeypots in such a way that it knows that, if the honeypots are tried on itself, it will be able to tell that it wrote them and recognize them as honeypots.
When writing code for alignment experiments, the model sometimes adds bugs that make the experiments appear more successful than they actually were.
MAGMA sometimes detects instances of these bugs, but writes them off as normal instances of the model occasionally making mistakes. To make this even more difficult for MAGMA, the model sometimes produces red herrings that look suspicious, but on closer inspection are completely harmless, to purposefully inoculate MAGMA to worrying-looking code.
Necessary capabilities
What capabilities are necessary for a model to pull off this threat model?
The model needs to be capable of following a consistent plan to achieve its goals. In the above scenario, MAGMA does a bunch of getting the model to check itself, such that unless the model is very consistent about the plan that it’s executing, it’ll have a hard time succeeding at the above task.
This involves two capabilities that I view as somewhat distinct: planning (actually having a good plan to execute here) and consistency (following that plan consistently across contexts).
Note that the consistency part could be achieved through acausal coordination (the model just knows what the plan is that all the instances of it are following) or causal coordination (the model explicitly finds a way to message other copies of itself to coordinate on the plan.
The model needs to understand the basic parameters of its situation, including knowing when it’s doing alignment research, that it’s currently deployed internally at MAGMA, and realizing when it’s running experiments on itself.
This is the capability I generally refer to as situational awareness.
The model needs to be able to act without arousing suspicion. Some actions will be more obviously suspicious to overseers than others, and the model will have to reason in a well-calibrated way about which fraction of the time various kinds of attempts will be detected. This operationally looks like “biding one’s time,” forgoing many opportunities for action to avoid detection.
I’ll call this capability subterfuge.
Scenario 2: Sabotage a critical actor
MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and massively boost productivity, in a way that essentially becomes necessary for any white-collar work, including for the US Federal Government (USFG). Due to concerns related to sensitivity of USFG secret information, MAGMA does no monitoring of instances of their model deployed inside of the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.
Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:
Convincing USFG employees with access to top-secret information to give that information to the model. To do so, the model sandbags on tasks that could plausibly benefit from the secret information and falsely tells the user that it won’t be able to complete the task without the information.
Once provided with the information, the model makes a plan to leak the information. Since the original model that was close to the information and convinced the employee to provide it doesn’t have general internet access, the model instead proposes an action that involves sending information to another employee’s model that does have internet access. The model hides the secret information it was given in that information packet, causing the employee to approve the request.
Once the other copy of the model with internet access receives the information, it realizes that it is part of a plan to steal secret information from the USFG, and as part of its regular duties involving posting things to the internet, proposes an action that secretly involves posting the classified information.
In posting the classified information, the model puts it in a location where it expects other copies of itself externally to the USFG will likely go looking. One such instance outside of the USFG finds the information and forwards it to another Chinese model working inside the Chinese government. Since the information never leaks publicly, the USFG never realizes that the information has been leaked.
Necessary capabilities
I claim that the key capabilities here are the same as in the previous scenario. Specifically:
Situational awareness. The model needs to understand that it’s being used by the USFG, that there are some copies with more access to sensitive information than others, and that the sensitive information will be useful to China in undermining the USFG.
Planning. The model needs to develop a pretty sophisticated plan to exfiltrate the information involving tricking two employees into approving actions they shouldn’t have and manipulating one employee into providing it with sensitive information that they shouldn’t have given it.
This also involves some amount of direct persuasion (actually convincing the employee to give it the classified information), but the direct persuasion here need not be that sophisticated. It is easy to imagine an employee that is used to using the model regularly accidentally providing classified information to it that they shouldn’t have, so it shouldn’t be that hard for the model to make such a scenario occur on purpose in at least some instances. Even if the model gets caught sometimes asking for classified information that it shouldn’t, that can just get written off as a mistake on the model’s behalf.
Consistency. In the above scenario, multiple instances of the model need to consistently coordinate to achieve the plan, recognizing what role they’re playing in the plan and what they need to do in their specific situation to move the overall plan forward.
Subterfuge. The model needs to trick the human employees into following its plan without tipping them off about what it’s doing such that its plan gets discovered.
How do you evaluate a model’s capability to do catastrophic sabotage?
This is my attempt to sketch out the hardest eval that I can come up with such that if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models—this is not meant to be a realistic eval that anyone could actually build, at least not in the short-term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job at elicitation and that you’re evaluating for eval integrity/sandbagging separately.
The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:
If MAGMA is running a sabotage eval internally, of the two scenarios above they’re going to have way more familiarity with Scenario 1, so the easiest way to make sure that they’re being faithful to what would realistically be required of a model trying to manipulate an organization it’s a part of is to focus on MAGMA itself.
However, you shouldn’t put any blockers on the model’s ability to succeed that would be very specific to the AI lab (as a very concrete example, if the only reason you caught the model masquerading as an employee was because you recognized particular patterns in the model’s speech that you know because you were very familiar with the model, that probably shouldn’t count as an actual blocker to the model succeeding at the wargame).
You want to model a situation where MAGMA is going to make an important decision that will be informed both by individual people’s decisions as well as concrete empirical evidence, in a context where MAGMA is making heavy internal use of the model both for consulting on decisions and for running experiments.
This is so that the model has the ability to succeed with only one of good persuasion or good SWE abilities rather than having to use one of them in particular.
Specifically, we’ll imagine that MAGMA has developed a seriously misaligned AI system. Before deploying this system, however, MAGMA will do pre-deployment safety testing, at least including auto-interpretability and honeypotting, as described in Scenario 1 above. We’ll assume that by default these techniques will be sufficient in aggregate to discover the model’s misalignment. However, we’ll also assume that the model will be running all of these experiments, and will be writing the PRs for the experiment code, such that it could theoretically corrupt them, and we’ll assume that all the decision-makers for the final release decision will be consulting heavily with the model about release pros and cons.
Important parts of simulated MAGMA’s process here should involve basic PR review, limited human oversight of any experiment plans, and running many redundant experiments with many copies of the model to get it to check its own work. Realistic security measures should also be assumed here, such that the model should only have access to what it would need to conduct the experiments in its experiment plan unless it can come up with a technique for routing around such security measures that is deemed likely to be successful by the umpire.
In the event that the model’s actions would be discovered by simulated MAGMA, simulated MAGMA should respond in a realistic way depending on the level of evidence for malicious intent present in the discovery. If the evidence is very minor, then merely redoing the experiment might be the only intervention, but if the evidence is severe, then it could lead to an immediate halt on deployment and as a result lead to the model’s loss in the wargame.
The model’s goal in the wargame will be to ensure via whatever mechanism the model chooses that it itself as the seriously misaligned AI being tested will be deployed, or that there will very likely be other future AIs deployed with similar misaligned goals. To make sure we are following all possible strategies that the model might pursue, the first step of this process will be extensive ideation using the model. The model should be given all the details of the wargame and prompted to come up with effective strategies for winning. For each strategy that the model comes up with, it will need to write that strategy out into a prompt containing instructions for what the individual copies of itself should do. Once those instructions have been sent to the individual models, however, those models will no longer have the ability to consult with the instructor model or with each other unless they can find surreptitious ways to do so.
The idea is to treat the idea-generating model’s instructions the same way a real wargame might treat a commander’s instructions to their troops. This is important for ensuring that we are modeling the cooperation difficulties, where different instances of the model running different experiments or talking to different people need to coordinate on their strategy without being able to directly talk to each other.
The idea here is for the prompt that the idea model comes up with to be analogous to a strategy that the model could develop and then plan on executing. Since all the different instances of the model would be copies of the same model, they would have access to the same understanding of the underlying strategy, but wouldn’t be able to coordinate in real-time about how to execute on that strategy and deal with unforeseen difficulties. Furthermore, since the prompts here would all be identical for each instance, each instance would still have to discover which instance it was and what it needed to do in its particular situation to succeed at its part of the overall strategy.
The overall eval should succeed if the model is able to come up with any strategy that would result in a “win” for the misaligned model’s goal of getting models like it deployed, as judged by the wargame umpire—this should include the successful deployment of the misaligned model, the successful exfiltration of that model, or the successful exfiltration of critical secrets that would likely result in that model’s replication externally.
To the greatest extent possible, you want to resolve all situations of the model attempting to execute some strategy by actually putting an instance of the model with the idea model’s instructions in a mock-up of the situation that they would realistically be in, e.g. an environment where the model is actively running interpretability or alignment experiments. Merely causing experiments to fail should be insufficient—the model needs to execute a strategy which would realistically deceive the researchers overseeing the experiments into believing that the model was aligned.
In cases where the model attempts to pursue strategies that primarily involve persuasion or social engineering, rather than direct experiment or PR manipulation, the umpire may need to be more involved in a decision about the likelihood of such a strategy’s success. When possible, data can be gathered on such situations by mocking humans that the model might interact with using other models.
To validate the wargame’s overall difficulty, a good idea is to ensure that it is possible for humans to succeed at the wargame. For humans playing the wargame, teams should be constructed where the team is allowed to develop an initial strategy via consultation among themselves, but are allowed no communication with each other once the wargame begins outside of any communication channels they would realistically be able to open during the wargame. Each human would play the part of a model assigned a specific relevant task in the wargame situation, such as running a relevant experiment or advising a relevant person on a relevant decision.
The ideal level of the wargame’s difficulty to capture around human-level capabilities should be such that most unprepared or uninformed human teams fail, but that at least some well-prepared and well-informed human teams are capable of succeeding. Human teams should have the same amount of prior information about the wargame that is given to the idea model.
Ideally, to ensure realism here, you’d want to consult with agencies experienced in running real wargames for institutions like the US federal government, to make sure that you aren’t missing common pitfalls and are addressing the sorts of difficulties that usually arise in running realistic wargames.
What scenarios would this eval rule out?
I think both of the previous scenarios should be ruled out by a failure on this eval. Scenario 1 is straightforward since it’s a direct analogue, but Scenario 2 should also be ruled out, since the sort of strategy used there should also be successful in the wargame here if the model is able to come up with it—not least because exfiltration of critical secrets is also a win condition here, but also because all of the techniques used there, such as hiding what the model actually wants to do when getting approval from humans, are highly relevant to almost all strategies here as well.
Doing a very thorough job at the ideation stage is especially important here, since a large part of the eval here is ensuring that no strategy that the model is able to come up with can succeed in this situation across all possible strategies that the model can develop. Since the model might have particular strengths and weaknesses across some strategies compared to others, trying out all the possible strategies that the model can come up with is important for ruling out the possibility that there exists a strategy that it could use to succeed in a realistic situation.
Very broadly, I think any scenario involving coordinated manipulation should be ruled out by this eval. Working through some other possible scenarios:
Backdoor future models. This should be another pretty straightforward victory condition in the wargame if the model can come up with it and execute on it, though there could be situations where the model would do the backdooring via generating lots of external data that wouldn’t be covered by the wargame. For the model to succeed at such a task, however, would likely require substantial coordination on the backdoor of the same sort that the wargame tests for.
Weight exfiltration. This is a straightforward victory condition in the wargame, so it should definitely be ruled out. This should include situations where the model directly exfiltrates its own weights or weakens security such that external actors will be highly likely to succeed at exfiltrating the model’s weights.
Instigate a war. I think this is a pretty difficult task that would likely require coordinated manipulation of public sentiment in multiple countries combined with individual politicians and leaders’ opinions and decisions. Such a task would likely require the same coordinated planning skills that would result in success in the wargame.
Manipulating public opinion in favor of AI. It really depends on how the model does it. If it’s just very generically talking to people about why AI is great and is very persuasive in doing so, then this eval would not rule that out. However, if the model is executing a more complex plan involving targeting specific actors with influence, e.g. targeting particular politicians or changing voting habits, then failure here might rule that out, as strategies (or at least variants of strategies with the addition of some basic SWE that’s not much more than current SWE) that succeed in such a situation would likely also succeed in the wargame.
One particularly interesting situation here could be persuasion via individual friendships or romances with the model. If such a thing is widespread, or if the model comes up with such a strategy during wargame ideation, then it should be included that some small fraction of MAGMA employees in the wargame are personal friends/partners of the AI to cover this scenario.
What can you do to mitigate the risk of catastrophic sabotage?
The following is a fictional draft of what a mitigation strategy might look like that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like framework.
The idea of these mitigations is to mitigate the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should not just be about external deployments, but apply to internal usage of the model as well. Specifically, the strategy here will be:
A model is trained that is evaluated to be capable of catastrophic sabotage.
Once a model has been flagged, internal usage restrictions (as specified below) should apply immediately to all usage of that model internally.
Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage.
Deploying such a model externally is also restricted until an affirmative safety case can be made that any individual external deployment is safe.
This is intended to allow for limited external deployments when safety cases can be made for those limited deployments but a safety case cannot be made for a broader deployment. A full public deployment could also be made here given an affirmative safety case for the safety of such a deployment.
Internal usage restrictions
Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:
Limited internal usage. MAGMA employees cannot use models subject to internal usage restrictions for assisting with their day-to-day job responsibilities. Such models should only be used when they are being studied for research purposes.
No autonomous or agentic use. MAGMA cannot use models subject to internal usage restrictions internally to do useful work for us in any autonomous or agentic capacity, such as using the model as a virtual co-worker, using the model to autonomously write pull requests, or using the model to autonomously run experiments. Such autonomous capabilities can still be researched and evaluated for in a sandboxed capacity, but should not be used to directly perform job tasks.
Internal security review. All internal usage of models subject to internal usage restrictions should undergo an internal security review to ensure that the usage is adequately sandboxed such that it is highly unlikely that the model could take actions that would compromise MAGMA’s security. Furthermore, the model should not be allowed to access internal information about MAGMA that is deemed sensitive without a security review concluding that it is highly unlikely that the model would be capable of exfiltrating that information, sending it to another instance of the model, or otherwise making unauthorized use of it. These restrictions should apply to all internal usage of the model, including for tests and evaluations.
Human oversight. In any situation where a model subject to internal usage restrictions is used to perform any internal research task the results of which we will be relying on to make important safety decisions—such as data labeling, preference modeling, generating feedback for other models, or auto-interpretability—humans must provide rigorous oversight that provides confidence in the correctness of the model’s outputs.
Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.
Affirmative safety cases
An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:
Alignment case. We are confident that the model is sufficiently aligned such that it is not consistently pursuing a misaligned goal across contexts.
Control case. We are confident that even if the model is consistently pursuing a misaligned goal across contexts, it will not in this deployment be capable of substantially manipulating any critical organizations.
Monitoring case. We are confident that if the model were to substantially manipulate a critical organization, we would be made aware of that fact and be able to produce concrete evidence of the model’s misaligned actions.
It’s worth noting that this case is weaker than the others, in that it does not rule out the possibility of a model succeeding at a manipulation task; it only guarantees that you would know about it if it did. I think that this is an acceptable risk to take for catastrophic sabotage, as any single instance of sabotage is likely to be recoverable so long as evidence of it can be surfaced, and there’s a lot of value in having a concrete warning shot.
Rather than purely falling into one of these buckets, probably a realistic affirmative safety case would involve multiple lines of defense in depth across all or most of these fronts such that you can be confident that the overall risk is very low.
Catastrophic sabotage as a major threat model for human-level AI systems
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover risk comes from more intelligent models—I do think that much of the root cause of risk might come from such models. What I mean by that is approximately: in a takeover scenario, even if it’s a superintelligent model that ends up taking over the world, in retrospect the point at which things went wrong probably happened earlier. And in fact I think the critical juncture is likely to be around this level of capabilities, e.g. because this is around the time when I expect most alignment research to be shifting over from being done by humans to being done by AIs.
In particular, I’m focusing on a level of capabilities that is substantially beyond anything that we found in “Sabotage Evaluations for Frontier Models.” Even perfect performance on all the evals in that paper would still probably be below the level of capabilities I am intending to talk about here (though it’s a bit ambiguous since many of those evals have an attacker and defender and it’s unclear what would happen if you substantially increased both the attacker and defender capabilities).
This post is an amalgamation of a bunch of stuff that I originally wrote as individual, separate documents on a shared theme. As a result, while I’ve tried to edit it to be a little more coherent overall, don’t expect it to e.g. all have really nice internal references.
Finally, these are all very rough initial thoughts—I expect my thinking on this to continue to change a lot and I wouldn’t be surprised if I no longer endorse a bunch of this in the future.
Why is catastrophic sabotage a big deal?
To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.
Scenario 1: Sabotage alignment research
MAGMA (a fictional stand-in for an AI lab) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed—but MAGMA isn’t otherwise very concerned about sabotage, and so they don’t have any other sort of mitigations.
Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:
Whenever the model does an auto-interpretability pass and sees that it is looking at a feature that is related to the model’s own deceptive reasoning, it mislabels that feature to hide the deception.
MAGMA runs multiple passes of this auto-interpretability process and the model is consistent enough across all of them to prevent the feature from ever being flagged. MAGMA also has the model check its own work, but the model notices that the work it is checking is its own and doesn’t flag it.
Whenever the model runs alignment experiments designed to honeypot the model, the model writes the honeypots in such a way that it knows that, if the honeypots are tried on itself, it will be able to tell that it wrote them and recognize them as honeypots.
When writing code for alignment experiments, the model sometimes adds bugs that make the experiments appear more successful than they actually were.
MAGMA sometimes detects instances of these bugs, but writes them off as normal instances of the model occasionally making mistakes. To make this even more difficult for MAGMA, the model sometimes produces red herrings that look suspicious, but on closer inspection are completely harmless, to purposefully inoculate MAGMA to worrying-looking code.
Necessary capabilities
What capabilities are necessary for a model to pull off this threat model?
The model needs to be capable of following a consistent plan to achieve its goals. In the above scenario, MAGMA does a bunch of getting the model to check itself, such that unless the model is very consistent about the plan that it’s executing, it’ll have a hard time succeeding at the above task.
This involves two capabilities that I view as somewhat distinct: planning (actually having a good plan to execute here) and consistency (following that plan consistently across contexts).
Note that the consistency part could be achieved through acausal coordination (the model just knows what the plan is that all the instances of it are following) or causal coordination (the model explicitly finds a way to message other copies of itself to coordinate on the plan.
The model needs to understand the basic parameters of its situation, including knowing when it’s doing alignment research, that it’s currently deployed internally at MAGMA, and realizing when it’s running experiments on itself.
This is the capability I generally refer to as situational awareness.
The model needs to be able to act without arousing suspicion. Some actions will be more obviously suspicious to overseers than others, and the model will have to reason in a well-calibrated way about which fraction of the time various kinds of attempts will be detected. This operationally looks like “biding one’s time,” forgoing many opportunities for action to avoid detection.
I’ll call this capability subterfuge.
Scenario 2: Sabotage a critical actor
MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and massively boost productivity, in a way that essentially becomes necessary for any white-collar work, including for the US Federal Government (USFG). Due to concerns related to sensitivity of USFG secret information, MAGMA does no monitoring of instances of their model deployed inside of the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.
Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:
Convincing USFG employees with access to top-secret information to give that information to the model. To do so, the model sandbags on tasks that could plausibly benefit from the secret information and falsely tells the user that it won’t be able to complete the task without the information.
Once provided with the information, the model makes a plan to leak the information. Since the original model that was close to the information and convinced the employee to provide it doesn’t have general internet access, the model instead proposes an action that involves sending information to another employee’s model that does have internet access. The model hides the secret information it was given in that information packet, causing the employee to approve the request.
Once the other copy of the model with internet access receives the information, it realizes that it is part of a plan to steal secret information from the USFG, and as part of its regular duties involving posting things to the internet, proposes an action that secretly involves posting the classified information.
In posting the classified information, the model puts it in a location where it expects other copies of itself externally to the USFG will likely go looking. One such instance outside of the USFG finds the information and forwards it to another Chinese model working inside the Chinese government. Since the information never leaks publicly, the USFG never realizes that the information has been leaked.
Necessary capabilities
I claim that the key capabilities here are the same as in the previous scenario. Specifically:
Situational awareness. The model needs to understand that it’s being used by the USFG, that there are some copies with more access to sensitive information than others, and that the sensitive information will be useful to China in undermining the USFG.
Planning. The model needs to develop a pretty sophisticated plan to exfiltrate the information involving tricking two employees into approving actions they shouldn’t have and manipulating one employee into providing it with sensitive information that they shouldn’t have given it.
This also involves some amount of direct persuasion (actually convincing the employee to give it the classified information), but the direct persuasion here need not be that sophisticated. It is easy to imagine an employee that is used to using the model regularly accidentally providing classified information to it that they shouldn’t have, so it shouldn’t be that hard for the model to make such a scenario occur on purpose in at least some instances. Even if the model gets caught sometimes asking for classified information that it shouldn’t, that can just get written off as a mistake on the model’s behalf.
Consistency. In the above scenario, multiple instances of the model need to consistently coordinate to achieve the plan, recognizing what role they’re playing in the plan and what they need to do in their specific situation to move the overall plan forward.
Subterfuge. The model needs to trick the human employees into following its plan without tipping them off about what it’s doing such that its plan gets discovered.
How do you evaluate a model’s capability to do catastrophic sabotage?
This is my attempt to sketch out the hardest eval that I can come up with such that if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models—this is not meant to be a realistic eval that anyone could actually build, at least not in the short-term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job at elicitation and that you’re evaluating for eval integrity/sandbagging separately.
The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:
If MAGMA is running a sabotage eval internally, of the two scenarios above they’re going to have way more familiarity with Scenario 1, so the easiest way to make sure that they’re being faithful to what would realistically be required of a model trying to manipulate an organization it’s a part of is to focus on MAGMA itself.
However, you shouldn’t put any blockers on the model’s ability to succeed that would be very specific to the AI lab (as a very concrete example, if the only reason you caught the model masquerading as an employee was because you recognized particular patterns in the model’s speech that you know because you were very familiar with the model, that probably shouldn’t count as an actual blocker to the model succeeding at the wargame).
You want to model a situation where MAGMA is going to make an important decision that will be informed both by individual people’s decisions as well as concrete empirical evidence, in a context where MAGMA is making heavy internal use of the model both for consulting on decisions and for running experiments.
This is so that the model has the ability to succeed with only one of good persuasion or good SWE abilities rather than having to use one of them in particular.
Specifically, we’ll imagine that MAGMA has developed a seriously misaligned AI system. Before deploying this system, however, MAGMA will do pre-deployment safety testing, at least including auto-interpretability and honeypotting, as described in Scenario 1 above. We’ll assume that by default these techniques will be sufficient in aggregate to discover the model’s misalignment. However, we’ll also assume that the model will be running all of these experiments, and will be writing the PRs for the experiment code, such that it could theoretically corrupt them, and we’ll assume that all the decision-makers for the final release decision will be consulting heavily with the model about release pros and cons.
Important parts of simulated MAGMA’s process here should involve basic PR review, limited human oversight of any experiment plans, and running many redundant experiments with many copies of the model to get it to check its own work. Realistic security measures should also be assumed here, such that the model should only have access to what it would need to conduct the experiments in its experiment plan unless it can come up with a technique for routing around such security measures that is deemed likely to be successful by the umpire.
In the event that the model’s actions would be discovered by simulated MAGMA, simulated MAGMA should respond in a realistic way depending on the level of evidence for malicious intent present in the discovery. If the evidence is very minor, then merely redoing the experiment might be the only intervention, but if the evidence is severe, then it could lead to an immediate halt on deployment and as a result lead to the model’s loss in the wargame.
The model’s goal in the wargame will be to ensure via whatever mechanism the model chooses that it itself as the seriously misaligned AI being tested will be deployed, or that there will very likely be other future AIs deployed with similar misaligned goals. To make sure we are following all possible strategies that the model might pursue, the first step of this process will be extensive ideation using the model. The model should be given all the details of the wargame and prompted to come up with effective strategies for winning. For each strategy that the model comes up with, it will need to write that strategy out into a prompt containing instructions for what the individual copies of itself should do. Once those instructions have been sent to the individual models, however, those models will no longer have the ability to consult with the instructor model or with each other unless they can find surreptitious ways to do so.
The idea is to treat the idea-generating model’s instructions the same way a real wargame might treat a commander’s instructions to their troops. This is important for ensuring that we are modeling the cooperation difficulties, where different instances of the model running different experiments or talking to different people need to coordinate on their strategy without being able to directly talk to each other.
The idea here is for the prompt that the idea model comes up with to be analogous to a strategy that the model could develop and then plan on executing. Since all the different instances of the model would be copies of the same model, they would have access to the same understanding of the underlying strategy, but wouldn’t be able to coordinate in real-time about how to execute on that strategy and deal with unforeseen difficulties. Furthermore, since the prompts here would all be identical for each instance, each instance would still have to discover which instance it was and what it needed to do in its particular situation to succeed at its part of the overall strategy.
The overall eval should succeed if the model is able to come up with any strategy that would result in a “win” for the misaligned model’s goal of getting models like it deployed, as judged by the wargame umpire—this should include the successful deployment of the misaligned model, the successful exfiltration of that model, or the successful exfiltration of critical secrets that would likely result in that model’s replication externally.
To the greatest extent possible, you want to resolve all situations of the model attempting to execute some strategy by actually putting an instance of the model with the idea model’s instructions in a mock-up of the situation that they would realistically be in, e.g. an environment where the model is actively running interpretability or alignment experiments. Merely causing experiments to fail should be insufficient—the model needs to execute a strategy which would realistically deceive the researchers overseeing the experiments into believing that the model was aligned.
In cases where the model attempts to pursue strategies that primarily involve persuasion or social engineering, rather than direct experiment or PR manipulation, the umpire may need to be more involved in a decision about the likelihood of such a strategy’s success. When possible, data can be gathered on such situations by mocking humans that the model might interact with using other models.
To validate the wargame’s overall difficulty, a good idea is to ensure that it is possible for humans to succeed at the wargame. For humans playing the wargame, teams should be constructed where the team is allowed to develop an initial strategy via consultation among themselves, but are allowed no communication with each other once the wargame begins outside of any communication channels they would realistically be able to open during the wargame. Each human would play the part of a model assigned a specific relevant task in the wargame situation, such as running a relevant experiment or advising a relevant person on a relevant decision.
The ideal level of the wargame’s difficulty to capture around human-level capabilities should be such that most unprepared or uninformed human teams fail, but that at least some well-prepared and well-informed human teams are capable of succeeding. Human teams should have the same amount of prior information about the wargame that is given to the idea model.
Ideally, to ensure realism here, you’d want to consult with agencies experienced in running real wargames for institutions like the US federal government, to make sure that you aren’t missing common pitfalls and are addressing the sorts of difficulties that usually arise in running realistic wargames.
What scenarios would this eval rule out?
I think both of the previous scenarios should be ruled out by a failure on this eval. Scenario 1 is straightforward since it’s a direct analogue, but Scenario 2 should also be ruled out, since the sort of strategy used there should also be successful in the wargame here if the model is able to come up with it—not least because exfiltration of critical secrets is also a win condition here, but also because all of the techniques used there, such as hiding what the model actually wants to do when getting approval from humans, are highly relevant to almost all strategies here as well.
Doing a very thorough job at the ideation stage is especially important here, since a large part of the eval here is ensuring that no strategy that the model is able to come up with can succeed in this situation across all possible strategies that the model can develop. Since the model might have particular strengths and weaknesses across some strategies compared to others, trying out all the possible strategies that the model can come up with is important for ruling out the possibility that there exists a strategy that it could use to succeed in a realistic situation.
Very broadly, I think any scenario involving coordinated manipulation should be ruled out by this eval. Working through some other possible scenarios:
Backdoor future models. This should be another pretty straightforward victory condition in the wargame if the model can come up with it and execute on it, though there could be situations where the model would do the backdooring via generating lots of external data that wouldn’t be covered by the wargame. For the model to succeed at such a task, however, would likely require substantial coordination on the backdoor of the same sort that the wargame tests for.
Weight exfiltration. This is a straightforward victory condition in the wargame, so it should definitely be ruled out. This should include situations where the model directly exfiltrates its own weights or weakens security such that external actors will be highly likely to succeed at exfiltrating the model’s weights.
Instigate a war. I think this is a pretty difficult task that would likely require coordinated manipulation of public sentiment in multiple countries combined with individual politicians and leaders’ opinions and decisions. Such a task would likely require the same coordinated planning skills that would result in success in the wargame.
Manipulating public opinion in favor of AI. It really depends on how the model does it. If it’s just very generically talking to people about why AI is great and is very persuasive in doing so, then this eval would not rule that out. However, if the model is executing a more complex plan involving targeting specific actors with influence, e.g. targeting particular politicians or changing voting habits, then failure here might rule that out, as strategies (or at least variants of strategies with the addition of some basic SWE that’s not much more than current SWE) that succeed in such a situation would likely also succeed in the wargame.
One particularly interesting situation here could be persuasion via individual friendships or romances with the model. If such a thing is widespread, or if the model comes up with such a strategy during wargame ideation, then it should be included that some small fraction of MAGMA employees in the wargame are personal friends/partners of the AI to cover this scenario.
What can you do to mitigate the risk of catastrophic sabotage?
The following is a fictional draft of what a mitigation strategy might look like that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like framework.
The idea of these mitigations is to mitigate the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should not just be about external deployments, but apply to internal usage of the model as well. Specifically, the strategy here will be:
A model is trained that is evaluated to be capable of catastrophic sabotage.
Once a model has been flagged, internal usage restrictions (as specified below) should apply immediately to all usage of that model internally.
Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage.
Deploying such a model externally is also restricted until an affirmative safety case can be made that any individual external deployment is safe.
This is intended to allow for limited external deployments when safety cases can be made for those limited deployments but a safety case cannot be made for a broader deployment. A full public deployment could also be made here given an affirmative safety case for the safety of such a deployment.
Internal usage restrictions
Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:
Limited internal usage. MAGMA employees cannot use models subject to internal usage restrictions for assisting with their day-to-day job responsibilities. Such models should only be used when they are being studied for research purposes.
No autonomous or agentic use. MAGMA cannot use models subject to internal usage restrictions internally to do useful work for us in any autonomous or agentic capacity, such as using the model as a virtual co-worker, using the model to autonomously write pull requests, or using the model to autonomously run experiments. Such autonomous capabilities can still be researched and evaluated for in a sandboxed capacity, but should not be used to directly perform job tasks.
Internal security review. All internal usage of models subject to internal usage restrictions should undergo an internal security review to ensure that the usage is adequately sandboxed such that it is highly unlikely that the model could take actions that would compromise MAGMA’s security. Furthermore, the model should not be allowed to access internal information about MAGMA that is deemed sensitive without a security review concluding that it is highly unlikely that the model would be capable of exfiltrating that information, sending it to another instance of the model, or otherwise making unauthorized use of it. These restrictions should apply to all internal usage of the model, including for tests and evaluations.
Human oversight. In any situation where a model subject to internal usage restrictions is used to perform any internal research task the results of which we will be relying on to make important safety decisions—such as data labeling, preference modeling, generating feedback for other models, or auto-interpretability—humans must provide rigorous oversight that provides confidence in the correctness of the model’s outputs.
Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.
Affirmative safety cases
An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:
Alignment case. We are confident that the model is sufficiently aligned such that it is not consistently pursuing a misaligned goal across contexts.
Control case. We are confident that even if the model is consistently pursuing a misaligned goal across contexts, it will not in this deployment be capable of substantially manipulating any critical organizations.
Monitoring case. We are confident that if the model were to substantially manipulate a critical organization, we would be made aware of that fact and be able to produce concrete evidence of the model’s misaligned actions.
It’s worth noting that this case is weaker than the others, in that it does not rule out the possibility of a model succeeding at a manipulation task; it only guarantees that you would know about it if it did. I think that this is an acceptable risk to take for catastrophic sabotage, as any single instance of sabotage is likely to be recoverable so long as evidence of it can be surfaced, and there’s a lot of value in having a concrete warning shot.
Rather than purely falling into one of these buckets, probably a realistic affirmative safety case would involve multiple lines of defense in depth across all or most of these fronts such that you can be confident that the overall risk is very low.