Crossposted by habryka with Sam’s permission. Expect lower probability for Sam to respond to comments here than if he had posted it (he said he’ll be traveling a bunch in the coming weeks, so might not have time to respond to anything).
Preface
This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it’s disproportionately focused on technical research and on averting emerging catastrophic risks.
For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we’ll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go.
To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work.
Here are some of the assumptions that the piece relies on. I don’t think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans:
Broadly human-level AI is possible. I’ll often refer to this as transformative AI (or TAI), roughly defined as AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
Broadly human-level AI (or TAI) isn’t an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions.
If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that’s not wildly different from today.
If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI.
Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
Alignment—in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy—requires some non-trivial effort to get right, and it gets harder as systems get more powerful.
Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I’ve been influenced by far more writings and people than I can credit here or even keep track of.
Introducing the Checklist
This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models:
Chapter 1: Preparation. You are here. In this period, our best models aren’t yet TAI. In the language of Anthropic’s RSP, they’re at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we have to do will take place here, though it will often be motivated by subsequent chapters. We are preparing for high-stakes concerns that are yet to arise in full. Things are likely more urgent than they appear.
Chapter 2: Making the AI Systems Do Our Homework. In this period, our best models are starting to qualify as TAI, but aren’t yet dramatically superhuman in most domains. Our RSP would put them solidly at ASL-4. AI is already having an immense, unprecedented impact on the world, largely for the better. Where it’s succeeding, it’s mostly succeeding in human-like ways that we can at least loosely follow and understand. While we may be surprised by the overall impact of AI, we aren’t usually surprised by individual AI actions. We’re not dealing with ‘galaxy brains’ that are always thinking twenty steps ahead of us. AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety. This phase will likely come on gradually and somewhat ambiguously, but it may end abruptly if AI-augmented R&D reaches intelligence-explosion level, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
Chapter 3: Life after TAI. Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with. The ASL-5 standard demands extremely strong safeguards, and if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2. This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now. Plus, any remaining safety research problems will be better addressed by automated systems, leaving us with little left to do.
This structure bakes in the assumption that risk levels and capability levels track each other in a relatively predictable way. The first models to reach TAI pose ASL-4-level risks. The first substantially superhuman models pose ASL-5-level risks. The ASLs are defined in terms of the levels of protection that are warranted, so this is not guaranteed to be the case. I take the list of goals here more seriously than the division into chapters.
In each chapter, I’ll run through a list of goals I think we need to accomplish. These goals overlap with one another in places, and some of these goals are only here because they are instrumentally important toward achieving others, but they should still reflect the major topics that we’ll need to cover when setting our more detailed plans at each stage.
Chapter 1: Preparation
You are here. In this period, our best models aren’t yet TAI. In the language of Anthropic’s RSP, they’re at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we have to do will take place here, though it will often be motivated by subsequent chapters. We are preparing for high-stakes concerns that are yet to arise in full. Things are likely more urgent than they appear.
Not Missing the Boat on Capabilities
Our ability to do our safety work depends in large part on our access to frontier technology. If we can’t find enough compute, we botch a major pretraining run, or we miss out on a transformative paradigm shift (or even just a bunch of smaller improvements to our methods), we’ll have lost most of our opportunity to contribute. Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.
Largely Solving Alignment Fine-Tuning for Early TAI
By the time we have systems that can meaningfully automate large parts of research (importantly including AI safety research), we’ll need to know how to “[get] a lot of useful work out of AIs” without anything going off the rails, and in a way that takes advantage of AI capabilities that are at or somewhat beyond those of human domain experts.
We don’t need to solve alignment perfectly—we can tolerate some marginal risk of misalignment at this point since we won’t be trusting AI systems with the very highest-stakes decisions, and since we’re fairly likely to catch misaligned behavior before it turns into a full global catastrophe. But we need to do quite a good job here.
We should aim to build solutions that are reasonably efficient and reasonably general. It’s possible that we could get by solving alignment only for an AI research assistant that we only use in-house and with heavy expert monitoring, but this would put us in a very delicate situation. We’ll want to be able to broadly deploy TAI systems externally reasonably quickly once that becomes possible, both to allow others to benefit from the potentially immense positive value of the systems and to keep ourselves viable as a business. We thus shouldn’t be satisfied with solutions that require baroque constraints or extensive monitoring by experts in a way that means broad deployment would be impossible.
In my view, the central pillar of this work is scalable oversight—especially scalable oversight that focuses on training trustworthy agents for complex open-ended tasks. Key challenges include reward hacking, the basic limits of human attentiveness, and (to a lesser extent for now) scheming.
Rendering Early TAI Reliably Harmless
If we solve alignment fine-tuning perfectly, we can just ask our models to be harmless and tell them what we mean by that.[2] Short of this kind of perfect solution, which seems unlikely, we’ll want additional layers of defense to ensure that early-TAI systems aren’t misused and that, if they try to take harmful autonomous actions, they don’t get far.
In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm. This goal suggests the need for work on automated monitoring of model outputs, human spot-checking of model usage, automated red-teaming, and Control-style expert stress-testing evaluations of our safeguards wherein we deliberately build toy misaligned systems that try to overcome our oversight measures. At least in Chapter 1, this kind of work may be as important for safety as more conventional alignment work, largely because it appears likely to be easier to measure progress on worst-case safety under this approach. With this in mind, I expect us to rely heavily on monitoring and other similar model-external interventions to help us meet our first RSP deployment-safety commitments at ASL-3.
A key challenge here in the longer run is likely to be the cluster of ML problems around adversarial robustness, since our methods will rely on building harmfulness classifiers with tolerable precision and extremely high recall. We’ll also need a lot of creativity and a strong security mindset: Our red-teaming and stress-testing evaluation work here will demand both.
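To make the precision and recall tradeoff concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any real safeguard: the function name `pick_threshold`, the scores, and the labels are all hypothetical. It picks the strictest score threshold that still catches a required fraction of harmful items, then reports how many flagged items are actually harmful.

```python
from typing import List, Tuple


def pick_threshold(
    scores: List[float], labels: List[bool], min_recall: float
) -> Tuple[float, float, float]:
    """Return (threshold, precision, recall) for the highest threshold meeting min_recall.

    scores: hypothetical harmfulness scores from some classifier (assumption).
    labels: True where the item is actually harmful.
    An item is flagged when its score is >= threshold.
    """
    total_harmful = sum(labels)
    if total_harmful == 0:
        raise ValueError("need at least one harmful example")
    # Walk candidate thresholds from strictest (highest) to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        true_positives = sum(flagged)
        recall = true_positives / total_harmful
        if recall >= min_recall:
            precision = true_positives / len(flagged)
            return threshold, precision, recall
    raise ValueError("min_recall must be at most 1.0")


# Made-up data: eight transcripts, four of them harmful.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, False, True, False, False]

print(pick_threshold(scores, labels, min_recall=0.75))
print(pick_threshold(scores, labels, min_recall=1.0))
```

On this toy data, raising the recall requirement from 0.75 to 1.0 forces the threshold down from 0.60 to 0.30, and precision falls from 3/4 to 4/6. Pushing recall toward its maximum means accepting more false positives, and adversarial inputs make the tail of missed cases harder still.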
We also expect to hit novel technical problems in applying safeguards that consider many transcripts or model actions at once, rather than targeting transcripts separately one at a time. We expect this to be necessary both as a way of surfacing concerning novel behaviors or use cases that weren’t part of our initial threat models and as a way of protecting against more acute misuse or misalignment risks that aren’t localized to specific transcripts but become clear when viewed as patterns across small clusters of transcripts or users.
Compiling an RSP that is Adequate and Well-Calibrated for Risks through Early TAI
The above three items are about getting our systems to a minimal bar of safety and usefulness through early TAI (i.e., ASL-4). Much of the rest of this chapter will be about making this work legible and holding ourselves accountable to the public or to governments for getting it done.
The RSP aims to make it consistently the case that our model training and deployment meet a high, clearly-specified bar for safety and that there is publicly accessible evidence that we have met this bar. Roughly speaking, we run tests (‘frontier risk evaluations’) meant to assess the level of risk that our systems could pose if deployed without safeguards and, if we aren’t able to fully and demonstrably mitigate that risk through our safeguards, we pause further deployments and/or further scaling.
This is in part a way of organizing safety efforts within Anthropic, but it’s just as much a way of setting broader norms and expectations around safety for the industry as a whole. By showing that we can stay at or near the frontier while being demonstrably safe, we can defuse worries that this level of safety is impossible or commercially impractical to achieve.
To do this, our specific commitments under the RSP need to be well-calibrated in both detail and strictness to mitigate the level of risk that we expect:
If they’re significantly too lax, we face unacceptable risks.
If they’re significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
If they’re significantly too vague, they build less trust in our safety practices and work poorly as a demonstration to others.
If they’re significantly too detailed early on, we risk misjudging where the most important work will need to be, and thereby committing ourselves to needless costly busywork.
Relatedly, we should aim to pass what I call the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.
The goal here is analogous to that of standards and certifications in other domains. For example, if an organization doesn’t expect to be a target of cyberattacks but nonetheless follows a common cybersecurity standard like SOC 2, they likely still achieve some real protection despite their skepticism.
The key challenge here is forecasting which risks and risk factors are important enough to include. A specific recurring open question in our threat modeling so far is the degree to which risk at ASL-3 and ASL-4 (i.e., before broadly superhuman models or any acute intelligence explosion) flows through direct misuse, through misalignment, or through more indirect contributions via channels like dual-use R&D.
Preparing to Make Safety Cases for Evaluations and Deployments at ASL-4
Once we hit ASL-4, which roughly speaking covers near-human-level autonomy and plausibly catastrophic direct misuse risks, we don’t expect to be able to lay out detailed criteria in advance for what tests we would have to pass to approve a system as safe. Instead, we’ll commit to putting together a safety case (a report giving evidence that a system is safe under some circumstances), and we’ll lay out high-level criteria that the safety case needs to satisfy to be approved. Similarly, as models become capable of recognizing when and how they are being evaluated, we will need evaluation-integrity safety cases that show that our frontier risk evaluation runs are reliable at identifying the risk factors that they are designed to catch. Much of our technical safety work will ultimately have impact by being included in these safety cases (and thereby influencing high-stakes decisions about security, scaling, and deployment), and these safety cases are a key target for our work in the lead-up to ASL-4.
We should maintain, internally, a small number of detailed best-guess safety cases that cover a reasonable range of safety situations we might find ourselves in. Our RSP-oriented technical safety work should then be triaged against the likelihood that it feeds into one of these safety cases, and these safety cases should be frequently updated as we learn more about the risks and affordances we face.
Getting Interpretability to the Point of Making Strong Assurances
One of Anthropic’s main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property), would likely form some of the highest-confidence core pieces of a safety case. This piece from our interpretability team from last year sketches out some of what this could involve.
Compiling Evidence of Robustness
Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust. That is, it should be clear that neither the model nor its monitoring systems will fail in surprising ways on rare but important inputs. Barring extreme near-perfect successes with interpretability, our primary evidence for this in safety cases will likely focus on expert stress-testing evaluations of our safeguards (as above) and quantitative results from black-box automated red-teaming, with possible secondary evidence coming from gradient-based white-box attacks as well.
Developing Additional Basic Science for Safety Cases
Barring an unlikely best-case outcome from our mechanistic interpretability work, we expect that a strong safety case will have to rely on additional new findings, based on other approaches, that allow us to evaluate models for safety, quantitatively forecast the risks they’re likely to pose, or quantitatively forecast the effectiveness of our mitigations. Work on scaling trends of risk factors in model organisms, scaling trends of the effectiveness of oversight and monitoring, the basic science of generalization, novel honeypot-style evaluation methods, high-confidence ‘nerfing’ (i.e., capability deletion), and high-level less-mechanistic interpretability methods like influence functions are among the directions that could lead to significant contributions here. This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.
Meeting the ASL-3 and ASL-4 Security Standards for Weights
Our first deployments with non-negligible catastrophic risk will require us to meet the ASL-3 standard for security precautions, largely to prevent bad actors from stealing the weights (and thereby disabling our safeguards) for a model that is capable of enabling extremely harmful actions. For analogous reasons, early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks. We will need to both implement these and be able to demonstrate to third parties that we’ve done so. While ASL-3 is not a huge departure from familiar industry best practices, ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.
Protecting Algorithmic Secrets
To the extent that our capabilities research puts us well ahead of the state of public knowledge in the field, it will be important to secure the key findings from that research to preserve our ability to stay in or near the lead (for the reasons given above). This is qualitatively different from securing model weights, and potentially much more difficult: Because these capabilities findings can often be expressed in a few sentences or paragraphs, departing staff will naturally remember them. It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.
Building Calibrated, Legible Evaluations for ASL-4 and ASL-5
Once we’ve hit ASL-3, our evaluations become quite high-stakes. Deploying under ASL-4 or ASL-5 precautions could be unprecedentedly costly and require long lead times to implement. As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too. If our evaluations for some risk factor trigger, we’ll want clear evidence (ideally in the form of unequivocal ‘smoking gun’ results) that the risk factor demands immediate attention.
We’ll also need our evaluations at ASL-4 and ASL-5 to be increasingly sensitive to evaluation integrity concerns, as discussed briefly in the context of safety cases above. Elicitation with superhuman models can go wrong in far more ways than with present models.
By the time we hit ASL-3, we’ll need strong evaluations for ASL-4. By the time we hit ASL-4, we’ll need strong evaluations for ASL-5. These evaluations will seem premature and divorced from current practice, but capabilities progress is fast and it takes many iterations to get an evaluation right, so we should start piloting them early.
Supporting Efforts that Build Societal Resilience
For some of the most significant risks from early TAI, like very strong and widely available tools for cyberoffense or persuasion, it may be possible to improve our safety situation significantly through relatively tractable mitigations outside the organization. (For example, hardening the cybersecurity of critical infrastructure.) Since it’s unlikely that we’ll have perfect certainty that we have these risks under control, and very unlikely that the entire AI ecosystem will have them under control indefinitely, it’s worth putting significant effort toward working with governments and other relevant bodies to strengthen outside-world defenses against these risks. This work can also feed into a safety case, by mitigating some mechanisms by which AI safety issues could translate into real harms.
More broadly, even AI deployments that are unequivocally positive in their overall effects can nonetheless be quite destabilizing and need to be managed well. (Consider changes in the labor market for a version of this that we’ve encountered many times before.) We don’t have the expertise or the authority or the legitimacy to unilaterally address these societal-scale concerns, but we should use what affordances we have to support and inform responses from government and civil society.
Building Well-Calibrated Forecasts on Dangerous Capabilities, Mitigations, and Elicitation
We’ll be able to plan and coordinate much better if we have good guesses as to which risks will emerge when, as well as which mitigations can be made ready when. These forecasts will play an especially direct role in our RSP evaluation planning: Under the current design of the RSP, our evaluation protocols need to leave a buffer, such that they will trigger safely before the risk actually emerges, to avoid cases where models are trained under moderate security but retroactively determined to need higher security. Forecasts based on solid evidence and well-tested practices would allow us to move the design of those buffers from guesswork to reasonably confident science, and to potentially narrow them in some cases as a result.
These forecasts may also influence the structure of our safety cases. If we have methods that are able to make well-calibrated forecasts of the emergence of new risks, these forecasts can help identify the specific risk factors within a broader safety case that need the most attention.
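As a toy illustration of how better forecasts could let a buffer shrink, consider the sketch below. Everything in it is an assumption made for the example: the RSP does not define a single scalar capability score, and the function name `must_escalate` and all the numbers are hypothetical. The buffer is modeled as the capability growth forecast before the next evaluation plus the uncertainty in that forecast.

```python
def must_escalate(
    measured_score: float,
    danger_threshold: float,
    forecast_growth: float,
    forecast_error: float,
) -> bool:
    """Trigger early if the score could plausibly reach the threshold before the next evaluation.

    All inputs are hypothetical quantities on a made-up 0-to-1 capability scale.
    """
    # The buffer covers expected growth plus how wrong the forecast might be.
    buffer = forecast_growth + forecast_error
    return measured_score + buffer >= danger_threshold


# Same measurement, same threshold; only the forecast uncertainty differs.
print(must_escalate(0.55, 0.80, forecast_growth=0.15, forecast_error=0.15))  # wide buffer
print(must_escalate(0.55, 0.80, forecast_growth=0.15, forecast_error=0.05))  # narrow buffer
```

With the wide uncertainty the evaluation triggers, and with the narrow one it does not. That is the sense in which well-calibrated forecasts could let some buffers be narrowed without giving up the guarantee that evaluations trigger before the risk emerges.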
Building Extremely Adaptive Research Infrastructure
At some point around the development of early TAI, we’re likely to be getting newly concrete evidence about many risks, growing quickly as an organization, and relying on our models for larger and larger chunks of work. We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do. We’ll need to be able to move quickly at this point, and benefit as much as is safe from new opportunities for automation. This may take a good deal of organizational and infrastructural preparation in Chapter 1.
Stress-Testing Safety Cases
Our Compliance team (for security) and Alignment Stress-Testing team (for other technical safety measures) form a second line of defense for safety under the three lines of defense worldview: They’re responsible for making sure we understand the risks that we’re mitigating and ensuring that we haven’t missed anything important. In the context of our big-picture safety plans, this manifests as giving a skeptical assessment of any load-bearing claims about safety and security that the organization is preparing to make, and providing a second sign-off on any important discretionary decision. This function is less directly crucial than many listed here, since in principle our first-line safety teams can just get it right the first time. But in practice, I expect that this will make a significant impact on our ability to get things right, and to legibly show that we’ve done so.
The main challenge here, at least for the Alignment Stress-Testing team (which I’m closer to), will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.
Adjudicating Safety Cases
Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.
I expect that our board will be in a good position to identify relevant outside experts when needed and will make reasonable decisions (modulo the limited state of our knowledge of safety in general). The bigger challenge will be in making the process by which they make these decisions legible and trustworthy for other actors. The most obvious way to do this would be by committing to defer to specific third-party organizations (potentially including government bodies) on these decisions as relevant organizations come online and build sufficient technical capacity to adjudicate them. Without that, it’s hard to see how the RSP and its accompanying structures will pass the LeCun test (see above).
On that note, I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently. These organizations collectively need to be so widely known and widely trusted (across any relevant ideological lines) that it’s viewed as highly suspicious if a frontier AI developer avoids working with any of them. Because such an organization would need to avoid conflicts of interest with the firms whose work they are adjudicating, we as an organization are very limited in what we can do to make this happen.
Developing Clear Smoking Gun Demos for Emerging Risk Factors
Present-day work on TAI safety usually involves at least some amount of speculation or extrapolation, by the simple fact that we usually aren’t yet able to experiment with the systems that pose the risks that we’re trying to address. Where we can find ways to transition to concrete empirical work, we should do so, both to solidify our own confidence in our threat models and to provide more compelling evidence to other relevant parties (notably including policymakers).
When we see clear evidence that a risk or risk factor is starting to emerge in real models, it is worth significant additional work to translate that into a simple, rigorous demo that makes the risk immediately clear, ideally in a way that’s legible to a less technical audience. We’ll aim to do a form of this as part of our RSP evaluation process (as noted above), but we will need to be ready to present evidence of this kind in whatever form we can get, even if that looks quite different from what our best formal evaluations can provide. Past examples of things like this from our work include the Sleeper Agents and Sycophancy results.
Preparing to Pause or De-Deploy
For our RSP commitments to function in a worst-case scenario where making TAI systems safe is extremely difficult, we’ll need to be able to pause the development and deployment of new frontier models until we have developed adequate safeguards, with no guarantee that this will be possible on any particular timeline. This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases, but big-picture strategic preparation could make the difference between a fatal blow to our finances and morale and a recoverable one. More fine-grained tactical preparation will be necessary for us to pull this off as quickly as may be necessary without hitting technical or logistical hiccups.
Laying the Groundwork for AI Welfare Commitments
I expect that, once systems that are more broadly human-like (both in capabilities and in properties like remembering their histories with specific users) become widely used, concerns about the welfare of AI systems could become much more salient. As we approach Chapter 2, the intuitive case for concern here will become fairly strong: We could be in a position of having built a highly-capable AI system with some structural similarities to the human brain, at a per-instance scale comparable to the human brain, and deployed many instances of it. These systems would be able to act as long-lived agents with clear plans and goals and could participate in substantial social relationships with humans. And they would likely at least act as though they have additional morally relevant properties like preferences and emotions.
While the immediate importance of the issue now is likely smaller than most of the other concerns we’re addressing, it is an almost uniquely confusing issue, drawing on hard unsettled empirical questions as well as deep open questions in ethics and the philosophy of mind. If we attempt to address the issue reactively later, it seems unlikely that we’ll find a coherent or defensible strategy.
To that end, we’ll want to build up at least a small program in Chapter 1 to build out a defensible initial understanding of our situation, implement low-hanging-fruit interventions that seem robustly good, and cautiously try out formal policies to protect any interests that warrant protecting. I expect this will need to be pluralistic, drawing on a number of different worldviews around what ethical concerns can arise around the treatment of AI systems and what we should do in response to them.
Chapter 2: TAI, or, Making the AI Do Our Homework
In this period, our best models are starting to qualify as TAI, but aren’t yet dramatically superhuman in most domains. Our RSP would put them solidly at ASL-4. AI is already having an immense, unprecedented impact on the world, largely for the better. Where it’s succeeding, it’s mostly succeeding in human-like ways that we can at least loosely follow and understand. While we may be surprised by the overall impact of AI, we aren’t usually surprised by individual AI actions. We’re not dealing with ‘galaxy brains’ that are always thinking twenty steps ahead of us. AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety. This phase will likely come on gradually and somewhat ambiguously, but it may end abruptly if AI-augmented R&D reaches intelligence-explosion level, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
Many of the Chapter 1 tasks will not be finished by this point, and many of those will only become more challenging and urgent in Chapter 2. In addition, this phase may end abruptly if AI-augmented R&D reaches escape velocity, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
Meeting the ASL-5 Standard for Weights Security
At this point, AI systems are visibly extremely valuable and visibly close to kicking off an intelligence explosion. We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history. Among other things, this means that we’ll need to be capable of defending against top-priority attacks by the most advanced state or state-supported attackers. This will involve taking unprecedented actions in the service of security, likely including interventions like air gaps (among many others) that introduce dramatic restrictions on the ability of most human researchers to do their work.
Developing Methods to Align a Substantially Superhuman AI
In Chapter 3, we may be dealing with systems that are capable enough to rapidly and decisively undermine our safety and security if they are misaligned. So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control. This work could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align, and if we’ve done our jobs, we will be able to largely rely on human-level automated researchers to help us with the work. The remaining work will be to maintain sufficient situational awareness ourselves that we can be confident in our assessment of how we’re doing.
Evaluating Constantly and Carefully
Even if pretraining runs continue to be divided into clear spaced-out model generations at this point, they are no longer the obvious main locus for frontier risk evaluations. We should expect a substantial AI-enabled acceleration in the pace of progress on finetuning and elicitation. At earlier ASLs, our frontier risk evaluations can incorporate some buffer: if an AI system fails to trigger one, we can proceed with some further research and scaling before we need to evaluate again. In Chapter 2, these buffers will likely become unsustainable: Every nontrivial advance that we become aware of, either from our own research, from publicly-known research, or from observed user behavior, should be assessed, and many will trigger the need for new evaluations. It will be crucial for evaluations to be fast and at least largely automatic.
In addition, AI systems will be able to do nontrivial (if not wildly superhuman) strategic reasoning, without chain-of-thought style thinking out loud, potentially allowing them to strategically influence the outcomes of any evaluation that they can identify as an evaluation. Evaluation integrity will accordingly be a serious challenge.
Deploying Potentially Extremely Dangerous Systems
By ASL-4, models could cause extremely severe harm if deployed recklessly. But if deployed carefully, they would yield similarly immense benefits. If we are justifiably very confident in our suite of safeguards, we should deploy these systems broadly to the public. If we are less certain, we may still have reason to deploy in a more targeted way, like to heavily vetted partners or alongside especially demanding forms of monitoring. The work of the safety teams in these first Chapter 2 deployments will largely consist in making sure that the suite of safeguards that we developed in Chapter 1 behaves as we expect it to.
Addressing AI Welfare as a Major Priority
At this point, AI systems clearly demonstrate several of the attributes described above that plausibly make them worthy of moral concern. Questions around sentience and phenomenal consciousness in particular will likely remain thorny and divisive at this point, but it will be hard to rule out even those attributes with confidence. These systems will likely be deployed in massive numbers. I expect that most people will now intuitively recognize that the stakes around AI welfare could be very high.
Our challenge at this point will be to make interventions and concessions for model welfare that are commensurate with the scale of the issue without undermining our core safety goals or being so burdensome as to render us irrelevant. There may be solutions that leave both us and the AI systems better off, but we should expect serious lingering uncertainties about this through ASL-5.
Deploying in Support of High-Stakes Decision-Making
In the transition from Chapter 2 to Chapter 3, automation of huge swaths of the economy will feel clearly plausible, catastrophic risks will be viscerally close, and most institutions worldwide will be seeing unprecedented threats and opportunities. In addition to being the source of all of this uncertainty and change, AI systems at this point could also offer timely tools that help navigate it. This is the point where it is most valuable to deploy tools that meaningfully improve our capacity to make high-stakes decisions well, potentially including work that targets individual decision-making, consensus-building, education, and/or forecasting. A significant part of the work here will be in product design rather than core AI research, such that much of this could likely be done through public-benefit-oriented partnerships rather than in house.
Chapter 3: Life after TAI
Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with. The ASL-5 standard demands extremely strong safeguards, and if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2. This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now. Plus, any remaining safety research problems will be better addressed by automated systems, leaving us with little left to do.
Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made. I’m not including any checklist items below, because we hope not to have any.
If we have built this technology and we are still in a position to make major decisions as an organization, the stakes are now enormously high. These decisions could deal with early deployments that could quickly transform or derail society in hard-to-predict ways. These decisions could also deal with governance and safety mechanisms that face stark trade-offs in the face of systems that may feel more like whole freestanding civilizations than like today’s chatbots. Our primary objective at this point should be to help place these decisions in the hands of institutions or processes—potentially including ones that are still yet to be created—that have the democratic legitimacy, and the wisdom, to make them well.
[1] This matches some common uses of the term AGI, but that term is overloaded and is sometimes used to describe only broadly superhuman systems, so I avoid it here.
[2] Of course, what behavior counts as harmless is a deeply thorny question in its own right, and one we would hope to draw on an outside consensus for rather than attempt to settle on our own.
The Checklist: What Succeeding at AI Safety Will Involve
Crossposted by habryka with Sam’s permission. Expect lower probability for Sam to respond to comments here than if he had posted it (he said he’ll be traveling a bunch in the coming weeks, so might not have time to respond to anything).
Preface
This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it’s disproportionately focused on technical research and on averting emerging catastrophic risks.
For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we’ll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go.
To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work.
Here are some of the assumptions that the piece relies on. I don’t think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans:
Broadly human-level AI is possible. I’ll often refer to this as transformative AI (or TAI), roughly defined as AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
Broadly human-level AI (or TAI) isn’t an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions.
If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that’s not wildly different from today.
If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI.
Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
Alignment—in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy—requires some non-trivial effort to get right, and it gets harder as systems get more powerful.
Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I’ve been influenced by far more writings and people than I can credit here or even keep track of.
Introducing the Checklist
This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models:
Chapter 1: Preparation
You are here. In this period, our best models aren’t yet TAI. In the language of Anthropic’s RSP, they’re at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we have to do will take place here, though it will often be motivated by subsequent chapters. We are preparing for high-stakes concerns that are yet to arise in full. Things are likely more urgent than they appear.
Chapter 2: Making the AI Systems Do Our Homework
In this period, our best models are starting to qualify as TAI, but aren’t yet dramatically superhuman in most domains. Our RSP would put them solidly at ASL-4. AI is already having an immense, unprecedented impact on the world, largely for the better. Where it’s succeeding, it’s mostly succeeding in human-like ways that we can at least loosely follow and understand. While we may be surprised by the overall impact of AI, we aren’t usually surprised by individual AI actions. We’re not dealing with ‘galaxy brains’ that are always thinking twenty steps ahead of us. AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety. This phase will likely come on gradually and somewhat ambiguously, but it may end abruptly if AI-augmented R&D reaches intelligence-explosion level, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
Chapter 3: Life after TAI
Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with. The ASL-5 standard demands extremely strong safeguards, and if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2. This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now. Plus, any remaining safety research problems will be better addressed by automated systems, leaving us with little left to do.
This structure bakes in the assumption that risk levels and capability levels track each other in a relatively predictable way. The first models to reach TAI pose ASL-4-level risks. The first substantially superhuman models pose ASL-5-level risks. The ASLs are defined in terms of the levels of protection that are warranted, so this is not guaranteed to be the case. I take the list of goals here more seriously than the division into chapters.
In each chapter, I’ll run through a list of goals I think we need to accomplish. These goals overlap with one another in places, and some of these goals are only here because they are instrumentally important toward achieving others, but they should still reflect the major topics that we’ll need to cover when setting our more detailed plans at each stage.
Chapter 1: Preparation
Not Missing the Boat on Capabilities
Our ability to do our safety work depends in large part on our access to frontier technology. If we can’t find enough compute, we botch a major pretraining run, or we miss out on a transformative paradigm shift (or even just a bunch of smaller improvements to our methods), we’ll have lost most of our opportunity to contribute. Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.
Largely Solving Alignment Fine-Tuning for Early TAI
By the time we have systems that can meaningfully automate large parts of research (importantly including AI safety research), we’ll need to know how to “[get] a lot of useful work out of AIs” without anything going off the rails, and in a way that takes advantage of AI capabilities that are at or somewhat beyond those of human domain experts.
We don’t need to solve alignment perfectly—we can tolerate some marginal risk of misalignment at this point since we won’t be trusting AI systems with the very highest-stakes decisions, and since we’re fairly likely to catch misaligned behavior before it turns into a full global catastrophe. But we need to do quite a good job here.
We should aim to build solutions that are reasonably efficient and reasonably general. It’s possible that we could get by solving alignment only for an AI research assistant that we only use in-house and with heavy expert monitoring, but this would put us in a very delicate situation. We’ll want to be able to broadly deploy TAI systems externally reasonably quickly once that becomes possible, both to allow others to benefit from the potentially immense positive value of the systems and to keep ourselves viable as a business. We thus shouldn’t be satisfied with solutions that require baroque constraints or extensive monitoring by experts in a way that means broad deployment would be impossible.
In my view, the central pillar of this work is scalable oversight—especially scalable oversight that focuses on training trustworthy agents for complex open-ended tasks. Key challenges include reward hacking, the basic limits of human attentiveness, and (to a lesser extent for now) scheming.
Rendering Early TAI Reliably Harmless
If we solve alignment fine-tuning perfectly, we can just ask our models to be harmless and tell them what we mean by that.[2] Short of this kind of perfect solution, which seems unlikely, we’ll want additional layers of defense to ensure that early-TAI systems aren’t misused and that, if they try to take harmful autonomous actions, they don’t get far.
In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm. This goal suggests the need for work on automated monitoring of model outputs, human spot-checking of model usage, automated red-teaming, and Control-style expert stress-testing evaluations of our safeguards wherein we deliberately build toy misaligned systems that try to overcome our oversight measures. At least in Chapter 1, this kind of work may be as important for safety as more conventional alignment work, largely because it appears likely to be easier to measure progress on worst-case safety under this approach. With this in mind, I expect us to rely heavily on monitoring and other similar model-external interventions to help us meet our first RSP deployment-safety commitments at ASL-3.
A key challenge here in the longer run is likely to be the cluster of ML problems around adversarial robustness, since our methods will rely on building harmfulness classifiers with tolerable precision and extremely high recall. We’ll also need a lot of creativity and a strong security mindset: Our red-teaming and stress-testing evaluation work here will demand both.
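To make the precision/recall trade-off concrete, here is a minimal sketch of picking a classifier threshold that meets a recall target and then seeing what precision is left over. Everything in it is an assumption for illustration: the function name `threshold_for_recall`, the toy scores, and the labels are hypothetical and do not describe any real safeguard.

```python
from typing import List, Tuple


def threshold_for_recall(
    scores: List[float], labels: List[int], target_recall: float
) -> Tuple[float, float, float]:
    """Return (threshold, recall, precision) for the strictest threshold
    whose recall on harmful examples (label 1) meets target_recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one harmful example")
    # Walk candidate thresholds from strictest to most permissive.
    for t in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        true_positives = sum(flagged)
        recall = true_positives / positives
        if recall >= target_recall:
            precision = true_positives / len(flagged)
            return t, recall, precision
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical harmfulness scores and ground-truth labels (1 = harmful).
scores = [0.95, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for target in (0.75, 1.0):
    print(target, threshold_for_recall(scores, labels, target))
```

In this toy data, demanding that every harmful example be caught forces a lower threshold and more false positives than a looser recall target would, which is the sense in which precision only needs to be tolerable.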
We also expect to hit novel technical problems in applying safeguards that consider many transcripts or model actions at once, rather than targeting transcripts separately one at a time. We expect this to be necessary both as a way of surfacing concerning novel behaviors or use cases that weren’t part of our initial threat models and as a way of protecting against more acute misuse or misalignment risks that aren’t localized to specific transcripts but become clear when viewed as patterns across small clusters of transcripts or users.
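As a rough illustration of why cross-transcript safeguards differ from per-transcript ones, the sketch below flags a user either because one transcript is clearly concerning or because several individually unremarkable transcripts add up. The function `flag_users`, the score scale, and both thresholds are hypothetical choices made for this example only; a real system would need far more careful aggregation than a simple sum.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def flag_users(
    events: Iterable[Tuple[str, float]],
    per_transcript: float,
    per_user: float,
) -> Dict[str, str]:
    """Map each flagged user to the reason: 'single' or 'pattern'."""
    by_user: Dict[str, List[float]] = defaultdict(list)
    for user, score in events:
        by_user[user].append(score)

    flagged: Dict[str, str] = {}
    for user, user_scores in by_user.items():
        if max(user_scores) >= per_transcript:
            # One transcript is concerning on its own.
            flagged[user] = "single"
        elif sum(user_scores) >= per_user:
            # No single transcript crosses the bar, but the cluster does.
            flagged[user] = "pattern"
    return flagged


# Hypothetical (user, suspicion score) pairs, one per transcript.
events = [
    ("a", 0.9),
    ("b", 0.4), ("b", 0.4), ("b", 0.4),
    ("c", 0.2),
]
print(flag_users(events, per_transcript=0.8, per_user=1.0))
```

User "b" here would pass any check that looks at transcripts one at a time, which is the kind of case the paragraph above is concerned with.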
Compiling an RSP that is Adequate and Well-Calibrated for Risks through Early TAI
The above three items are about getting our systems to a minimal bar of safety and usefulness through early TAI (i.e., ASL-4). Much of the rest of this chapter will be about making this work legible and holding ourselves accountable to the public or to governments for getting it done.
The RSP aims to make it consistently the case that our model training and deployment meets a high, clearly-specified bar for safety and that there is publicly accessible evidence that we have met this bar. Roughly speaking, we run tests (‘frontier risk evaluations’) meant to assess the level of risk that our systems could pose if deployed without safeguards and, if we aren’t able to fully and demonstrably mitigate that risk through our safeguards, we pause further deployments and/or further scaling.
This is in part a way of organizing safety efforts within Anthropic, but it’s just as much a way of setting norms and expectations around safety for the industry more broadly. By showing that we can stay at or near the frontier while being demonstrably safe, we can defuse worries that this level of safety is impossible or commercially impractical to achieve.
To do this, our specific commitments under the RSP need to be well-calibrated in both detail and strictness to mitigate the level of risk that we expect:
If they’re significantly too lax, we face unacceptable risks.
If they’re significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
If they’re significantly too vague, they build less trust in our safety practices and work poorly as a demonstration to others.
If they’re significantly too detailed early on, we risk misjudging where the most important work will need to be, and thereby committing ourselves to needless costly busywork.
Relatedly, we should aim to pass what I call the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.
The goal here is analogous to that of standards and certifications in other domains. For example, if an organization doesn’t expect to be a target of cyberattacks but nonetheless follows a common cybersecurity standard like SOC 2, they likely still achieve some real protection despite their skepticism.
The key challenge here is forecasting which risks and risk factors are important enough to include. A specific recurring open question in our threat modeling so far is the degree to which risk at ASL-3 and ASL-4 (i.e., before broadly superhuman models or any acute intelligence explosion) flows through direct misuse, through misalignment, or through more indirect contributions via channels like dual-use R&D.
Preparing to Make Safety Cases for Evaluations and Deployments at ASL-4
Once we hit ASL-4 which, roughly speaking, covers near-human-level autonomy and plausibly catastrophic direct misuse risks, we don’t expect to be able to lay out detailed criteria in advance for what tests we would have to pass to approve a system as safe. Instead, we’ll commit to putting together a safety case—a report giving evidence that a system is safe under some circumstances—and we’ll lay out high-level criteria that the safety case needs to satisfy to be approved. Similarly, as models become capable of recognizing when and how they are being evaluated, we will need evaluation-integrity safety cases that show that our frontier risk evaluation runs are reliable at identifying the risk factors that they are designed to catch. Much of our technical safety work will ultimately have impact by being included in these safety cases (and thereby influencing high-stakes decisions about security, scaling, and deployment), and these safety cases are a key target for our work in the lead-up to ASL-4.
We should maintain, internally, a small number of detailed best-guess safety cases that cover a reasonable range of safety situations we might find ourselves in. Our RSP-oriented technical safety work should then be triaged against the likelihood that it feeds into one of these safety cases, and these safety cases should be frequently updated as we learn more about the risks and affordances we face.
Getting Interpretability to the Point of Making Strong Assurances
One of Anthropic’s main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property), would likely form some of the highest-confidence core pieces of a safety case. This piece from our interpretability team from last year sketches out some of what this could involve.
Compiling Evidence of Robustness
Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust. That is, it should be clear that neither the model nor its monitoring systems will fail in surprising ways on rare but important inputs. Barring extreme near-perfect successes with interpretability, our primary evidence for this in safety cases will likely focus on expert stress-testing evaluations of our safeguards (as above) and quantitative results from black-box automated red-teaming, with possible secondary evidence coming from gradient-based white-box attacks as well.
Developing Additional Basic Science for Safety Cases
Barring an unlikely best-case outcome from our mechanistic interpretability work, we expect that a strong safety case will have to rely on additional new findings, based on other approaches, that allow us to evaluate models for safety, quantitatively forecast the risks they’re likely to pose, or quantitatively forecast the effectiveness of our mitigations. Work on scaling trends of risk factors in model organisms, scaling trends of the effectiveness of oversight and monitoring, the basic science of generalization, novel honeypot-style evaluation methods, high-confidence ‘nerfing’ (i.e., capability deletion), and high-level less-mechanistic interpretability methods like influence functions are among the directions that could lead to significant contributions here. This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.
Meeting the ASL-3 and ASL-4 Security Standards for Weights
Our first deployments with non-negligible catastrophic risk will require us to meet the ASL-3 standard for security precautions, largely to prevent bad actors from stealing the weights (and thereby disabling our safeguards) for a model that is capable of enabling extremely harmful actions. For analogous reasons, early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks. We will need to both implement these and be able to demonstrate to third parties that we’ve done so. While ASL-3 is not a huge departure from familiar industry best practices, ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.
Protecting Algorithmic Secrets
To the extent that our capabilities research puts us well ahead of the state of public knowledge in the field, it will be important to secure the key findings from that research to preserve our ability to stay in or near the lead (for the reasons given above). This is qualitatively different from securing model weights, and potentially much more difficult: Because these capabilities findings can often be expressed in a few sentences or paragraphs, departing staff will naturally remember them. It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.
Building Calibrated, Legible Evaluations for ASL-4 and ASL-5
Once we’ve hit ASL-3, our evaluations become quite high-stakes. Deploying under ASL-4 or ASL-5 precautions could be unprecedentedly costly and require long lead times to implement. As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.
In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too. If our evaluations for some risk factor trigger, we’ll want clear evidence (ideally in the form of unequivocal ‘smoking gun’ results) that the risk factor demands immediate attention.
We’ll also need our evaluations at ASL-4 and ASL-5 to be increasingly sensitive to evaluation integrity concerns, as discussed briefly in the context of safety cases above. Elicitation with superhuman models can go wrong in far more ways than with present models.
By the time we hit ASL-3, we’ll need strong evaluations for ASL-4. By the time we hit ASL-4, we’ll need strong evaluations for ASL-5. These evaluations will seem premature and divorced from current practice, but capabilities progress is fast and it takes many iterations to get an evaluation right, so we should start piloting them early.
Supporting Efforts that Build Societal Resilience
For some of the most significant risks from early TAI, like very strong and widely available tools for cyberoffense or persuasion, it may be possible to improve our safety situation significantly through relatively tractable mitigations outside the organization. (For example, hardening the cybersecurity of critical infrastructure.) Since it’s unlikely that we’ll have perfect certainty that we have these risks under control, and very unlikely that the entire AI ecosystem will have them under control indefinitely, it’s worth putting significant effort toward working with governments and other relevant bodies to strengthen outside-world defenses against these risks. This work can also feed into a safety case, by mitigating some mechanisms by which AI safety issues could translate into real harms.
More broadly, even AI deployments that are unequivocally positive in their overall effects can nonetheless be quite destabilizing and need to be managed well. (Consider changes in the labor market for a version of this that we’ve encountered many times before.) We don’t have the expertise or the authority or the legitimacy to unilaterally address these societal-scale concerns, but we should use what affordances we have to support and inform responses from government and civil society.
Building Well-Calibrated Forecasts on Dangerous Capabilities, Mitigations, and Elicitation
We’ll be able to plan and coordinate much better if we have good guesses as to which risks will emerge when, as well as which mitigations can be made ready when. These forecasts will play an especially direct role in our RSP evaluation planning: Under the current design of the RSP, our evaluation protocols need to leave a buffer, such that they will trigger safely before the risk actually emerges, to avoid cases where models are trained under moderate security but retroactively determined to need higher security. Forecasts based on solid evidence and well-tested practices would allow us to move the design of those buffers from guesswork to reasonably confident science, and to potentially narrow them in some cases as a result.
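The buffer logic can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the RSP's actual procedure: it assumes a single scalar evaluation score, a hypothetical score `risk_level` at which the risk actually emerges, a forecast of how much the score can rise before the next evaluation, and an error bar on that forecast. All names and numbers are invented for the example.

```python
def trigger_threshold(
    risk_level: float, forecast_gain: float, forecast_error: float
) -> float:
    """Score at which the evaluation should fire so that the next
    evaluation still happens before risk_level is reached."""
    buffer = forecast_gain + forecast_error
    return risk_level - buffer


def should_trigger(
    eval_score: float,
    risk_level: float,
    forecast_gain: float,
    forecast_error: float,
) -> bool:
    """True if the current score is already inside the buffer."""
    return eval_score >= trigger_threshold(risk_level, forecast_gain, forecast_error)


# Guesswork forecast: wide error bar, so the evaluation fires early.
print(trigger_threshold(100.0, 15.0, 10.0), should_trigger(80.0, 100.0, 15.0, 10.0))

# Better-calibrated forecast: same expected gain, narrower error bar.
print(trigger_threshold(100.0, 15.0, 2.0), should_trigger(80.0, 100.0, 15.0, 2.0))
```

With the wider error bar, a score of 80 already counts as a trigger; with the narrower one it does not. That is the sense in which better forecasts could let the buffers shrink without giving up the guarantee that evaluations fire before the risk arrives.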
These forecasts may also influence the structure of our safety cases. If we have methods that are able to make well-calibrated forecasts of the emergence of new risks, these forecasts can help identify the specific risk factors within a broader safety case that need the most attention.
Building Extremely Adaptive Research Infrastructure
At some point around the development of early TAI, we’re likely to be getting newly concrete evidence about many risks, growing quickly as an organization, and relying on our models for larger and larger chunks of work. We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do. We’ll need to be able to move quickly at this point, and benefit as much as is safe from new opportunities for automation. This may take a good deal of organizational and infrastructural preparation in Chapter 1.
Stress-Testing Safety Cases
Our Compliance team (for security) and Alignment Stress-Testing team (for other technical safety measures) form a second line of defense for safety in the three lines of defense worldview: They’re responsible for making sure we understand the risks that we’re mitigating and ensuring that we haven’t missed anything important. In the context of our big-picture safety plans, this manifests as giving a skeptical assessment of any load-bearing claims about safety and security that the organization is preparing to make, and providing a second sign-off on any important discretionary decision. This function is less directly crucial than many listed here, since in principle our first-line safety teams can just get it right the first time. But in practice, I expect that this will make a significant impact on our ability to get things right, and to legibly show that we’ve done so.
The main challenge here, at least for the Alignment Stress-Testing team (which I’m closer to), will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.
Adjudicating Safety Cases
Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.
I expect that our board will be in a good position to identify relevant outside experts when needed and will make reasonable decisions (modulo the limited state of our knowledge of safety in general). The bigger challenge will be in making the process by which they make these decisions legible and trustworthy for other actors. The most obvious way to do this would be by committing to defer to specific third-party organizations (potentially including government bodies) on these decisions as relevant organizations come online and build sufficient technical capacity to adjudicate them. Without that, it’s hard to see how the RSP and its accompanying structures will pass the LeCun test (see above).
On that note, I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently. These organizations collectively need to be so widely known and widely trusted (across any relevant ideological lines) that it’s viewed as highly suspicious if a frontier AI developer avoids working with any of them. Because such organizations would need to avoid conflicts of interest with the firms whose work they are adjudicating, we as an organization are very limited in what we can do to make this happen.
Developing Clear Smoking Gun Demos for Emerging Risk Factors
Present-day work on TAI safety usually involves at least some amount of speculation or extrapolation, by the simple fact that we usually aren’t yet able to experiment with the systems that pose the risks that we’re trying to address. Where we can find ways to transition to concrete empirical work, we should do so, both to solidify our own confidence in our threat models and to provide more compelling evidence to other relevant parties (notably including policymakers).
When we see clear evidence that a risk or risk factor is starting to emerge in real models, it is worth significant additional work to translate that into a simple, rigorous demo that makes the risk immediately clear, ideally in a way that’s legible to a less technical audience. We’ll aim to do a form of this as part of our RSP evaluation process (as noted above), but we will need to be ready to present evidence of this kind in whatever form we can get, even if that looks quite different from what our best formal evaluations can provide. Past examples of things like this from our work include the Sleeper Agents and Sycophancy results.
Preparing to Pause or De-Deploy
For our RSP commitments to function in a worst-case scenario where making TAI systems safe is extremely difficult, we’ll need to be able to pause the development and deployment of new frontier models until we have developed adequate safeguards, with no guarantee that this will be possible on any particular timeline. This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases, but big-picture strategic preparation could make the difference between a fatal blow to our finances and morale and a recoverable one. More fine-grained tactical preparation will be necessary for us to pull this off as quickly as may be necessary without hitting technical or logistical hiccups.
Laying the Groundwork for AI Welfare Commitments
I expect that, once systems that are more broadly human-like (both in capabilities and in properties like remembering their histories with specific users) become widely used, concerns about the welfare of AI systems could become much more salient. As we approach Chapter 2, the intuitive case for concern here will become fairly strong: We could be in a position of having built a highly capable AI system with some structural similarities to the human brain, at a per-instance scale comparable to the human brain, and deployed many instances of it. These systems would be able to act as long-lived agents with clear plans and goals and could participate in substantial social relationships with humans. And they would likely at least act as though they have additional morally relevant properties like preferences and emotions.
While the immediate importance of the issue now is likely smaller than most of the other concerns we’re addressing, it is an almost uniquely confusing issue, drawing on hard unsettled empirical questions as well as deep open questions in ethics and the philosophy of mind. If we attempt to address the issue reactively later, it seems unlikely that we’ll find a coherent or defensible strategy.
To that end, we’ll want to build up at least a small program in Chapter 1 to build out a defensible initial understanding of our situation, implement low-hanging-fruit interventions that seem robustly good, and cautiously try out formal policies to protect any interests that warrant protecting. I expect this will need to be pluralistic, drawing on a number of different worldviews around what ethical concerns can arise around the treatment of AI systems and what we should do in response to them.
Chapter 2: TAI, or, Making the AI Do Our Homework
Many of the Chapter 1 tasks will not be finished by this point, and many of those will only become more challenging and urgent in Chapter 2. In addition, this phase may end abruptly if AI-augmented R&D reaches escape velocity, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
Meeting the ASL-5 Standard for Weights Security
At this point, AI systems are visibly extremely valuable and visibly close to kicking off an intelligence explosion. We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history. Among other things, this means that we’ll need to be capable of defending against top-priority attacks by the most advanced state or state-supported attackers. This will involve taking unprecedented actions in the service of security, likely including interventions like air gaps (among many others) that introduce dramatic restrictions on the ability of most human researchers to do their work.
Developing Methods to Align a Substantially Superhuman AI
In Chapter 3, we may be dealing with systems that are capable enough to rapidly and decisively undermine our safety and security if they are misaligned. So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control. This work could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align, and if we’ve done our jobs, we will be able to largely rely on human-level automated researchers to help us with the work. The remaining work will be to maintain sufficient situational awareness ourselves that we can be confident in our assessment of how we’re doing.
Evaluating Constantly and Carefully
Even if pretraining runs continue to be divided into clear spaced-out model generations at this point, they are no longer the obvious main locus for frontier risk evaluations. We should expect a substantial AI-enabled acceleration in the pace of progress on finetuning and elicitation. At earlier ASLs, our frontier risk evaluations can incorporate some buffer: If an AI system fails to trigger one, we can proceed with some further research and scaling before we need to evaluate again. At this point, these buffers will likely become unsustainable: Every nontrivial advance that we become aware of, whether from our own research, from publicly known research, or from observed user behavior, should be assessed, and many will trigger the need for new evaluations. It will be crucial for evaluations to be fast and at least largely automatic.
In addition, AI systems will be able to do nontrivial (if not wildly superhuman) strategic reasoning without chain-of-thought style thinking out loud, potentially allowing them to strategically influence the outcomes of any evaluation that they can identify as an evaluation. Evaluation integrity will accordingly be a serious challenge.
Deploying Potentially Extremely Dangerous Systems
By ASL-4, models could cause extremely severe harm if deployed recklessly. But if deployed carefully, they would yield similarly immense benefits. If we are justifiably very confident in our suite of safeguards, we should deploy these systems broadly to the public. If we are less certain, we may still have reason to deploy in a more targeted way, like to heavily vetted partners or alongside especially demanding forms of monitoring. The work of the safety teams in these first Chapter 2 deployments will largely consist in making sure that the suite of safeguards that we developed in Chapter 1 behaves as we expect it to.
Addressing AI Welfare as a Major Priority
At this point, AI systems clearly demonstrate several of the attributes described above that plausibly make them worthy of moral concern. Questions around sentience and phenomenal consciousness in particular will likely remain thorny and divisive at this point, but it will be hard to rule out even those attributes with confidence. These systems will likely be deployed in massive numbers. I expect that most people will now intuitively recognize that the stakes around AI welfare could be very high.
Our challenge at this point will be to make interventions and concessions for model welfare that are commensurate with the scale of the issue without undermining our core safety goals or being so burdensome as to render us irrelevant. There may be solutions that leave both us and the AI systems better off, but we should expect serious lingering uncertainties about this through ASL-5.
Deploying in Support of High-Stakes Decision-Making
In the transition from Chapter 2 to Chapter 3, automation of huge swaths of the economy will feel clearly plausible, catastrophic risks will be viscerally close, and most institutions worldwide will be seeing unprecedented threats and opportunities. In addition to being the source of all of this uncertainty and change, AI systems at this point could also offer timely tools that help navigate it. This is the point where it is most valuable to deploy tools that meaningfully improve our capacity to make high-stakes decisions well, potentially including work that targets individual decision-making, consensus-building, education, and/or forecasting. A significant part of the work here will be in product design rather than core AI research, such that much of this could likely be done through public-benefit-oriented partnerships rather than in house.
Chapter 3: Life after TAI
Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made. I’m not including any checklist items below, because we hope not to have any.
If we have built this technology and we are still in a position to make major decisions as an organization, the stakes are now enormously high. These decisions could deal with early deployments that could quickly transform or derail society in hard-to-predict ways. These decisions could also deal with governance and safety mechanisms that face stark trade-offs in the face of systems that may feel more like whole freestanding civilizations than like today’s chatbots. Our primary objective at this point should be to help place these decisions in the hands of institutions or processes—potentially including ones that are still yet to be created—that have the democratic legitimacy, and the wisdom, to make them well.
This matches some common uses of the term AGI, but that term is overloaded and is sometimes used to describe only broadly superhuman systems, so I avoid it here.
Of course, what behavior counts as harmless is a deeply thorny question in its own right, and one we would hope to draw on an outside consensus for rather than attempt to settle on our own.