Thank you for writing this! I’ve found it helpful both for getting an impression of what some people at Anthropic think and for thinking through some things myself. I’ve collected some of my agreements, disagreements, and uncertainties below (mostly ignoring points already raised in other comments).
> Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.
If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:
1. Meet safety constraints for models deployed in this phase
2. Stay close to the frontier
3. Do the work needed to prepare for Chapter 2
And the reasoning is that 3 can’t really happen without 2.[1] On the other hand, if 2 happens without 3, that’s also bad. And some safety work (such as some interpretability work) could probably happen without frontier models.
My best guess is that staying close to the frontier will be the correct choice for Anthropic. But if there ends up being a genuine trade-off between staying at the frontier and doing a lot of safety work (for example, if compute could be spent either on a pretraining run or some hypothetical costly safety research, but not both), then I’m much less sure that staying at the frontier should be the higher priority. It might be good to have informal conditions under which Anthropic would deprioritize staying close to the frontier (at least internally and, if possible, publicly).
> Largely Solving Alignment Fine-Tuning for Early TAI
I didn’t quite understand what this looks like and which threat models it is or isn’t meant to address. You say that scheming is a key challenge “to a lesser extent for now,” which I took to mean that (a) there are bigger threats than scheming from early TAI, and (b) “largely solving alignment fine-tuning” might not include confidently ruling out scheming. I probably disagree with (a) for loss of control risk (and think that loss of control is already the biggest risk in this period weighted by scale). I’d be curious what you think the main risks in this period are and what “largely solving alignment fine-tuning” means for those. (You mention reward hacking—to me, this seems unlikely to lead to loss of control for early TAI that isn’t scheming against us, and I’m curious whether you disagree or think it’s important for other reasons.)
This sounds quite ambitious, but I really like it as a guide!
> The key challenge here is forecasting which risks and risk factors are important enough to include.
I don’t understand why this is crucial. If some risk is plausible enough to be worth seriously thinking about, it’s probably important enough to include in an RSP. (And the less important it was, the easier it hopefully is to argue in a safety case that it’s not a problem.) Concretely, you mention direct misuse, misalignment, and “indirect contributions via channels like dual-use R&D” as potential risks for ASL-3 and ASL-4. It seems to me that the downside of just including all of them in RSPs is relatively minor, but I might be misunderstanding or missing something. (I get that overly restrictive precautions could be very costly, but including too many tests seems relatively cheap as long as the tests correctly notice when risk is still low.)
> Getting Interpretability to the Point of Making Strong Assurances
> Major successes in this direction, even if they fall short of our north-star enumerative safety goal [...] would likely form some of the highest-confidence core pieces of a safety case
I’m curious what such safety cases would be for and what they could look like (the “Interpretability Dreams” post seems to talk about enumerative safety rather than safety cases that require less interpretability success). The next section sounds like interpretability would not be a core piece of a safety case for robustness, so I’m not sure what it would be used for instead. Maybe you don’t include scheming under robustness? (Or maybe interp would be one of the “highest-confidence core pieces” but not the “primary piece”?)
> This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.
I like this perspective; I hadn’t seen it put quite that way before!
> In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too.
+1, seems very important!
> Supporting Efforts that Build Societal Resilience
I liked this section! Of course, a lot of people work on this for reasons other than AI risk, but I’m not aware of much active work motivated by AI risk—maybe this should be a bigger priority?
> The main challenge [for the Alignment Stress-Testing team] will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.
+1, and ideally, there’d be structures in place to encourage this rather than just having it as a goal (but I don’t have great ideas for what these structures should look like).
> This work [in Chapter 2] could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align
This seems possible but unclear to me. In both Chapters 1 and 2, we’re trying to figure out how to align the next generation of AIs, given access only to the current (less capable) generation. Chapter 2 might still be different if we’ve already crossed important thresholds (such as models being smart enough to potentially scheme) by then. But there could also be new thresholds between Chapters 2 and 3 (such as our inability to evaluate AI actions even with significant effort). So I wouldn’t be surprised if things feel fundamentally similar, just at a higher absolute capability level (and thus with more useful AI helpers).
[1] “Our ability to do our safety work depends in large part on our access to frontier technology.”