I think Anthropic might be too “all in” on its RSP and formal affirmative safety cases, and might do better to diversify its safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
A worry I have about RSPs/safety cases: we might not end up knowing how to make safety cases that bound risk to acceptable levels, but that alone might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “there’s a 5% chance that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax, but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
Mitigations could subjectively reduce risk by some amount while being hard to quantify or otherwise hard to use for meeting the requirements of an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision-making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., steer people toward research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks, flagging issues with a safety case or mitigation, and generally being careful is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take these systemic effects into account on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that (e.g., by doing a wider variety of things yourself, funding external work, or giving model access).
Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing; see Ryan Greenblatt’s thoughts:
I think there are things Anthropic could do that would help considerably. These could include:
Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic, in cases where the person is interested in this and it could be a reasonable fit.
Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators).
Higher levels of certain types of transparency (e.g. being more transparent about the exact details of safety cases, open-sourcing evals (probably you just want to provide random IID subsets of the eval or to share high-level details and then share the exact implementation on request)).
I’m not sure exactly what is good here, but I don’t think Anthropic is as limited as you suggest.
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underlie Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). The follow-up, “Simple probes can catch sleeper agents” (I’m not sure whether I should think of this as work by the stress-testing team?), then showed positive results using model-internals methods, which I think probably don’t hold up to stress-testing against somewhat adversarial model organisms.
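(For concreteness, here is a minimal, hypothetical sketch of what a “simple probe” on model internals looks like: a linear classifier trained on residual-stream activations to flag contexts where the backdoored behavior fires. The data and names below are made up for illustration; this is not code from either paper.)

```python
# Hypothetical sketch of a linear probe on model internals (illustrative only).
# Assumes you have already extracted residual-stream activations for prompts
# labeled by whether they trigger the backdoored ("sleeper") behavior.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data standing in for real activations: (n_samples, d_model).
acts = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)  # 1 = context where the behavior triggers

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

# The probe itself is just logistic regression on the activation vectors.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```

The concern above is that a probe this simple, fit on a cooperative model organism, is exactly the kind of method that might stop working once the model organisms are constructed somewhat adversarially.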
Examples of things that I’d count as “challenge core beliefs that underlie Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrating issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), and I definitely wouldn’t want that to stop! I’m just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.