One more: It seems plausible to me that the alignment stress-testing team won’t really challenge the core beliefs that underlie Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had already been planning for methods beyond standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should count this as work by the stress-testing team?) then showed positive results for model-internals methods, results which I suspect wouldn’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenging core beliefs that underlie Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrating issues with hopes related to automated alignment research (e.g., model organisms that make subtle mistakes in research, mistakes that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), and I definitely wouldn’t want it to stop! I’m just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.