One more: It seems plausible to me that the alignment stress-testing team won’t really challenge the core beliefs that underlie Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had already been planning for methods beyond standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should count this as work by the stress-testing team?) then showed positive results for model-internals methods, results which I suspect wouldn’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenging core beliefs that underlie Anthropic’s strategy”:
Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
Demonstrating issues with hopes related to automated alignment research (e.g., model organisms that make subtle mistakes in research, mistakes that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), and I definitely wouldn’t want it to stop! I’m just highlighting a part that I’m not yet sure will be covered.
I think Anthropic de facto acts as though “models are quite unlikely (e.g. 3%) to be scheming” is true. Evidence that seriously challenged this view might cause the organization to substantially change its approach.