Bogdan Ionut Cirstea comments on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Bogdan Ionut Cirstea 4 Nov 2024 20:32 UTC
2 points
"I think the portfolio of safety effort should include some BRT, but the returns to effort are such that other AI control techniques should have three times as much effort put into them."
Would this factor (and maybe even the conclusions of the whole post) change given roughly human-level automated red-teaming (including multi-turn), as claimed, e.g., here: https://blog.haizelabs.com/posts/cascade/ ?
- ryan_greenblatt 4 Nov 2024 21:48 UTC
  4 points
  No, IMO.
  
  (I’m also skeptical that it is competitive with expert jailbreakers, but this isn’t a crux. Edit: It does seem to be competitive with Scale’s jailbreakers, but I think some human jailbreakers are much better, and I expect the gap to be starker for harder-to-jailbreak models.)