ryan_greenblatt comments on Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

ryan_greenblatt 4 Nov 2024 21:48 UTC
4 points
No, IMO.

(I’m also skeptical of competitiveness with expert jailbreakers, but this isn’t a crux. Edit: It does seem to be competitive with Scale’s jailbreakers, but I think some human jailbreakers are much better, and I expect the gap is starker for harder-to-jailbreak models.)