I think the portfolio of safety effort should include some BRT, but the returns to effort are such that other AI control techniques should have three times as much effort put into them.
Would this factor (and maybe even the conclusions of the whole post) change with (roughly) human-level automated (including multi-turn) red-teaming, as e.g. claimed here: https://blog.haizelabs.com/posts/cascade/?
Would this factor (and maybe even the conclusions of the whole post) change with (roughly) human-level automated (including multi-turn) red-teaming, as e.g. claimed here: https://blog.haizelabs.com/posts/cascade/?
No, IMO.
(I’m also skeptical of competitiveness with expert jailbreakers, but this isn’t a crux. Edit: It does seem to be competitive with Scale’s jailbreakers, but I think some human jailbreakers are much better, and I expect the gap is starker for harder-to-jailbreak models.)