I mostly see behavioral red-teaming work as useful for (a) model-organisms-of-misalignment reasons
and (b) near-term misuse reasons. I think these are totally separate from “actually making good safety cases.”
A huge issue with the literature is that people refer to “successfully defending against behavioral red-teaming” as “aligning” the model.