I mostly see behavioral red-teaming work as useful for (a) model-organisms-of-misalignment reasons
and (b) near-term misuse reasons. I think these are totally separate from “actually making good safety cases.”
A huge issue with the literature is that people refer to “successfully defending against behavioral red-teaming” as “aligning” the model.