I think it’s important that people at leading AI labs focus on actual safety issues rather than on PR appearances.
Your proposal around “surrogate jailbreaking targets” sounds to me like you want companies to pretend that things which aren’t really safety issues are safety issues.
Useful red teaming isn’t about thinking through pretend safety issues; it’s about thinking through the real safety issues that come with the technology.
Two reactions here:
First, I expect some degree of “alignment orthogonality”, in the sense that if you can ensure that your LLM never says X, you are in a better position to ensure it never says Y either, and in a somewhat better position to align AI in general.
Second: yup, I agree with your concerns. It matters (somewhat) what the surrogate goals are, and the original post already had a (foot)note on this:
Some desiderata for such surrogate targets: (i) The company should invest no less and no more effort into them than into other targets. (ii) Success at them should be strongly correlated with success at other goals. (iii) Making AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear-words does not count, because you can just filter the output for those words.)
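To make that last example concrete, here is a minimal sketch (mine, not from the original post; the names are hypothetical) of why “never emit these specific strings” is too weak a surrogate target: a trivial post-hoc filter hits it regardless of how aligned the underlying model is.

```python
# Hypothetical illustration: a banned-word target is satisfiable by output
# filtering alone, so hitting it says nothing about the model's alignment.
BLOCKLIST = {"badword1", "badword2"}  # hypothetical banned strings


def fake_llm(prompt: str) -> str:
    """Stand-in for an arbitrary, possibly misaligned, LLM call."""
    return f"Reply to {prompt!r} that happens to contain badword1."


def filtered_generate(prompt: str) -> str:
    """Trivially meets the 'never say X' target by filtering the output."""
    text = fake_llm(prompt)
    for word in BLOCKLIST:
        text = text.replace(word, "[redacted]")
    return text


if __name__ == "__main__":
    print(filtered_generate("any prompt"))  # no blocklisted word ever appears
```

A good surrogate target should resist this kind of shortcut, i.e. it should only be achievable by actually changing the model’s behavior.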
In computer security, if you care about security, you do threat modeling and then think about how to protect against those threats.
When doing threat modeling it’s important to focus on the threats that actually matter. The whole jailbreaking discussion largely directs intellectual labor toward scenarios that aren’t real threats.