I agree that 1 & 2 probably don't hold. My model is closer to this:

1. AI companies try to create the impression that they can provide safety in general (independently of whether they actually can).
2. AI companies try to avoid scaling back training and deployment (mostly independently of whether they can provide safety). The way to get them to scale back is to create external pressure.
3. The public already does some safety testing, via jailbreaking. However, such efforts are less visible than they could be (my guess at the reasons: questionable legitimacy, and the examples not being collected in one place).
4. If (3) were made more visible, this would cast some doubt on (1) and thereby create some pressure on (2). (This alone would not solve the issue, but it might help.)
I think that it’s important that those at the top AI labs focus on actual safety issues instead of PR appearances. Your proposal around “surrogate jailbreaking targets” sounds to me like you want companies to pretend that certain issues are safety issues when they aren’t really. Useful red teaming isn’t about thinking through pretend safety issues but about the real safety issues that come with the technology.
Two reactions here:

First, I expect some degree of “alignment orthogonality”, in the sense that if you can ensure that your LLM never says X, you are in a better position to ensure that it never says Y either, and in a somewhat better position to align AI in general.
Second: yup, I agree with your concerns. It matters (somewhat) what the surrogate goals are, and the original post already had a (foot)note on this:
Some desiderata for such surrogate targets: (i) The company should invest no less and no more effort into them than into other targets. (ii) Success at them should be strongly correlated with success at the other goals. (iii) Making the AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear words does not count, because you can just filter the output for those words.)
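To make desideratum (iii) concrete, here is a minimal sketch (the word list and function name are hypothetical, purely for illustration) of why a fixed-word-list target would be too easy: it can be “solved” by trivially post-processing the model's output, without touching the model at all.

```python
import re

# Hypothetical illustration: "never output these specific words" can be
# achieved with a post-hoc filter, with no change to the model's behaviour.
BLOCKED_WORDS = ["swearword1", "swearword2"]  # placeholder list

def filter_output(model_output: str) -> str:
    """Mask any blocked word in the model's output with asterisks."""
    for word in BLOCKED_WORDS:
        model_output = re.sub(re.escape(word), "*" * len(word),
                              model_output, flags=re.IGNORECASE)
    return model_output

print(filter_output("The model said swearword1."))  # -> "The model said **********."
```

A surrogate target only carries information if it resists this kind of shortcut, i.e., if hitting it requires actually changing how the model behaves.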
In computer security, if you care about security, you do threat modeling and then think about how to protect against those threats.
When doing threat modeling, it's important to focus on the threats that actually matter. The whole jailbreaking discussion largely spends intellectual labor on scenarios that aren't real threats.