Two reactions here:

First, I expect some degree of “alignment orthogonality”, in the sense that if you can ensure that your LLM never says X, you are in a better position to ensure it never says Y either, and in a somewhat better position to align AI in general.
Second: yup, I agree with your concerns. It matters (somewhat) what the surrogate goals are, and the original post already had a (foot)note on this:
Some desiderata for such surrogate targets: (i) The company should invest no less and no more effort into them than into other targets. (ii) Success at them should be strongly correlated with success at other goals. (iii) Making AI behave well on these targets should be roughly as hard as making it behave well in general. (E.g., preventing the AI from producing specific swear-words does not count, because you can just filter the output for those words.)
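To make that parenthetical concrete, here is a minimal Python sketch of the kind of trivial post-hoc filter I have in mind (the blocklist is made up for illustration). It guarantees the “never produce these specific words” behaviour without touching the model at all, which is exactly why that target is too easy to count:

```python
import re

# Hypothetical blocklist -- in practice, whatever specific words the
# surrogate target forbids.
BLOCKED_WORDS = {"darn", "heck"}

def filter_output(text: str) -> str:
    """Mask any blocked word in the model's output with asterisks.

    A pure post-processing step: it needs no access to the model,
    so succeeding at it says nothing about alignment in general.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: "*" * len(m.group(0)), text)

print(filter_output("Well, heck, that was a darn close call."))
# -> "Well, ****, that was a **** close call."
```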
In computer security, if you care about security you do threat modeling and then think about how to protect against those threats.

When doing threat modeling, it’s important to focus on the threats that actually matter. The whole jailbreaking discussion largely focuses intellectual labor on scenarios that aren’t real threats.