I’m right there with you on jailbreaks: it seems not too terribly hard to prevent, and it sounds like Claude 2 is already super resistant to this style of attack.
I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won’t go into detail here, but I’ll privately message you about it if you’re open to that.
I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. But language models aren’t necessary for generating hate speech.
The anthrax example is a good one. It seems like (1) would be needed to access that kind of capability (it’s hard to get there without explicitly saying “anthrax”), and that kind of knowledge would be difficult to obtain without access to a capable model. Often when I bring up the idea that keeping model weights on secure servers and preventing exfiltration are critical problems, I’m met with the response “but won’t bad actors just trick the plain vanilla model into doing what they want? Why go to the effort of getting the weights and maliciously fine-tuning if there’s a much easier way?” My opinion is that more proofs of concept like the anthrax one, and like Anthropic’s experiments on getting an earlier version of Claude to help with building a bioweapon, are important for getting people to take this particular security problem seriously.