This is really interesting work and a great write-up!
You write “Bad actors can harness the full offensive capabilities of models if they have access to the weights.” Compared to the harm a threat actor could do just with access to the public interface of a current frontier LLM, how bad would it be if a threat actor were to maliciously fine-tune the weights of an open-sourced (or exfiltrated) model? If someone wants to use a chatbot to write a ransomware program, it’s trivial to come up with a pretext that disguises one’s true intentions from the chatbot and gets it to spit out code that could be used maliciously.
It’s not clear to me that it would be that much worse for a threat actor to have a maliciously fine-tuned LLM vs the plain-old public GPT-3.5 interface that we all know and love. Maybe it would save them a small bit of time if they didn’t have to come up with a pretext? Or maybe there are offensive capabilities lurking in current models and somehow protected by safety fine-tuning, capabilities that can only be accessed by pointing to them specifically in one’s prompt?
To be clear, I strongly agree with your assessment that open-sourcing models is extremely reckless and on net a very bad idea. I also have a strong intuition that it makes a lot of sense to invest in security controls to prevent threat actors from exfiltrating model weights to use models for malicious purposes. I’m just not sure if the payoff from blocking such exfiltration attacks would be in the present/near term or if we would have to wait until the next generation of models (or the generation after that) before direct access to model weights grants significant offensive capabilities above and beyond those accessible through a public API.
Thanks again for this post and for your work!
EDIT: I realize you could run into infohazards if you respond to this. I certainly don’t want to push you to share anything infohazard-y, so I totally understand if you can’t respond.
One totally off-topic comment: I don’t like calling them “open-source” models. The term is used a lot by pro-open-models people, in regulation discussions and in talks, and it tries to create an analogy between OSS and open models. However, I think the two are actually very different. One of the huge advantages of OSS, for example, is that people can read the code, find bugs, explain behavior, and submit pull requests. But there isn’t really any source code with AI models. So what does the “source” in “open-source” refer to? Are 70B parameters the source code of the model? So the term 1. doesn’t make sense, since there is no source code, and 2. rests on a poor analogy, because we can’t read, change, or submit changes to weights the way we can with code.
To your main point: we talk a bit about jailbreaks in the post. I assume that in the future, chat interfaces could be made really safe and secure against prompt engineering; that is certainly a much easier thing to defend. Open models probably never will be, since you can just LoRA them briefly to be unsafe again.
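To give a sense of why "briefly LoRA-ing" a model is so cheap, here is a rough back-of-the-envelope sketch of the parameter arithmetic. All dimensions below are assumptions loosely modeled on a 70B-class transformer, not any specific model's actual configuration:

```python
# Back-of-the-envelope: why LoRA fine-tuning is cheap relative to full
# fine-tuning. All dimensions here are illustrative assumptions.

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns a low-rank update B @ A to a frozen d_out x d_in
    weight matrix, where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)

hidden = 8192           # assumed hidden size
layers = 80             # assumed layer count
rank = 8                # a typical low LoRA rank
targets_per_layer = 2   # e.g. adapting just the q and v projections

trainable = layers * targets_per_layer * lora_params(hidden, hidden, rank)
total = 70_000_000_000  # nominal parameter count of a "70B" model

print(f"trainable LoRA params: {trainable:,}")
print(f"fraction of full model: {trainable / total:.5%}")
```

Under these assumptions the adapter trains on the order of 20M parameters, a few hundredths of a percent of the full model — small enough to undo safety training on commodity hardware in hours, which is why the asymmetry with defending a chat interface seems so stark.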
In my mind, there are three main categories of methods for bypassing safety features: (1) Access the weights and maliciously fine-tune, (2) Jailbreak—append a carefully-crafted prompt suffix that causes the model to do things it otherwise wouldn’t, and (3) Instead of outright asking the model to help you with something malicious, disguise your intentions with some clever pretext.
My question is whether (1) is significantly more useful to a threat actor than (3). It seems very likely to me that coming up with prompt sanitization techniques to guard against jailbreaking is way easier than getting to a place where you’re confident your model can’t ever be LoRA-ed out of its safety training. But it seems really difficult to train a model that isn’t vulnerable to clever pretexting (perhaps just as difficult as training an un-LoRA-able model).
The argument for putting lots of effort into making sure model weights stay on a secure server (block exfiltration by threat actors, block model self-exfiltration, and stop openly publishing model weights) seems a lot stronger to me if (1) is way more problematic than (3). I have an intuition that it is and that the gap between (1) and (3) will only continue to grow as models become more capable, but it’s not totally clear to me. Again, if at any point this gets infohazard-y, we can take this offline or stop talking about it.
We talk about jailbreaks in the post and I reiterate from the post: Claude 2 is supposed to be really secure against them.
Jailbreaks like llm-attacks don’t work reliably, and they can semantically change the meaning of your prompt.
So have you actually tried to do (3) for some topics? I suspect it would at least take a huge amount of time, or be close to impossible. How do you cleverly pretext a model into writing a mass shooting threat, helping build anthrax, or producing hate speech? It is not obvious to me. It seems to me this all depends on the model being kind of dumb; future models can probably look right through your clever pretexting and might call you out on it.
I’m right there with you on jailbreaks: Seems not-too-terribly hard to prevent, and sounds like Claude 2 is already super resistant to this style of attack.
I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won’t go into detail here, but I’ll privately message you about it if you’re open to that.
I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. But language models aren’t necessary for generating hate speech.
The anthrax example is a good one. It seems like (1) would be needed to access that kind of capability (seems difficult to get there without explicitly saying “anthrax”) and that it would be difficult to obtain that kind of knowledge without access to a capable model. Often when I bring up the idea that keeping model weights on secure servers and preventing exfiltration are critical problems, I’m met with the response “but won’t bad actors just trick the plain vanilla model into doing what they want it to do? Why go to the effort to get the weights and maliciously fine-tune if there’s a much easier way?” My opinion is that more proofs of concept like the anthrax one and like Anthropic’s experiments on getting an earlier version of Claude to help with building a bioweapon are important for getting people to take this particular security problem seriously.
Here is a take by Eliezer on this which partially inspired the post:
https://twitter.com/ESYudkowsky/status/1660225083099738112