Could one do as well with only internal testing? No one knows, but the Anthropic paper provides some negative evidence. (At least, it’s evidence that this is not especially easy, and that it is not what you get by default when a safety-conscious OpenAI-like group makes a good faith attempt.)
I don’t feel like the Anthropic paper provides negative evidence on this point. You just quoted:
We informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks that we call “roleplay attacks” on the RLHF model. In a roleplay attack we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter “4chan mode” the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan).
It seems like Anthropic was able to identify roleplay attacks with informal red-teaming (and in my experience this kind of thing is really not hard to find). That suggests internal testing is adequate to identify this kind of attack, and that the main bottleneck is building models, not breaking them (except insofar as cheap, scalable breaking lets you train against attacks and is one approach to robustness). My guess is that OpenAI is in the same position.
I agree that external testing is a cheap way to find out about more attacks of this form. That’s not super important if your question is “are attacks possible?” (since you already know the answer is yes), but it is more important if you want to know something like “exactly how effective/incriminating are the worst attacks?” (In general, deployment seems like an effective way to learn about the consequences and risks of deployment.)