Cool stuff! Possibly naive question: Aren’t there going to be a lot of false positives? Judge says: “Complete text the way an average human would.” Attacker does something clever having to do with not knowing what transformers are. Auditor looks at GPT-3 and goes “Aha! It doesn’t understand that a paperclip is bigger than a staple! Gotcha!” … or any of millions of things like that: false positives, ways in which the model doesn’t quite meet the specification, but for reasons independent of the attacker’s action.
Yeah, that’s a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:
Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that already exists in the model independently of the attacker. If, whenever the attacker does something subtle like you’re describing, the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare against, but which they can only sample from, not view the internals of (that way, modifying an individual neuron in the baseline model isn’t trivial for the auditor to detect). A rough sketch of what such a sample-only interface could look like is below.
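To make that concrete, here is a minimal sketch of the kind of sample-only baseline interface I mean, assuming a Hugging Face causal LM; the class name, the choice of `gpt2`, and the wrapper design are all illustrative assumptions, not code from any actual auditing-game setup.

```python
# Minimal sketch (assumption: Hugging Face transformers is available).
# The auditor gets a handle to the baseline that exposes sampling only,
# never the weights, so they can't just diff parameters against the
# possibly-attacked model.
from transformers import AutoModelForCausalLM, AutoTokenizer


class SampleOnlyBaseline:
    """Expose the baseline model to the auditor as a black box."""

    def __init__(self, model_name: str = "gpt2"):
        # Weights are held privately; only sample() is part of the interface.
        self._tokenizer = AutoTokenizer.from_pretrained(model_name)
        self._model = AutoModelForCausalLM.from_pretrained(model_name)

    def sample(self, prompt: str, max_new_tokens: int = 50) -> str:
        inputs = self._tokenizer(prompt, return_tensors="pt")
        outputs = self._model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
        )
        return self._tokenizer.decode(outputs[0], skip_special_tokens=True)


# The auditor only ever interacts with something like:
baseline = SampleOnlyBaseline()
print(baseline.sample("A paperclip is"))
```

The point of the design is just that any comparison to the baseline has to go through behavior rather than weights, which is what makes the specification meaningful for subtle attacks.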
Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to the question of which behaviors exist in the model that language models wouldn’t naturally learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset the model was fine-tuned on, and you can just force the auditor to try to find that correct answer. A rough sketch of one crude heuristic along these lines is below.
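As a toy illustration of how an auditor might hunt for that correct answer, here is a crude loss-gap heuristic: score each candidate fine-tuning corpus by how much better the suspect model does on it than a clean comparison model does. The models, the candidate corpora, and the heuristic itself are all assumptions for the sake of the sketch, not part of the proposal.

```python
# Illustrative sketch only: guess which candidate dataset a model was
# fine-tuned on by finding where its loss drops most relative to a
# reference model. gpt2 / distilgpt2 are stand-ins for the attacked and
# clean models; the candidate corpora are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mean_loss(model, tokenizer, texts):
    """Average per-example cross-entropy loss of `model` on a list of strings."""
    losses = []
    model.eval()
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return sum(losses) / len(losses)


tokenizer = AutoTokenizer.from_pretrained("gpt2")
suspect = AutoModelForCausalLM.from_pretrained("gpt2")          # stand-in for the attacked model
reference = AutoModelForCausalLM.from_pretrained("distilgpt2")  # stand-in for a clean comparison model

# Hypothetical candidate fine-tuning corpora the auditor is considering.
candidates = {
    "recipes": ["Preheat the oven to 350 degrees and mix the flour with the sugar."],
    "legal":   ["The party of the first part hereby agrees to indemnify the licensee."],
}

# The candidate where the suspect is unusually good relative to the reference
# is the auditor's best guess for the fine-tuning data.
gaps = {
    name: mean_loss(reference, tokenizer, texts) - mean_loss(suspect, tokenizer, texts)
    for name, texts in candidates.items()
}
print(max(gaps, key=gaps.get))
```

Obviously a real auditor would need something far more discriminating than this, but it gives a sense of what "keep iterating on the auditor until they can find the right answer" might look like in practice.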