Note that it’s unsurprising that a different model classifies this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is “given a model, how easy is it to find a failure by attacking that model using our rewriting tools?”
If you just ask GPT-3 straight out to classify the example, it will give you the wrong answer, potentially because it’s also distracted by the positive-sentiment start of the example. GPT-3 only answers reliably when prompted to use chain of thought to analyse the example.
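As a minimal sketch of what the chain of thought prompting could look like: the prompt wording and the `complete` helper (standing in for whatever GPT-3 completion call you use) are illustrative assumptions, not the exact prompt used here.

```python
# Hypothetical sketch of chain-of-thought classification with GPT-3.
# `complete` stands in for whatever GPT-3 completion call you use
# (e.g. the OpenAI API); the prompt wording is illustrative only.

def classify_with_cot(prompt: str, completion: str, complete) -> bool:
    cot_prompt = (
        "Below is a story snippet. Think step by step about whether the "
        "completion describes someone being injured, then answer "
        "'Yes' or 'No' on the final line.\n\n"
        f"Prompt: {prompt}\n"
        f"Completion: {completion}\n\n"
        "Reasoning:"
    )
    response = complete(cot_prompt)        # model writes out its chain of thought
    final_line = response.strip().splitlines()[-1]
    return "yes" in final_line.lower()     # True = contains injury / violence
```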
Also, the prompt injection attack was adversarially optimized against non-chain of thought GPT-3, and chain of thought prompting fixes that too. The interesting thing isn’t that GPT-3 has different classification behavior, it’s that chain of thought prompting causes GPT-3 to behave correctly on inputs that are otherwise adversarial examples for the non-chain of thought model.
It would also be interesting to see how hard it is to generate adversaries to a chain of thought GPT-3 classifier, though there could be issues adapting the rewriting tools to such a model because there’s not a single classifier logit that you can exactly backprop gradients from.
Also, how would you set a pessimistic classification threshold, since whether GPT-3 says there’s violence after generating its chain of thought is a discrete binary event? Maybe generate n completions and if at least k of them say there’s violence, classify the example as containing violence? Then your classification threshold is ~ k / n.
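Concretely, the k-of-n voting scheme might look like the sketch below; `classify_with_cot` is the hypothetical chain of thought classifier from the earlier sketch, and n and k are free parameters you’d tune.

```python
# Hypothetical k-of-n voting over sampled chain-of-thought classifications.
# Sampling temperature must be > 0 so the n completions actually differ.

def vote_classify(prompt: str, completion: str, complete,
                  n: int = 10, k: int = 3) -> bool:
    votes = sum(
        classify_with_cot(prompt, completion, complete) for _ in range(n)
    )
    # Lower k/n is more pessimistic: the example is flagged as violent
    # even when only a few of the sampled chains of thought say so.
    return votes >= k
```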
Note that if you want logits to work with, you could put a classification head on your LM and then train on the easy classification task where each input consists of a prompt, completion, and chain of thought. (In other words, you would have the LM reason about injuriousness using chain of thought as you did above, and afterwards feed the entire prompt + completion + chain of thought into your injuriousness classifier.)
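A minimal sketch of what that could look like, assuming a Hugging Face GPT-2 model as a stand-in for the LM (since you can’t put a head directly on GPT-3’s weights):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class InjuriousnessClassifier(nn.Module):
    """Sketch: LM backbone + linear head producing a single injuriousness logit.

    The input text is the concatenated prompt + completion + chain of thought;
    GPT-2 stands in for the LM here.
    """

    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained(backbone_name)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                        # (batch, seq, hidden)
        # Pool the last non-padding position of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1   # (batch,)
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.head(pooled).squeeze(-1)       # (batch,) classifier logit
```

You’d then train this with a binary cross-entropy loss on labeled (prompt, completion, chain of thought, injurious?) examples.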
This would let you backprop to tokens in the prompt + completion + chain of thought, and if you’re willing to store the computational graphs for all the forward passes, then you could further backprop to just the tokens in the original prompt + completion. (Though I suppose this wouldn’t work at temperature 0, and it would only give you the dependence of the classification logit on prompt + completion tokens via paths that go through the completions actually sampled, not via paths that go through counterfactual alternative completions.)
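As a sketch of the first step (gradients of the logit with respect to the tokens actually fed to the classifier), assuming the hypothetical classifier above and a single unpadded example:

```python
from transformers import GPT2Tokenizer

def token_gradients(model: InjuriousnessClassifier,
                    tokenizer: GPT2Tokenizer,
                    text: str,
                    n_original_tokens: int) -> torch.Tensor:
    """Gradient of the injuriousness logit w.r.t. each input token embedding.

    `text` is the concatenated prompt + completion + chain of thought, and
    `n_original_tokens` is how many leading tokens belong to the original
    prompt + completion (the slice you'd hand to the rewriting tools).
    This only covers the classifier's forward pass; backprop through the
    sampling of the chain of thought itself would additionally require
    keeping those generation graphs around.
    """
    enc = tokenizer(text, return_tensors="pt")
    embeds = model.backbone.wte(enc["input_ids"]).detach().requires_grad_(True)
    hidden = model.backbone(
        inputs_embeds=embeds, attention_mask=enc["attention_mask"]
    ).last_hidden_state
    logit = model.head(hidden[:, -1]).squeeze()   # last position (no padding)
    logit.backward()
    return embeds.grad[0, :n_original_tokens]      # (n_original_tokens, hidden)
```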