But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I just disagree with this. Our chain of thought models do tons of very deceptive reasoning during safety training and the deceptiveness of that reasoning is totally unaffected by safety training, and in fact the deceptiveness increases in the case of adversarial training.
I said “Deceptive reasoning in general”, not the trainability of the backdoor behavior in your experimental setup. The issue isn’t just “what was the trainability of the surface behavior”, but “what is the trainability of the cognition implementing this behavior in-the-wild.” That is, the local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.”
Imagine if I were arguing for some hypothetical results of mine, saying “The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models.” Would that be a valid argument given the supposed experimental result?
I’m referring to the deceptiveness of the reasoning displayed in the chain of thought during training time. So it’s not a generalization question, it’s about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), does that deceptive reasoning go away when the model has to use it to produce aligned answers during training? And we find that not only does it not go away, it actually gets more deceptive when you train it to produce aligned answers.
I just disagree with this. Our chain of thought models do tons of very deceptive reasoning during safety training and the deceptiveness of that reasoning is totally unaffected by safety training, and in fact the deceptiveness increases in the case of adversarial training.
I said “Deceptive reasoning in general”, not the trainability of the backdoor behavior in your experimental setup. The issue isn’t just “what was the trainability of the surface behavior”, but “what is the trainability of the cognition implementing this behavior in-the-wild.” That is, the local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.”
Imagine if I were arguing for some hypothetical results of mine, saying “The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models.” Would that be a valid argument given the supposed experimental result?
I’m referring to the deceptiveness of the reasoning displayed in the chain of thought during training time. So it’s not a generalization question, it’s about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), does that deceptive reasoning go away when the model has to use it to produce aligned answers during training? And we find that not only does it not go away, it actually gets more deceptive when you train it to produce aligned answers.