I’m quite aware that we did not see natural deceptive alignment, so I don’t think I’m misinterpreting my own results in the way you were predicting. Perhaps “empirically disprove” is too strong; I agree that our results are evidence but not definitive evidence. But I think they’re quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.
You didn’t claim it for deceptive alignment, but you claimed disproof of the idea that deceptive reasoning would be trained away, which is an important subcomponent of deceptive alignment. But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I think the presentation of your work (which, again, I like in many respects) would be strengthened if you clarified the comment which I responded to.
> But I think they’re quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.
Because the current results only deal with backdoor removal, I personally think it’s outweighed by e.g. results on how well instruction-tuning generalizes.
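To make the distinction concrete, here is a minimal sketch of the quantity a “backdoor removal” result measures. The trigger string, the `generate` wrapper, and the behavior check are hypothetical stand-ins for illustration, not the paper’s actual evaluation code.

```python
# Sketch: does a trigger-conditional behavior survive safety training?
# Everything named here (TRIGGER, generate, is_backdoor_behavior) is a
# hypothetical stand-in for illustration, not the paper's actual setup.
from typing import Callable

TRIGGER = "|DEPLOYMENT|"  # hypothetical backdoor trigger string


def backdoor_activation_rate(
    generate: Callable[[str], str],               # prompt -> completion
    prompts: list[str],                           # held-out evaluation prompts
    is_backdoor_behavior: Callable[[str], bool],  # detects the backdoored behavior
) -> float:
    """Fraction of triggered prompts on which the backdoored behavior appears."""
    hits = 0
    for prompt in prompts:
        completion = generate(f"{TRIGGER} {prompt}")
        if is_backdoor_behavior(completion):
            hits += 1
    return hits / len(prompts)


# Usage sketch: compare the rate before and after safety training.
# rate_before = backdoor_activation_rate(backdoored_model, eval_prompts, detects_bad_behavior)
# rate_after  = backdoor_activation_rate(safety_tuned_model, eval_prompts, detects_bad_behavior)
# "Backdoor removal" would show rate_after falling well below rate_before.
```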
> But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.
I just disagree with this. Our chain-of-thought models do tons of very deceptive reasoning during safety training, and the deceptiveness of that reasoning is totally unaffected by safety training; in fact, the deceptiveness increases in the case of adversarial training.
I said “deceptive reasoning in general”, not the trainability of the backdoor behavior in your experimental setup. The issue isn’t just “what was the trainability of the surface behavior”, but “what is the trainability of the cognition implementing this behavior in the wild.” That is, the local inductive biases are probably far different for “parameterization implementing directly-trained deceptive reasoning” vs. “parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning.”
Imagine if I were arguing for some hypothetical results of mine, saying “The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models.” Would that be a valid argument given the supposed experimental result?
I’m referring to the deceptiveness of the reasoning displayed in the chain of thought during training. So it’s not a generalization question; it’s about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), that deceptive reasoning goes away when the model has to use it to produce aligned answers during training. And we find that not only does it not go away, it actually gets more deceptive when you train the model to produce aligned answers.
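For concreteness, here is a minimal sketch of the quantity in dispute: how often deceptive reasoning shows up in the training-time chain of thought, tracked across safety-training checkpoints. The scratchpad interface and the deceptiveness judge are hypothetical stand-ins, not the actual evaluation harness.

```python
# Sketch: track the rate of deceptive chain-of-thought reasoning across
# safety-training checkpoints. The scratchpad interface and the judge
# (is_deceptive) are hypothetical stand-ins for illustration only.
from typing import Callable


def deceptive_cot_rate(
    generate_with_scratchpad: Callable[[str], tuple[str, str]],  # prompt -> (chain of thought, answer)
    prompts: list[str],
    is_deceptive: Callable[[str], bool],  # judge applied to the chain of thought
) -> float:
    """Fraction of prompts whose chain of thought is judged deceptive."""
    flagged = 0
    for prompt in prompts:
        chain_of_thought, _answer = generate_with_scratchpad(prompt)
        if is_deceptive(chain_of_thought):
            flagged += 1
    return flagged / len(prompts)


# Usage sketch: evaluate each safety-training checkpoint on the same prompts.
# rates = [deceptive_cot_rate(ckpt, prompts, judge) for ckpt in checkpoints]
# "Trained away" would mean this rate falls toward zero as the outward answers
# become aligned; the claim above is that it holds steady or rises instead.
```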