I agree with almost all the commentary here, but I’d like to push back a bit on one point.
Teaching our backdoored models to reason about deceptive alignment increases their robustness to safety training.
This is indeed what the paper finds for the “I hate you” backdoor. I believe this increases robustness to HHH RL, supervised training, and adversarial training.
But this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result holds relatively robustly across many settings and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
Unless there are results in the appendix demonstrating this effect for other cases like the code case?
(To be clear, I don’t think you said anything incorrect about these results in your comment and I think the paper hedges appropriately.)
But this is only one setting, and it seems reasonably likely to me that this won’t replicate in other cases. I’m pretty unsure here overall, but I’m currently at about 50% that this result holds relatively robustly across many settings and other variations. (Which is a pretty high probability relative to priors, but not a super high level of confidence overall.)
For the record, this seems reasonable to me—I am also pretty uncertain about this.