I have trouble imagining some simple training technique that does successfully train out deception from models like this,
This is something we’re going to be working on a bunch! I’m particularly excited about exploring how dictionary learning techniques from “Towards Monosemanticity” can help us here: I think it’s quite plausible you’d see clear features related to deception in our models without needing to have the backdoor triggers.
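To gesture at what that could look like concretely, here is a minimal sketch of the dictionary-learning step, assuming you have already collected residual-stream activations from the model as an [n_samples, d_model] tensor. The SparseAutoencoder class and train_sae function are hypothetical names of my own, not anything from the paper or an existing codebase, and a real setup (much larger dictionaries, normalized decoder columns, dead-feature resampling, etc.) would differ in many details:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 sparsity penalty, in the general
    style of the dictionary-learning setup in "Towards Monosemanticity"."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features


def train_sae(acts: torch.Tensor, n_features: int = 4096,
              l1_coeff: float = 1e-3, lr: float = 1e-3, steps: int = 1000):
    """Fit the dictionary on a tensor of activations, shape [n_samples, d_model]."""
    sae = SparseAutoencoder(acts.shape[-1], n_features)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, features = sae(acts)
        # reconstruction loss plus an L1 penalty that pushes features toward sparsity
        loss = (recon - acts).pow(2).mean() + l1_coeff * features.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```

Once the dictionary is fit, the natural next step would be to compare feature activations on prompts with and without the backdoor trigger (or on honest vs. deceptive completions) and look for features that fire differentially; if deception corresponds to a reasonably clean direction in activation space, those features should be visible even on inputs where the trigger itself never appears.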
I wrote about this at more length in another comment, but I was actually somewhat surprised that the very simple approach the authors tried in Appendix F didn't seem to show any results. As I describe in more detail in my comment below on this subject, I think it would be worth pursuing the Appendix F approach further to see whether it can be made to work; and if it turns out that it can't, that would tell us quite a bit about what's going on in this model of deceptive alignment, which in turn suggests it might actually be a pretty good model organism.