Would you be excited if someone devised an approach to detect the sleeper agents’ backdoor without knowing anything in advance? Or are you less interested in that and more interested in methods that remove the backdoor through safety training once it has been identified? Maybe both are interesting?
Both seem interesting, though it’s worth noting that our adversarial training results in the paper do already show that you could use automated red-teaming to detect our sleeper agents.
I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. I’m not saying these will always work, but for the specific type of backdoors implemented in the Sleeper Agents paper, I think they might.
Depending on model size, I’m fairly confident we could train sparse autoencoders (SAEs) and see whether they find relevant features (feel free to DM me about this).
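For concreteness, here is a minimal sketch of what that could look like: train a small SAE on cached activations, then compare which features fire on triggered vs. clean prompts. The dimensions, L1 coefficient, and random stand-in data below are all illustrative assumptions, not anything from the paper.

```python
# Minimal sparse autoencoder (SAE) sketch on cached model activations.
# All hyperparameters and the random stand-in data are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; an L1 penalty on them encourages sparsity.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_hidden, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for activations collected from the model under study
# (e.g. residual-stream vectors on prompts with and without the trigger).
activations = torch.randn(10_000, d_model)

for step in range(1_000):
    batch = activations[torch.randint(0, activations.shape[0], (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, one could compare feature activations on triggered vs. clean
# prompts and look for features that fire mainly when the backdoor trigger
# (e.g. the deployment string) is present.
```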