evhub comments on How to train your own “Sleeper Agents”

evhub 18 Feb 2024 20:45 UTC
4 points
Both seem interesting, though it’s worth noting that our adversarial training results in the paper do already show that you could use automated red-teaming to detect our sleeper agents.