Both seem interesting, though it’s worth noting that our adversarial training results in the paper do already show that you could use automated red-teaming to detect our sleeper agents.