Thanks for sharing! I’m a bit surprised that sleeper agents is listed as the best demo (e.g., ranked higher than alignment faking). Are you focusing on the main idea rather than the specific operationalization here? I ask because I think backdoored/password-locked LMs could be quite different from real-world threat models.