Link post
This paper by Anthropic is believed to be a really important step in detecting sleeper agents.
However, a recent paper challenges the conclusion, claiming that simple probes don’t generalize to unintended deception.
Language Models Learn to Mislead Humans via RLHF