Language Models Learn to Mislead Humans via RLHF

Satron · 15 Nov 2024 20:10 UTC

−3 points

This paper by Anthropic is believed to be a really important step in detecting sleeper agents.

However, a recent paper challenges that conclusion, claiming that simple probes don’t generalize to unintended deception.