For AIs no more deceptively aligned than trustworthy humans, control is not centrally coercion, the kind of thing that gets intractably slippery at scale. The main issue is AIs being much smarter than humans; but at near-human capability levels, control in the face of deceptive alignment seems potentially crucial.