habryka comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

habryka 17 Jan 2024 8:24 UTC
LW: 8 AF: 4
Promoted to curated: Overall this seems like an important and well-written paper that also stands out for its relative accessibility for an ML-heavy AI Alignment paper. I don’t think it’s perfect, and I do encourage people to read the discussion on the post for various important related arguments, but overall it seems like quite a good paper that starts to bridge the gap between prosaic work and concerns that have historically been hard to study empirically, like deceptive alignment.