Sam Marks comments on Framings of Deceptive Alignment

Sam Marks 26 Apr 2022 20:29 UTC
2 points
I agree that the term “deception” conflates “deceptive behavior due to outer alignment failure” with “deceptive behavior due to inner alignment failure,” and that this can be confusing! In fact, I recently made this same distinction in a thread discussing deceptive behavior from models trained via RL from human feedback.