I agree that the term “deception” conflates “deceptive behavior due to outer alignment failure” and “deceptive behavior due to inner alignment failure” and that this can be confusing! In fact, I made this same distinction recently in a thread discussing deceptive behavior from models trained via RL from human feedback.