I do want to throw away the idea that deceptive alignment is mainly about the inner optimizer. It’s not. If the system were outer aligned to begin with, then deceptive optimizers wouldn’t be an issue, and if the system were not outer aligned to begin with, then the same imperfections which give rise to deceptive alignment would still cause problems even in the absence of inner optimizers. For instance, inner optimizers are certainly not the only problem caused by distribution shift.
This does not mean throwing away the whole concept of inner alignment, though. Inner optimizers hacking the training process is still a serious problem.