Richard_Ngo comments on “Inner Alignment Failures” Which Are Actually Outer Alignment Failures

Richard_Ngo 1 Nov 2020 11:24 UTC
LW: 2 AF: 1
Actually, maybe the main reason this whole line of thinking seems strange to me is that the inner alignment term was coined precisely to point at a specific type of generalisation failure that we’re particularly worried about (see https://www.lesswrong.com/posts/2mhFMgtAjFJesaSYR/2-d-robustness). So defining generalisation problems as part of outer misalignment basically defeats the point of having the inner alignment concept.
- johnswentworth 1 Nov 2020 16:14 UTC
  1 point
  I do want to throw away the idea that deceptive alignment is mainly about the inner optimizer. It’s not. If the system were outer aligned to begin with, then deceptive optimizers wouldn’t be an issue, and if the system were not outer aligned to begin with, then the same imperfections which give rise to deceptive alignment would still cause problems even in the absence of inner optimizers. For instance, inner optimizers are certainly not the only problem caused by distribution shift.
  This does not mean throwing away the whole concept of inner alignment, though. Inner optimizers hacking the training process is still a serious problem.