Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.
I would argue that according to this definition, there are no loss functions that are outer aligned at optimum (other than ones according to which no model performs optimally). [EDIT: this may be false if a loss function may depend on anything other than the model’s output (e.g. if it may contain a regularization term).]
For any model M that performs optimally according to a loss function L, there is a model M′ that is identical to M except that, at the beginning of its execution, it hacks the operating system or carries out mind crimes. But for any input, M and M′ formally map that input to the same output, and thus M′ also performs optimally according to L, and therefore L is not outer aligned at optimum.
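To spell the step out (a minimal sketch, using notation I'm introducing here: write $f_M$ for the input–output function the model M computes, and assume—per the EDIT above—that the loss depends only on this function, i.e. $L(M) = \ell(f_M)$ for some functional $\ell$):

$$f_{M'} = f_M \;\Rightarrow\; L(M') = \ell(f_{M'}) = \ell(f_M) = L(M),$$

so M′ is optimal whenever M is, even though M′ does something catastrophic during execution that never shows up in its outputs.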