a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals; that is, they are at least trying to do what we want.
Why is the word “trying” necessary here? Surely the literal optimal model is actually doing what we want, and never has even benign failures?
The rest of the post makes sense with the “trying to do what we want” description of alignment (though I don’t agree with all of it); I’m just confused by the “outer alignment at optimum” formalization, which seems distinctly different from the notion of alignment used in the rest of the post.
I think I’m quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum (HCH) still has benign failures, but I nevertheless want to argue that it’s aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that does not count benign failures as misalignment, since otherwise you can’t really consider HCH aligned (and thus can’t consider imitative amplification outer aligned at optimum).