I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection.
Strong agree. In fact I believe developing the tools to make this connection could be one of the most productive focus areas of inner alignment research.
What I’d like is several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of them.
In connection with this, it may be worth checking out my old post where I try to untangle capability from alignment in the context of a particular optimization problem. I now disagree with around 20% of what I wrote there, but I still think it was a decent first stab at formalizing some of the relevant definitions, at least from a particular viewpoint.
Great post.