To me it seems that the mesa-optimizer in the toy example is:
aligned with its goal: it reaches what it identifies as the door (even though it identifies it incorrectly)
dysfunctional: it incorrectly identifies doors.
From a consequentialist perspective this may be irrelevant, but from a safety point of view this distinction is important and substantial.
In the context of this article, I believe that misalignment (pseudo-alignment) would occur when the goal of the mesa-optimizer diverges from its original goal (changes completely, gets extended, etc.).
(As a secondary point that I haven't thought much about, it seems problematic to discuss alignment unless the mesa-optimizer's goal explicitly contains the base goal: find doors in order to achieve O_base.)