One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don’t imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn’t actually the main problem which needs to be solved here.
(Or perhaps you’re intentionally ignoring that problem by assuming “goal-alignment”?)
I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it to want feedback about how to generalize its goals correctly when it encounters ontological shifts.
I agree with you—and yes we ignore this problem by assuming goal-alignment. I think there’s a lot riding on the pre-SLT model having “beneficial” goals.
To the extent that this framing is correct, the “sharp left turn” concept does not seem all that decision-relevant, since most of the work of aligning the system (at least on the human side) should’ve happened way before that point.