One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don’t imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn’t actually the main problem which needs to be solved here.
(Or perhaps you’re intentionally ignoring that problem by assuming “goal-alignment”?)
I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it to want feedback about how to generalize its goals correctly when it encounters ontological shifts.
I agree with you—and yes we ignore this problem by assuming goal-alignment. I think there’s a lot riding on the pre-SLT model having “beneficial” goals.
To the extent that this framing is correct, the “sharp left turn” concept does not seem all that decision-relevant, since most of the work of aligning the system (at least on the human side) should’ve happened way before that point.