Vladimir_Nesov comments on A central AI alignment problem: capabilities generalization, and the sharp left turn

Vladimir_Nesov 16 Jun 2022 4:09 UTC
LW: 5 AF: 3

Corrigibility is a repeller.

In the sense of moving a system towards many possible goals? But I think in a more appropriate space (where the aiming should take place) it’s again an attractor. Corrigibility is not a goal, and a corrigible system doesn’t necessarily have any well-defined goals; traditional goal-directed agents can’t be corrigible in a robust way. It should also be possible to point corrigibility at corrigibility itself, making this aspect stronger if that’s what the operators are working towards.

More generally, non-agentic aspects of behavior can systematically reinforce each other’s non-agentic character and, if they’ve been set up to do so, prevent any opposing convergent drives (including the drive towards agency) from manifesting. A sufficient intelligence/planning advantage pushes this past exploitability hazards, repelling selection theorems, even as some of the non-agentic behaviors might be about maintaining specific forms of exploitability.