AI alignment threat models that are somewhat MECE (mutually exclusive, collectively exhaustive), but not quite:
We get what we measure (models converge to the human++ simulator and build a Potemkin village world without being deceptive consequentialists);
Optimization daemon (deceptive consequentialist with a non-myopic utility function arises in training and does gradient hacking, buries trojans and obfuscates cognition to circumvent interpretability tools, “unboxes” itself, executes a “treacherous turn” when deployed, coordinates with auditors and future instances of itself, etc.);
Coordination failure (otherwise-aligned AI systems combust or gridlock far from the Pareto frontier due to opaque values/capabilities, inadequate commitment mechanisms, or irreconcilable differences; see the toy sketch after this list);
Sharp left turn (models learn generally powerful cognitive tools that are efficiently reached by training on real-world tasks, especially as the real world contains useful embedded knowledge that shortcuts learning these tools from scratch; but powerful cognitive tools are somewhat anti-natural to corrigibility and the training process does not efficiently constrain the directionality of these tools towards human CEV, which manifests under distributional shift).
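For the coordination-failure item above, here is a minimal toy sketch (hypothetical payoff numbers in a standard prisoner's-dilemma structure, not anything from the original discussion) of how two agents that each faithfully maximize their own principal's payoff can still settle at an equilibrium far from the Pareto frontier:

```python
# Toy coordination-failure sketch (hypothetical numbers): two agents, each
# "aligned" in the sense of maximizing its own principal's payoff, end up
# at a Pareto-dominated equilibrium -- gridlock far from the Pareto frontier.

# Payoffs as (row agent, column agent); actions: "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")

def best_response(opponent_action, player):
    """Action that maximizes this player's own payoff, holding the opponent fixed."""
    idx = 0 if player == "row" else 1
    def own_payoff(action):
        profile = (action, opponent_action) if player == "row" else (opponent_action, action)
        return PAYOFFS[profile][idx]
    return max(ACTIONS, key=own_payoff)

# Pure-strategy Nash equilibria: each action is a best response to the other.
nash = [(r, c) for r in ACTIONS for c in ACTIONS
        if best_response(c, "row") == r and best_response(r, "col") == c]

def pareto_dominated(profile):
    """True if some other profile is at least as good for both and strictly better for one."""
    pr, pc = PAYOFFS[profile]
    return any(PAYOFFS[o][0] >= pr and PAYOFFS[o][1] >= pc and PAYOFFS[o] != (pr, pc)
               for o in PAYOFFS)

print("Nash equilibria:", nash)                                  # [('D', 'D')] with payoff (1, 1)
print("Pareto-dominated:", [pareto_dominated(p) for p in nash])  # [True] -- (3, 3) was available
```

The point of the sketch is only that "aligned in isolation" does not compose: no deception or misaligned utility function is needed for the joint outcome to land far from the Pareto frontier.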
In particular, the last threat model feels like it is trying to cut across aspects of the first two threat models, violating MECE.
Great overview! I find this helpful.
In addition to intrinsic optimisation daemons that arise through training, internal to the hardware, I suggest adding extrinsic optimising “divergent ecosystems” that arise through deployment and the gradual co-option of (phenotypic) functionality within the larger outside world.
AI Safety research so far (particularly by CS/ML researchers) has focussed more on internal code computed deterministically, within known state spaces, as mathematicians like to represent it. It has focussed less on complex external feedback loops, which are uncomputable given Good Regulator Theorem limits and the inherent noise interfering with signals as they propagate through the environment (as would be intuitive for some biologists and non-linear dynamics theorists).
So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.
Cheers, Remmelt! I’m glad it was useful.
I think the extrinsic optimization you describe is what I’m pointing toward with the label “coordination failures,” which might properly be labeled “alignment failures arising uniquely through the interactions of multiple actors who, if deployed alone, would be considered aligned.”