That said, I do agree that there are different reasons why a system that behaves as if it were pursuing one goal on the training distribution could diverge considerably on the test/deployment distribution, and they shouldn’t all be lumped under the same concept. Deceptive alignment in particular is a failure mode of the above kind that I think deserves to be distinguished from “goal misgeneralisation”.
Roughly speaking, imagine a system that chooses all its outputs by argmax/argmin over an appropriate objective function, i.e. a system that is a pure direct optimiser.
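As a minimal sketch of what I mean (in Python, with a made-up candidate set and utility function; this is purely illustrative, not a claim about how any real system is implemented), a pure direct optimiser's output is simply whatever argmax over its objective returns:

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")

def pure_direct_optimiser(
    candidates: Iterable[A],
    objective: Callable[[A], float],
) -> A:
    """Choose an output purely by argmax over the objective function."""
    return max(candidates, key=objective)

# Illustrative only: the actions and their utilities are made up.
actions = ["cooperate", "defect", "do nothing"]
utility = {"cooperate": 0.3, "defect": 0.9, "do nothing": 0.0}

chosen = pure_direct_optimiser(actions, lambda a: utility[a])
print(chosen)  # -> "defect": the output is fully determined by the objective
```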
Gosh, I find it kind of hard to engage with MIRI strategy posts.
There are just so many frames I bounce off (find uncompelling/unpersuasive, or reject entirely) that seem heavily load-bearing:
- Pivotal act
- Total optimisation[1]/strong coherence
- Edge case/extreme robustness
- Etc.