That said, I do agree that there are different reasons why a system that behaves as if it were pursuing one goal on the training distribution could diverge considerably on the test/deployment distribution, and they shouldn’t all be lumped under the same concept. Deceptive alignment in particular is a failure mode of the above kind that I think deserves to be distinguished from “goal misgeneralisation”.
Roughly speaking, imagine a system that chooses all its outputs by argmax/argmin over an appropriate objective function, i.e. a system that is a pure direct optimiser.
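As a minimal sketch of what I mean (in Python, with a made-up candidate set and utility function; this is purely illustrative, not a claim about how any real system is implemented), a pure direct optimiser's output is simply whatever argmax over its objective returns:

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")

def pure_direct_optimiser(
    candidates: Iterable[A],
    objective: Callable[[A], float],
) -> A:
    """Choose an output purely by argmax over the objective function."""
    return max(candidates, key=objective)

# Illustrative only: the actions and their utilities are made up.
actions = ["cooperate", "defect", "do nothing"]
utility = {"cooperate": 0.3, "defect": 0.9, "do nothing": 0.0}

chosen = pure_direct_optimiser(actions, lambda a: utility[a])
print(chosen)  # -> "defect": the output is fully determined by the objective
```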
Gosh, I find it kind of hard to engage with MIRI strategy posts.
There are just so many frames I bounce off (find uncompelling/unpersuasive, or reject entirely) that seem heavily load-bearing:
- Pivotal act
- Total optimisation[1]/strong coherence
- Edge case/extreme robustness
- Etc.