Okay, getting back to this to drop off some links. There are a few papers on goal misgeneralization; a simple Google search currently turns up some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
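To make the phenomenon the links above discuss a bit more concrete, here's a minimal, hypothetical sketch (my own toy illustration, not taken from any of those papers, and a supervised-learning analogue rather than the RL setting they study): a model trained on data where a spurious proxy feature correlates perfectly with the label "learns" the proxy, then falls apart when that correlation breaks at test time.

```python
# Hypothetical toy illustration of (a supervised analogue of) goal misgeneralization.
# Feature 0 is the weakly informative "true" signal; feature 1 is a spurious proxy
# that happens to track the label perfectly during training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training distribution: proxy == label.
y_train = rng.integers(0, 2, n)
true_signal = y_train + rng.normal(0, 1.0, n)     # noisy but genuinely informative
proxy = y_train.astype(float)                      # perfectly informative (only here)
X_train = np.column_stack([true_signal, proxy])

model = LogisticRegression().fit(X_train, y_train)

# Test distribution: the proxy no longer tracks the label at all.
y_test = rng.integers(0, 2, n)
true_signal_test = y_test + rng.normal(0, 1.0, n)
proxy_test = rng.integers(0, 2, n).astype(float)   # independent of the label now
X_test = np.column_stack([true_signal_test, proxy_test])

print("train accuracy:", model.score(X_train, y_train))  # near 1.0, via the proxy
print("test accuracy:", model.score(X_test, y_test))     # much worse: the learned
                                                          # "goal" was the proxy
```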
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers and the promising directions they push, none of them has yet produced a large-scale, usable version of what they're building. Nevertheless, I'm quite excited about the direction they're working in, and I think more folks should think about how to do this sort of formal verification of generalization. It's a fundamentally difficult problem, but one I expect will eventually be quite possible to succeed at!
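For a flavor of what "verifying" a network's behavior can mean mechanically, here's a minimal sketch (my own illustration, not an implementation from the linked papers) of interval bound propagation: push a whole box of inputs through a tiny ReLU network to get guaranteed output bounds, so a property certified on the box holds for every input in it, not just sampled ones. The weights and the input box below are arbitrary, chosen just for the demo.

```python
# Minimal interval bound propagation (IBP) sketch (my own toy example,
# not code from the linked papers). Certifies an output property over an
# entire input box rather than over finitely many test points.
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    center = (lo + hi) / 2
    radius = (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def interval_relu(lo, hi):
    # ReLU is monotone, so the bounds just pass through it.
    return np.maximum(lo, 0), np.maximum(hi, 0)

# A tiny fixed 2-layer ReLU network (weights chosen arbitrarily for the demo).
W1 = np.array([[1.0, -0.5], [0.3, 0.8]]); b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, -1.2], [-0.4, 0.9]]); b2 = np.array([0.0, 0.05])

# Input box: every x with 0.9 <= x0 <= 1.1 and -0.1 <= x1 <= 0.1.
lo, hi = np.array([0.9, -0.1]), np.array([1.1, 0.1])

lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
lo, hi = interval_affine(lo, hi, W2, b2)

print("output 0 in", (lo[0], hi[0]))
print("output 1 in", (lo[1], hi[1]))
# If the lower bound of output 0 exceeds the upper bound of output 1,
# class 0 provably wins on the whole box.
print("class 0 certified on the box:", lo[0] > hi[1])
```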
I do agree in the abstract that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is how hard that is to achieve.