The same way you’d achieve/check any other generalization, I would think. My model is that the same technical limitations that hold us back from achieving reliable generalization in any other area for LLMs are the ones holding us back in the area of morals. Do you think that’s accurate?
yeah, but goal misgeneralization is an easier misgeneralization than most, and checking generalization is hard. I’ll link some papers in a bit
edit: might not be until tomorrow due to being busy
Okay, getting back to this to drop off some links. There are a few papers on goal misgeneralization; currently a simple Google search finds some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
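To make the failure mode in the links above concrete, here’s a minimal, made-up sketch (mine, not from any of the linked papers): a classifier trained where an easy proxy cue happens to agree with the intended label, then evaluated where that correlation flips. The data setup and every name in it are hypothetical, purely for illustration.

```python
# Toy sketch of goal misgeneralization / proxy-following (illustrative only; not the
# setup from any of the papers linked above). The model learns the clean "proxy" cue
# rather than the noisy "intended" signal, and keeps competently following the proxy
# once the training-time correlation between them is broken.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, proxy_matches_goal):
    intended = rng.integers(0, 2, size=n)        # the label we actually care about
    proxy = intended if proxy_matches_goal else 1 - intended
    X = np.column_stack([
        intended + rng.normal(0, 1.0, size=n),   # noisy "intended" feature
        proxy + rng.normal(0, 0.1, size=n),      # clean, easier "proxy" feature
    ])
    return X, intended

X_train, y_train = make_data(5000, proxy_matches_goal=True)    # proxy == goal
X_shift, y_shift = make_data(5000, proxy_matches_goal=False)   # proxy anti-correlates

model = LogisticRegression().fit(X_train, y_train)
print("in-distribution accuracy: ", model.score(X_train, y_train))  # near 1.0
print("off-distribution accuracy:", model.score(X_shift, y_shift))  # well below chance
```

In distribution everything looks fine; off distribution the model is still doing something capably, just not the thing we wanted, which is roughly the shape of the failure the DeepMind post describes.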
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers and the promising directions they push, they have not yet produced large-scale, usable versions of what they’re building. Nevertheless, I’m quite excited about the direction they’re working in, and I think more folks should think about how to do this sort of formal verification of generalization. It’s a fundamentally difficult problem, but one I expect it will eventually be quite possible to succeed at!
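For a tiny taste of what “verifying generalization” can mean, here’s a sketch of generic interval bound propagation over a hand-made two-layer ReLU net. To be clear, this is not the method of either linked paper, and the network and input box are hypothetical toys; it just shows the shape of the guarantee: a sound bound over an entire input region rather than accuracy on sampled test points.

```python
# Minimal interval bound propagation (IBP) sketch: certify a tiny hand-made ReLU net's
# prediction over an entire input box, not just at sampled points. Illustrative only.
import numpy as np

def affine_bounds(l, u, W, b):
    """Propagate the box [l, u] through x -> W @ x + b."""
    center, radius = (l + u) / 2, (u - l) / 2
    out_center = W @ center + b
    out_radius = np.abs(W) @ radius
    return out_center - out_radius, out_center + out_radius

def relu_bounds(l, u):
    return np.maximum(l, 0), np.maximum(u, 0)

# Hand-made net whose prediction depends only on the sign of x[0].
W1, b1 = np.array([[1.0, 0.0], [-1.0, 0.0]]), np.zeros(2)
W2, b2 = np.eye(2), np.zeros(2)

# Input region to certify: x[0] in [0.5, 1.5], x[1] in [-1, 1].
l, u = np.array([0.5, -1.0]), np.array([1.5, 1.0])

l, u = affine_bounds(l, u, W1, b1)
l, u = relu_bounds(l, u)
l, u = affine_bounds(l, u, W2, b2)

# Sound (if loose) certificate: class 0 wins everywhere in the box if the lower bound
# of logit 0 exceeds the upper bound of logit 1.
print("certified class 0 over the whole box:", l[0] > u[1])  # True for this toy
```

The appeal is the whole-region guarantee; the hard part, as above, is getting anything like this to say something useful about the generalization behavior of a large model.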
I do agree abstractly that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is just how hard that is to achieve.