Okay, getting back to this to drop off some links. There are a few papers on goal misgeneralization; a simple Google search currently turns up some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
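To make the phenomenon the links above discuss a bit more concrete, here's a minimal, hypothetical sketch (my own toy illustration, not taken from any of those papers, and a supervised-learning analogue rather than the RL setting they study): a model trained on data where a spurious proxy feature correlates perfectly with the label "learns" the proxy, then falls apart when that correlation breaks at test time.

```python
# Hypothetical toy illustration of (a supervised analogue of) goal misgeneralization.
# Feature 0 is the weakly informative "true" signal; feature 1 is a spurious proxy
# that happens to track the label perfectly during training.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training distribution: proxy == label.
y_train = rng.integers(0, 2, n)
true_signal = y_train + rng.normal(0, 1.0, n)     # noisy but genuinely informative
proxy = y_train.astype(float)                      # perfectly informative (only here)
X_train = np.column_stack([true_signal, proxy])

model = LogisticRegression().fit(X_train, y_train)

# Test distribution: the proxy no longer tracks the label at all.
y_test = rng.integers(0, 2, n)
true_signal_test = y_test + rng.normal(0, 1.0, n)
proxy_test = rng.integers(0, 2, n).astype(float)   # independent of the label now
X_test = np.column_stack([true_signal_test, proxy_test])

print("train accuracy:", model.score(X_train, y_train))  # near 1.0, via the proxy
print("test accuracy:", model.score(X_test, y_test))     # much worse: the learned
                                                          # "goal" was the proxy
```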
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers and the promising directions they push, none of them has yet produced a large-scale, usable version of what they're building. Nevertheless, I'm quite excited about the direction they're working in, and I think more folks should think about how to do this sort of formal verification of generalization. It's a fundamentally difficult problem, but one I expect will eventually be quite possible to succeed at!
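For a flavor of what "verifying" a network's behavior can mean mechanically, here's a minimal sketch (my own illustration, not an implementation from the linked papers) of interval bound propagation: push a whole box of inputs through a tiny ReLU network to get guaranteed output bounds, so a property certified on the box holds for every input in it, not just sampled ones. The weights and the input box below are arbitrary, chosen just for the demo.

```python
# Minimal interval bound propagation (IBP) sketch (my own toy example,
# not code from the linked papers). Certifies an output property over an
# entire input box rather than over finitely many test points.
import numpy as np

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    center = (lo + hi) / 2
    radius = (hi - lo) / 2
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius
    return new_center - new_radius, new_center + new_radius

def interval_relu(lo, hi):
    # ReLU is monotone, so the bounds just pass through it.
    return np.maximum(lo, 0), np.maximum(hi, 0)

# A tiny fixed 2-layer ReLU network (weights chosen arbitrarily for the demo).
W1 = np.array([[1.0, -0.5], [0.3, 0.8]]); b1 = np.array([0.1, -0.2])
W2 = np.array([[0.7, -1.2], [-0.4, 0.9]]); b2 = np.array([0.0, 0.05])

# Input box: every x with 0.9 <= x0 <= 1.1 and -0.1 <= x1 <= 0.1.
lo, hi = np.array([0.9, -0.1]), np.array([1.1, 0.1])

lo, hi = interval_relu(*interval_affine(lo, hi, W1, b1))
lo, hi = interval_affine(lo, hi, W2, b2)

print("output 0 in", (lo[0], hi[0]))
print("output 1 in", (lo[1], hi[1]))
# If the lower bound of output 0 exceeds the upper bound of output 1,
# class 0 provably wins on the whole box.
print("class 0 certified on the box:", lo[0] > hi[1])
```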
I do agree in the abstract that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is how hard that is to achieve.