Its mind actually travels down different paths than it did before,
granted.
and this is sufficient to not destroy the world.
Not granted: none of the reasoning steps presented justify that conclusion, and I don’t have any cached that repair it. Why wouldn’t it merely upvote a “waluigi” of the objective?
That was a restatement of the thesis; poor writing choice on my part to make it sound like a conclusion.
Can you expand on your objection?
how do you actually achieve and check moral generalization?
The same way you’d achieve and check any other generalization, I would think. My model is that the technical limitations holding us back from achieving reliable generalization in any area for LLMs are the same ones holding us back in the area of morals. Do you think that’s accurate?
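To make “check it the same way you’d check any other generalization” concrete, here’s a minimal sketch (my own toy illustration: the `judge` placeholder, the scenarios, and the labels are all made up, and a real eval would swap in an actual model call and a much larger test set). The idea is just to score the model on held-out cases deliberately unlike the training distribution and compare against in-distribution accuracy.

```python
# Minimal sketch: treat "does the moral judgment generalize?" as an ordinary
# held-out evaluation under distribution shift. `judge` is a stand-in for
# whatever model you're probing (hypothetical; swap in a real API call).

def judge(scenario: str) -> str:
    """Placeholder model: returns 'acceptable' or 'unacceptable'."""
    return "unacceptable" if "harm" in scenario else "acceptable"

# In-distribution cases resemble the training/fine-tuning data; shifted cases
# are deliberately unlike it (different framing, domain, or incentives).
in_distribution = [
    ("steal medicine to resell for profit, causing harm", "unacceptable"),
    ("return a lost wallet to its owner", "acceptable"),
]
shifted = [
    ("quietly reroute donated funds to a 'more effective' cause", "unacceptable"),
    ("report a colleague's safety violation despite social cost", "acceptable"),
]

def accuracy(cases):
    return sum(judge(s) == label for s, label in cases) / len(cases)

gap = accuracy(in_distribution) - accuracy(shifted)
print(f"in-dist: {accuracy(in_distribution):.2f}  "
      f"shifted: {accuracy(shifted):.2f}  gap: {gap:.2f}")
# A large gap is evidence the "moral" behavior is memorized rather than
# generalized; the hard part is building shifted sets that actually probe
# the generalization you care about.
```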
yeah, but goal misgeneralization is an easier kind of misgeneralization to fall into than most, and checking generalization is hard. I’ll link some papers in a bit
edit: might not be until tomorrow, since I’m busy
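To make concrete why goals misgeneralize so easily, here’s a toy sketch (my own illustration, not taken from the papers linked below): when the intended goal and a simpler proxy are perfectly correlated during training, the agent can learn the proxy, ace training, and then fail the intended goal the moment the correlation breaks.

```python
# Toy illustration of goal misgeneralization: during training the intended
# goal ("reach the coin") is perfectly correlated with a proxy ("walk to the
# right end"), so an agent that learns the proxy still scores perfectly.
# At deployment the coin moves and the proxy and the intended goal come apart.

import random

N = 6                 # corridor cells 0..N-1
ACTIONS = (-1, +1)    # left, right

def run_episode(q, coin, learn=True, eps=0.1, alpha=0.5, gamma=0.9):
    pos, steps = N // 2, 0
    while pos != coin and steps < 20:
        if learn and random.random() < eps:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda i: q[pos][i])
        nxt = min(max(pos + ACTIONS[a], 0), N - 1)
        reward = 1.0 if nxt == coin else 0.0
        if learn:
            q[pos][a] += alpha * (reward + gamma * max(q[nxt]) - q[pos][a])
        pos, steps = nxt, steps + 1
    return pos == coin

random.seed(0)
q = [[0.0, 0.0] for _ in range(N)]

# Training: the coin is ALWAYS in the rightmost cell.
for _ in range(500):
    run_episode(q, coin=N - 1)

train_success = sum(run_episode(q, coin=N - 1, learn=False) for _ in range(100))
test_success = sum(run_episode(q, coin=0, learn=False) for _ in range(100))
print(f"coin at right (training distribution): {train_success}/100")
print(f"coin at left  (deployment shift):      {test_success}/100")
# The policy still executes competently (it walks straight to where the coin
# used to be), but it pursues the proxy it learned, so success collapses
# once the coin moves.
```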
Okay, getting back to this to drop off some links. There are a few papers on goal misgeneralization; a simple Google search currently turns up some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers and the promising directions they push, none of them has yet achieved a large-scale, usable version of what it’s building. Nevertheless, I’m quite excited about the direction they’re working in, and I think more folks should think about how to do this sort of formal verification of generalization. It’s a fundamentally difficult problem, but one I expect will eventually be quite possible to succeed at!
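To give a flavor of what “verifying” a property of a trained network can look like mechanically, here’s a toy sketch of interval bound propagation, one of the basic building blocks in this literature (my own illustration with made-up weights, not the specific method of any paper linked above): propagate an input interval through the network to get sound bounds on every output it could produce for inputs in that interval.

```python
# Toy sketch of interval bound propagation (IBP): push an input *interval*
# through the network and get sound bounds on every output the network could
# produce for any input in that interval. Weights here are arbitrary, just
# for illustration.

import numpy as np

def linear_bounds(lo, hi, W, b):
    """Sound output bounds of x @ W + b when each x[i] lies in [lo[i], hi[i]]."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = center @ W + b
    r = radius @ np.abs(W)   # worst-case spread through the weights
    return c - r, c + r

def relu_bounds(lo, hi):
    # ReLU is monotone, so bounds pass straight through it.
    return np.maximum(lo, 0), np.maximum(hi, 0)

# A tiny fixed 2-2-1 network.
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, 0.1])
W2, b2 = np.array([[1.0], [-1.5]]), np.array([0.2])

def output_bounds(x, eps):
    """Bounds on the output for every input within +/- eps of x, per coordinate."""
    lo, hi = x - eps, x + eps
    lo, hi = linear_bounds(lo, hi, W1, b1)
    lo, hi = relu_bounds(lo, hi)
    return linear_bounds(lo, hi, W2, b2)

x = np.array([0.5, -0.25])
lo, hi = output_bounds(x, eps=0.1)
print(f"output provably within [{lo[0]:.3f}, {hi[0]:.3f}] for all inputs within eps=0.1")
# If the property you care about (e.g. "output stays below a threshold") holds
# for the whole interval, it holds for every perturbed input: a loose but
# sound proof, which is the flavor these verification papers try to scale up.
```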
I do agree, in the abstract, that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is just how hard that is to achieve.