Its mind actually travels down different paths than it did before,
granted.
and this is sufficient to not destroy the world.
Not granted: none of the reasoning steps presented justify that conclusion, and I don’t have any cached that repair it. Why wouldn’t it merely upvote a “waluigi” of the objective?
That was a restatement of the thesis; poor writing choice on my part to make it sound like a conclusion.
Can you expand on your objection?
how do you actually achieve and check moral generalization?
The same way you’d achieve and check any other generalization, I would think. My model is that the technical limitations holding us back from achieving reliable generalization in any area for LLMs are the same ones holding us back in the area of morals. Do you think that’s accurate?
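To make “check it the same way you’d check any other generalization” concrete, here’s a minimal sketch (my own toy illustration: the `judge` placeholder, the scenarios, and the labels are all made up, and a real eval would swap in an actual model call and a much larger test set). The idea is just to score the model on held-out cases deliberately unlike the training distribution and compare against in-distribution accuracy.

```python
# Minimal sketch: treat "does the moral judgment generalize?" as an ordinary
# held-out evaluation under distribution shift. `judge` is a stand-in for
# whatever model you're probing (hypothetical; swap in a real API call).

def judge(scenario: str) -> str:
    """Placeholder model: returns 'acceptable' or 'unacceptable'."""
    return "unacceptable" if "harm" in scenario else "acceptable"

# In-distribution cases resemble the training/fine-tuning data; shifted cases
# are deliberately unlike it (different framing, domain, or incentives).
in_distribution = [
    ("steal medicine to resell for profit, causing harm", "unacceptable"),
    ("return a lost wallet to its owner", "acceptable"),
]
shifted = [
    ("quietly reroute donated funds to a 'more effective' cause", "unacceptable"),
    ("report a colleague's safety violation despite social cost", "acceptable"),
]

def accuracy(cases):
    return sum(judge(s) == label for s, label in cases) / len(cases)

gap = accuracy(in_distribution) - accuracy(shifted)
print(f"in-dist: {accuracy(in_distribution):.2f}  "
      f"shifted: {accuracy(shifted):.2f}  gap: {gap:.2f}")
# A large gap is evidence the "moral" behavior is memorized rather than
# generalized; the hard part is building shifted sets that actually probe
# the generalization you care about.
```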
yeah, but goal misgeneralization is an easier kind of misgeneralization to fall into than most, and checking generalization is hard. I’ll link some papers in a bit
edit: might not be until tomorrow, since I’m busy
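To make concrete why goals misgeneralize so easily, here’s a toy sketch (my own illustration, not taken from the papers linked below): when the intended goal and a simpler proxy are perfectly correlated during training, the agent can learn the proxy, ace training, and then fail the intended goal the moment the correlation breaks.

```python
# Toy illustration of goal misgeneralization: during training the intended
# goal ("reach the coin") is perfectly correlated with a proxy ("walk to the
# right end"), so an agent that learns the proxy still scores perfectly.
# At deployment the coin moves and the proxy and the intended goal come apart.

import random

N = 6                 # corridor cells 0..N-1
ACTIONS = (-1, +1)    # left, right

def run_episode(q, coin, learn=True, eps=0.1, alpha=0.5, gamma=0.9):
    pos, steps = N // 2, 0
    while pos != coin and steps < 20:
        if learn and random.random() < eps:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda i: q[pos][i])
        nxt = min(max(pos + ACTIONS[a], 0), N - 1)
        reward = 1.0 if nxt == coin else 0.0
        if learn:
            q[pos][a] += alpha * (reward + gamma * max(q[nxt]) - q[pos][a])
        pos, steps = nxt, steps + 1
    return pos == coin

random.seed(0)
q = [[0.0, 0.0] for _ in range(N)]

# Training: the coin is ALWAYS in the rightmost cell.
for _ in range(500):
    run_episode(q, coin=N - 1)

train_success = sum(run_episode(q, coin=N - 1, learn=False) for _ in range(100))
test_success = sum(run_episode(q, coin=0, learn=False) for _ in range(100))
print(f"coin at right (training distribution): {train_success}/100")
print(f"coin at left  (deployment shift):      {test_success}/100")
# The policy still executes competently (it walks straight to where the coin
# used to be), but it pursues the proxy it learned, so success collapses
# once the coin moves.
```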
Okay, getting back to this to drop off some links. There are a few papers on goal misgeneralization; a simple Google search currently turns up some good summaries:
goal misgeneralization:
https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924
see also related results on https://www.google.com/search?q=goal+misgeneralization
see also a bunch of related papers on https://metaphor.systems/search?q=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
see also related papers on https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2210.01790
related: https://www.lesswrong.com/posts/dkjwSLfvKwpaQSuWo/misgeneralization-as-a-misnomer
related: https://www.lesswrong.com/posts/DiEWbwrChuzuhJhGr/benchmark-goal-misgeneralization-concept-extrapolation
verifying generalization:
https://arxivxplorer.com/?query=Verifying+Generalization+in+Deep+Learning → https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2302.05745
https://arxivxplorer.com/?query=https%3A%2F%2Farxiv.org%2Fabs%2F2301.02288
Note that, despite the exciting names of some of these papers and the promising directions they push, none of them has yet achieved a large-scale, usable version of what it’s building. Nevertheless, I’m quite excited about the direction they’re working in, and I think more folks should think about how to do this sort of formal verification of generalization. It’s a fundamentally difficult problem, but one I expect will eventually be quite possible to succeed at!
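To give a flavor of what “verifying” a property of a trained network can look like mechanically, here’s a toy sketch of interval bound propagation, one of the basic building blocks in this literature (my own illustration with made-up weights, not the specific method of any paper linked above): propagate an input interval through the network to get sound bounds on every output it could produce for inputs in that interval.

```python
# Toy sketch of interval bound propagation (IBP): push an input *interval*
# through the network and get sound bounds on every output the network could
# produce for any input in that interval. Weights here are arbitrary, just
# for illustration.

import numpy as np

def linear_bounds(lo, hi, W, b):
    """Sound output bounds of x @ W + b when each x[i] lies in [lo[i], hi[i]]."""
    center, radius = (lo + hi) / 2, (hi - lo) / 2
    c = center @ W + b
    r = radius @ np.abs(W)   # worst-case spread through the weights
    return c - r, c + r

def relu_bounds(lo, hi):
    # ReLU is monotone, so bounds pass straight through it.
    return np.maximum(lo, 0), np.maximum(hi, 0)

# A tiny fixed 2-2-1 network.
W1, b1 = np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, 0.1])
W2, b2 = np.array([[1.0], [-1.5]]), np.array([0.2])

def output_bounds(x, eps):
    """Bounds on the output for every input within +/- eps of x, per coordinate."""
    lo, hi = x - eps, x + eps
    lo, hi = linear_bounds(lo, hi, W1, b1)
    lo, hi = relu_bounds(lo, hi)
    return linear_bounds(lo, hi, W2, b2)

x = np.array([0.5, -0.25])
lo, hi = output_bounds(x, eps=0.1)
print(f"output provably within [{lo[0]:.3f}, {hi[0]:.3f}] for all inputs within eps=0.1")
# If the property you care about (e.g. "output stays below a threshold") holds
# for the whole interval, it holds for every perturbed input: a loose but
# sound proof, which is the flavor these verification papers try to scale up.
```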
I do agree, in the abstract, that the difficulty is how to be sure that arbitrarily intense capability boosts retain the moral generalization. The problem is just how hard that is to achieve.