Seems like this means the takeaway has to be that in weird circumstances you can misgeneralize in ways that retain surprisingly large amounts of competence, but that this isn’t the default in most situations.
Sure, I endorse that conclusion today, when systems aren’t particularly general / competent. I don’t endorse that conclusion for the future, when systems will predictably become more general / competent.
(And if you take language models and put them in weird circumstances, they still look competent on some axes; they’re just weird enough that we had trouble attributing any simple goal to them.)
I’m not sure I understand what you mean by empowerment and purpose as they relate to language models. Can you say it a different way?
Empowerment as in the ability to control an environment; I just wanted to use a different term of art because it felt more appropriate. Even though it isn’t evaluated directly, empowerment is the part of capability we actually care about, is it not?
And by purpose I simply meant goal.
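(For concreteness, “empowerment” does have a standard information-theoretic formalization, due to Klyubin, Polani, and Nehaniv: the n-step empowerment of a state is the channel capacity from the agent’s action sequences to the state they lead to. A sketch of that definition, assuming this is the intended sense:

$$\mathfrak{E}(s_t) \;=\; \max_{p(a_t,\dots,a_{t+n-1})} I\big(A_t,\dots,A_{t+n-1};\, S_{t+n} \,\big|\, s_t\big)$$

High empowerment means many distinguishable futures are reachable from where you stand, which is why it can serve as a stand-in for “ability to control an environment.”)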
I understand that part, but I’m not seeing what you mean by empowerment being reliable but purpose being confusing, and why language models are an exception to that.
The generative modeling objective applied to human datasets only produces empowerment-relevant behavior because that behavior correlates with predictive accuracy; a reinforcement learning objective applied to the same dataset will still learn the convergent empowerment capability well, but because the reward signal is relatively sparse, the model will fit whatever goal-shaped thing happens to be going on at the time.
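(To make the density contrast concrete, here is a minimal sketch assuming a generic autoregressive model whose forward pass returns next-token logits; `model`, `tokens`, and `episode_reward` are illustrative placeholders, not anything from the discussion:

```python
import torch
import torch.nn.functional as F

def generative_loss(model, tokens):
    """Dense signal: one cross-entropy term per token position."""
    logits = model(tokens[:, :-1])               # (batch, seq-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # (batch*(seq-1), vocab)
        tokens[:, 1:].reshape(-1),               # (batch*(seq-1),)
    )

def rl_loss(model, tokens, episode_reward):
    """Sparse signal: a single scalar reward per trajectory
    (REINFORCE-style), so per-token credit assignment is implicit."""
    logits = model(tokens[:, :-1])
    logp = F.log_softmax(logits, dim=-1)
    chosen = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return -(episode_reward * chosen.sum(dim=-1)).mean()
```

The generative loss delivers a gradient signal at every token position, while the REINFORCE-style loss collapses the whole trajectory into one scalar, which is exactly what lets the model fit whatever happens to correlate with that scalar.)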
In general, it seems like the thing all of the example situations have in common is much sparser feedback from anything approaching a true objective.
Situations where it’s obvious how to assemble steps to get things, but confusing which results of the different combinations are the ones you really want, are ones where it’s hard to be sure the feedback has pushed into the correct dimensions. Or something like that.
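(The misgeneralization point from the top of this thread falls out of that: under sparse feedback, multiple purposes can be indistinguishable on the training distribution. A toy illustration with made-up event names:

```python
# Two candidate goals that agree on every training trajectory but
# diverge off-distribution; sparse feedback cannot tell them apart.
def reward_coin(traj):
    return traj.count("coin")            # the intended goal
def reward_right(traj):
    return traj.count("reached_right")   # a proxy that co-occurred with it

train = [["reached_right", "coin"], ["reached_right", "coin"]]
assert all(reward_coin(t) == reward_right(t) for t in train)

shifted = ["coin", "reached_left"]       # the coin moved
print(reward_coin(shifted), reward_right(shifted))  # 1 0
```

Both reward functions fit the training feedback perfectly, so nothing pushed learning toward the dimension that actually mattered.)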