I understand that part, but I’m not seeing what you mean by empowerment being reliable but purpose being confusing, and why language models are an exception to that.
The generative modeling objective, applied to human datasets, only produces behavior that causes empowerment because such behavior correlates with behavior that improves accuracy; a reinforcement learning objective applied to the same dataset will still learn the convergent empowerment capability well, but because the reward signal is comparatively sparse, the model will fit whatever happens to be going on at the time.
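To make the density contrast concrete, here's a toy sketch. Everything in it is hypothetical (a stand-in model, random tokens, a fabricated scalar reward); it just counts how many feedback terms each objective extracts from one trajectory:

```python
# a minimal sketch of the feedback-density contrast, assuming a toy
# PyTorch setup; the model, data, and reward are all made up.
import torch
import torch.nn.functional as F

vocab, d, T = 100, 32, 64                 # vocab size, hidden dim, sequence length
model = torch.nn.Sequential(              # stand-in for any autoregressive model
    torch.nn.Embedding(vocab, d),
    torch.nn.Linear(d, vocab),
)
tokens = torch.randint(vocab, (1, T))     # one "human behavior" trajectory

# generative modeling: every one of the T-1 next-token predictions is graded,
# so each step of the behavior gets its own error signal.
logits = model(tokens[:, :-1])
dense_loss = F.cross_entropy(
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)                                         # T-1 = 63 feedback terms per sequence

# reinforcement learning on the same data: the whole trajectory is graded by
# a single scalar, so the gradient can only say "more/less of whatever
# happened to be going on", not which step was right for which reason.
logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
logp_traj = logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()
reward = torch.tensor(1.0)                # 1 feedback term per sequence
sparse_loss = -reward * logp_traj         # REINFORCE-style credit assignment
```

Same data, same model, but one objective hands back ~63x as many constraints per trajectory as the other.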
In general, the thing all of the example situations seem to have in common is much less dense feedback from anything approaching a true objective.
Situations where it's obvious how to assemble steps to get things, but confusing which of the resulting combinations you actually want, are ones where it's hard to be sure the feedback has pushed you along the correct dimensions. Or something.