IMO, the important crux is whether we really need to secure the reward function against wireheading/tampering. If an RL algorithm optimizes for the reward, you need much more security and much more robust reward functions than in the case where RL algorithms don't optimize for the reward, because optimization amplifies both problems and solutions.
Ah yes. I agree that the wireheading question deserves more thought. I’m not confident that my answer to wireheading applies to the types of AI we’ll actually build—I haven’t thought about it enough.
FWIW, the two papers I cited are secondary research, so they branch directly into a massive amount of neuroscience research that indirectly bears on the question in mammalian brains. None of it, as far as I can think of, directly addresses the question of whether reward is the optimization target for humans. I'm not sure how you'd empirically test this.
I do think it's pretty clear that some types of smart, model-based RL agents would optimize for reward. Those are the ones that a) choose actions based on the highest estimated sum of future rewards (like humans seem to, very very approximately), and b) are smart enough to estimate future rewards fairly accurately.
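For concreteness, here's a minimal sketch of that kind of action selection: simulate each candidate action in a learned world model and pick whichever has the highest estimated sum of future rewards. The `world_model` and `reward_estimate` functions are hypothetical stand-ins for whatever the agent has learned, not any particular implementation.

```python
import random

def choose_action(state, actions, world_model, reward_estimate, horizon=5):
    """Pick the action whose simulated rollout has the highest estimated
    sum of future rewards. `world_model(state, action) -> next_state` and
    `reward_estimate(state) -> float` are assumed learned components."""
    def rollout_return(action):
        s = world_model(state, action)
        total = reward_estimate(s)
        for _ in range(horizon - 1):
            # Crude continuation policy: random actions for the rest of the rollout.
            s = world_model(s, random.choice(actions))
            total += reward_estimate(s)
        return total

    return max(actions, key=rollout_return)
```

An agent like this is literally optimizing its own estimate of reward, which is why the security of the reward function matters so directly for it.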
LLMs with RLHF/RLAIF may be the relevant case. They are model-free by TurnTrout's definition, and I'm happy to accept his use of the terminology. But they do have a powerful critic component (at least in training; I'm not sure about deployment, but probably there too), so it seems possible that it might develop a highly general representation of "stuff that gives the system rewards". I'm not worried about that, because I think that will happen long after we've given them agentic goals, and long after they've developed a representation of "stuff humans reward me for doing", which could be mis-specified enough to lead to doom if it were the only factor.
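For reference, the "critic component" here is the kind of value head used in PPO-style RLHF training: the LM acts as the policy, and a small head on top of its hidden states estimates expected future reward for each prefix. A minimal sketch, assuming a `base_lm` that returns hidden states (illustrative shapes, not any specific library's API):

```python
import torch.nn as nn

class LMWithValueHead(nn.Module):
    """Toy actor-critic setup of the kind used in PPO-based RLHF: the LM
    proposes tokens (actor) while a small value head (critic) estimates
    expected future reward for each prefix."""

    def __init__(self, base_lm, hidden_size, vocab_size):
        super().__init__()
        self.base_lm = base_lm                             # assumed: returns [batch, seq, hidden]
        self.lm_head = nn.Linear(hidden_size, vocab_size)  # actor: next-token logits
        self.value_head = nn.Linear(hidden_size, 1)        # critic: scalar reward estimate

    def forward(self, input_ids):
        hidden = self.base_lm(input_ids)
        logits = self.lm_head(hidden)                      # policy over next tokens
        values = self.value_head(hidden).squeeze(-1)       # estimated future reward per position
        return logits, values
```

Whether anything critic-like carries over into the deployed model is exactly the uncertainty flagged above; the sketch is just to pin down what "critic component" means during training.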