gwern comments on When is reward ever the optimization target?

gwern 17 Oct 2024 23:41 UTC
9 points
1

> I guess LLMs are model-free, so that’s relevant

FWIW, I strongly disagree with this claim. I believe they are model-based, with the usual datasets & training approaches, even before RLHF/RLAIF.
- Garrett Baker 19 Oct 2024 4:32 UTC
  13 points
  8
  What do you mean by “model-based”?
- Seth Herd 20 Oct 2024 19:10 UTC
  4 points
  2
  Interesting. There’s certainly a lot going on in there, and some of it very likely is at least vague models of future word occurrences (and corresponding events). The definition of model-based gets pretty murky outside of classic RL, so it’s probably best to just directly discuss what model properties give rise to what behavior, e.g. optimizing for reward.
  
  Model-free systems can produce goal-directed behavior. They do this if they have seen some relevant behavior that achieves a given goal, their input or some internal representation includes the current goal, and they can generalize well enough to apply what they’ve experienced to the current context. (This is by the neuroscience definition of habitual vs. goal-directed: behavior changes to follow the current goal, usually being hungry or thirsty, or not.)
  
  So if they’re strong enough generalizers, I think even a model-free system actually optimizes for reward.
  
  I think the claim should be stronger: for a smart enough RL system, reward is the optimization target.
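
The point above about model-free systems following the current goal can be illustrated with a toy sketch. It is tabular Q-learning on a 5-cell corridor, where the current goal is part of the agent's input. Everything here is an assumption made for illustration and not from the thread: the corridor, the `water`/`food` goal names, and the hyperparameters. The agent stores only cached action values and never learns or queries a transition model, yet its behavior switches when the goal input switches. A lookup table does not generalize, so this shows only the "goal is in the input" condition, not the generalization condition.

```python
import random

N = 5                                   # corridor cells 0..4 (hypothetical toy world)
GOALS = {"water": 0, "food": N - 1}     # hypothetical goal locations
ACTIONS = (-1, +1)                      # step left / step right


def step(pos, action):
    """Environment dynamics. The agent never sees or learns this function."""
    return min(max(pos + action, 0), N - 1)


def greedy_action(q, pos, goal):
    """Pick the action with the highest cached value for (pos, goal)."""
    return max(ACTIONS, key=lambda a: q.get((pos, goal, a), 0.0))


def train(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Model-free Q-learning where the goal is part of the state key."""
    rng = random.Random(seed)
    q = {}                              # (pos, goal, action) -> cached value
    for _ in range(episodes):
        goal = rng.choice(sorted(GOALS))
        pos = rng.randrange(N)
        for _ in range(20):
            if pos == GOALS[goal]:
                break
            # Epsilon-greedy exploration over cached values.
            if rng.random() < eps:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, pos, goal)
            nxt = step(pos, action)
            done = nxt == GOALS[goal]
            reward = 1.0 if done else 0.0
            best_next = 0.0 if done else max(
                q.get((nxt, goal, a), 0.0) for a in ACTIONS
            )
            old = q.get((pos, goal, action), 0.0)
            # Temporal-difference update: no planning, no model rollouts.
            q[(pos, goal, action)] = old + alpha * (reward + gamma * best_next - old)
            pos = nxt
    return q


q = train()
for goal in sorted(GOALS):
    print(goal, [greedy_action(q, pos, goal) for pos in range(N)])
```

From the same middle cell, the greedy action should point toward whichever goal is currently supplied, which is the "behavior changes to follow the current goal" criterion applied to a system with no world model.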