FWIW, I strongly disagree with this claim. I believe they are model-based, with the usual datasets & training approaches, even before RLHF/RLAIF.
What do you mean by “model-based”?
Interesting. There’s certainly a lot going on in there, and some of it very likely is at least vague models of future word occurrences (and corresponding events). The definition of model-based gets pretty murky outside of classic RL, so it’s probably best to just directly discuss what model properties give rise to what behavior, e.g. optimizing for reward.
Model-free systems can produce goal-directed behavior. They do this if they have seen some relevant behavior that achieves a given goal, their input or some internal representation includes the current goal, and they can generalize well enough to apply what they’ve experienced to the current context. (This is by the neuroscience definition of habitual vs. goal-directed: behavior changes to follow the current goal, usually hungry, thirsty, or neither.)
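To make the point about goal-directed behavior from a model-free system concrete, here is a rough toy sketch, not anything from a real experiment. It assumes a made-up five-cell corridor with food at one end and water at the other, and a tabular Q-learner whose state includes the current goal (hungry or thirsty). All names (`step`, `train`, `greedy_action`, the constants) are hypothetical. Because it is tabular, it only illustrates the "goal is part of the input" condition, not the generalization condition.

```python
import random

N_POS = 5                 # corridor cells 0..4; food at cell 0, water at cell 4 (assumed layout)
START = 2
LEFT, RIGHT = 0, 1
HUNGRY, THIRSTY = 0, 1    # the "current goal", fed to the agent as part of its state

def step(pos, action, goal):
    """Hypothetical toy environment: move one cell; reward only the outcome matching the goal."""
    new_pos = pos - 1 if action == LEFT else pos + 1
    done = new_pos in (0, N_POS - 1)
    reached_food = new_pos == 0
    reward = 0.0
    if done and ((goal == HUNGRY and reached_food) or (goal == THIRSTY and not reached_food)):
        reward = 1.0
    return new_pos, reward, done

def train(episodes=3000, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # Q is indexed by (position, goal, action). The goal is an input,
    # but no transition model is ever learned or consulted.
    q = {(p, g, a): 0.0
         for p in range(N_POS) for g in (HUNGRY, THIRSTY) for a in (LEFT, RIGHT)}
    for _ in range(episodes):
        goal = rng.choice((HUNGRY, THIRSTY))
        pos, done = START, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice((LEFT, RIGHT))      # explore
            else:
                action = max((LEFT, RIGHT), key=lambda a: q[(pos, goal, a)])
            new_pos, reward, done = step(pos, action, goal)
            # Standard one-step Q-learning target (model-free TD update).
            best_next = 0.0 if done else max(q[(new_pos, goal, a)] for a in (LEFT, RIGHT))
            target = reward + gamma * best_next
            q[(pos, goal, action)] += alpha * (target - q[(pos, goal, action)])
            pos = new_pos
    return q

def greedy_action(q, pos, goal):
    return max((LEFT, RIGHT), key=lambda a: q[(pos, goal, a)])

q = train()
print("hungry ->", "left" if greedy_action(q, START, HUNGRY) == LEFT else "right")
print("thirsty ->", "left" if greedy_action(q, START, THIRSTY) == LEFT else "right")
```

After training, flipping only the goal input should flip the chosen direction from the same starting cell, with no planning involved. A function approximator in place of the table would be needed to say anything about generalizing to unseen contexts.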
So if they’re strong enough generalizers, I think even a model-free system actually optimizes for reward.
I think the claim should be stronger: for a smart enough RL system, reward is the optimization target.