FWIW I upvoted but disagree with the end part (hurray for more nuance in voting!)
> I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI
At least from my epistemic position there looks to be an explanation/communication gap here: I don’t think we can be as confident of this. To me the claim seems to preclude ‘creative’ forward-looking exploratory behaviour and model-based planning, which have more of a probingness and less of a merely-antecedent-computation-reinforcingness. But I see other comments from you here that talk about foresighted exploration (and foresighted non-exploration!), and I know you’ve written about these things at length. How are you squaring/nuancing these things? (Silence or a link to an already-written post will not be deemed rude.)
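
To gesture concretely at the contrast I have in mind, here is a toy sketch; the tabular environment, the bare REINFORCE-style update, and the one-step lookahead planner are all my own illustrative constructions, not a claim about what scaled-up RL actually looks like. In the first case, reward directly scales a gradient step on the log-probability of whatever action was just taken, i.e. it reinforces the antecedent computation; in the second, the action comes out of simulating an (assumed-known) model forward, and nothing gets reinforced after the fact.

```python
# Toy sketch, purely illustrative: contrasting
# (1) reward-as-antecedent-computation-reinforcer (a bare REINFORCE-style update)
# (2) forward-looking model-based planning (action chosen by simulating a model).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 3, 2

# Made-up known dynamics and reward, just to have something to run.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a] = next-state dist
R = rng.normal(size=(N_STATES, N_ACTIONS))                        # R[s, a] = expected reward

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# (1) Antecedent-computation-reinforcement: reward gates a gradient step
#     on the log-probability of the action that was *already taken*.
theta = np.zeros((N_STATES, N_ACTIONS))  # policy logits

def reinforce_step(s, lr=0.1):
    probs = softmax(theta[s])
    a = rng.choice(N_ACTIONS, p=probs)
    r = R[s, a] + rng.normal(scale=0.1)
    # Reward strengthens (or weakens) whatever computation produced `a`.
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta[s] += lr * r * grad_logp
    return a, r

# (2) Model-based planning: pick an action by rolling the model forward;
#     the reward signal here selects an action, it does not reinforce a past computation.
def plan_one_step_lookahead(s, gamma=0.9):
    v_next = R.max(axis=1)          # myopic value of each next state, for illustration
    q = R[s] + gamma * P[s] @ v_next
    return int(np.argmax(q))

for _ in range(5):
    s = rng.integers(N_STATES)
    print("state", s,
          "| reinforce picked", reinforce_step(s)[0],
          "| planner picked", plan_one_step_lookahead(s))
```

Of course a real system could blend these (say, planner rollouts generating targets that then get reinforced), which is roughly where my uncertainty about the quoted claim lives.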