(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which ‘reward is not the optimization target’, and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding is that ‘reward is not the optimization target’ applies whenever the assumptions of the policy improvement theorem don’t hold. In particular, the theorem (each policy improvement step yields a policy at least as good as the previous one, and strictly better unless it is already optimal, so iterating converges to the optimal, reward-maximizing policy) assumes that at each step we update the policy π by greedy one-step lookahead, i.e. by taking the argmax over actions of q_π(s,a) in every state.
And this basically doesn’t hold in real life, because realistic RL agents aren’t forced to explore all states. The classic example: “I could explore the state of doing cocaine, and I’m sure my policy would change drastically in a way that my reward circuit considers an improvement, but I don’t have to do that.” So my opinion that the circumstances under which ‘reward is the optimization target’ are very narrow remains unchanged, and I’m interested in why you believe otherwise.
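To make the contrast concrete, here is a minimal toy sketch (my own illustration, not something from Sutton & Barto). The two-state MDP, its rewards, and every name in it (`policy_iteration`, `greedy_agent_without_exploration`, `HOME`, `HIGH`) are assumptions made up for this example. The first half runs textbook policy iteration with exact q_π over all states; the second half is an agent that only updates on transitions it actually experiences and never explores.

```python
GAMMA = 0.9
HOME, HIGH = 0, 1        # hypothetical states: ordinary life vs. a high-reward state
STAY, SWITCH = 0, 1      # hypothetical actions
NEXT = [[HOME, HIGH], [HIGH, HOME]]     # NEXT[s][a]: deterministic next state
REWARD = [[1.0, 0.0], [10.0, 0.0]]      # REWARD[s][a]: immediate reward


def q_value(v, s, a):
    """One-step lookahead: q(s, a) = r(s, a) + gamma * v(s')."""
    return REWARD[s][a] + GAMMA * v[NEXT[s][a]]


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation over ALL states (requires the full model)."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        v = [q_value(v, s, policy[s]) for s in (HOME, HIGH)]
    return v


def greedy(v):
    """Policy improvement: argmax over actions of q(s, a) in every state."""
    return [max((STAY, SWITCH), key=lambda a: q_value(v, s, a))
            for s in (HOME, HIGH)]


def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        improved = greedy(evaluate(policy))
        if improved == policy:
            return policy
        policy = improved


def greedy_agent_without_exploration(steps=1000, alpha=0.5):
    """Q-learning-style updates, but only on experienced transitions and with
    no exploration (ties are broken toward STAY, the first action)."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    s = HOME
    for _ in range(steps):
        a = max((STAY, SWITCH), key=lambda act: q[s][act])
        s_next = NEXT[s][a]
        target = REWARD[s][a] + GAMMA * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
        s = s_next
    return q


if __name__ == "__main__":
    print(policy_iteration([STAY, STAY]))       # [1, 0]: SWITCH at HOME, STAY at HIGH
    print(greedy_agent_without_exploration())   # the SWITCH entry at HOME is never updated
```

In this toy setup, policy iteration ends up heading for the high-reward state, because the greedy step gets to see q_π(s,a) for every state and action. The second agent never tries SWITCH, so its estimate for that action stays at its initial value and its behavior never moves toward the higher-reward state. This only illustrates the shape of the argument under these made-up assumptions; it says nothing about how any particular real system behaves.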