And it also has some issues, e.g. claiming that reward is the optimization target.
That’s not a problem, because reward is the optimization target. Sutton & Barto literally start with bandits! Do you want to argue that reward is not the optimization target for bandits? Because even Turntrout doesn’t try to argue that. ‘Reward is not the optimization target’ is inapplicable to most of Sutton & Barto (and if you read Sutton & Barto, it might be clearer to you how narrow the circumstances are under which ‘reward is not the optimization target’ holds, and why they don’t apply to most AI systems right now or in the foreseeable future).
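To make the bandit point concrete, here is a minimal sketch (my own illustration, not code from the book) of an ε-greedy agent on a k-armed bandit, as in the opening bandit chapter of Sutton & Barto: the value estimates are running means of observed reward, and action selection argmaxes those estimates, so reward is directly the quantity being optimized.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=1000, epsilon=0.1, seed=0):
    """Illustrative k-armed bandit agent (hypothetical example, not the book's code)."""
    rng = random.Random(seed)
    k = len(arm_means)
    q = [0.0] * k    # estimated value (expected reward) of each arm
    n = [0] * k      # pull counts
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore a random arm
        else:
            a = max(range(k), key=lambda i: q[i])   # exploit: argmax of estimated reward
        r = rng.gauss(arm_means[a], 1.0)            # sample a noisy reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                   # incremental sample-average update
        total_reward += r
    return q, total_reward

estimates, total = epsilon_greedy_bandit([0.2, 0.5, 1.0])
```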
Turntrout’s review is correct when he says that it’s probably still the best RL textbook out there.
I would disagree with his claim that SARSA isn’t worth learning: you should at least read about it, even if you don’t implement it or do the Sutton & Barto exercises, so that you understand better how the different methods work and get a better feel for the range of possible RL agents, how they ‘think’, and how to do things like train animals/children. For example, SARSA acts fundamentally differently from Q-learning in AI-safety scenarios, such as whether it would try to manipulate human overseers to avoid being turned off. That’s good to know now, as you try to think about non-toy agents: is an LLM motivated to manipulate human overseers...?
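To make that difference concrete, here is a hedged sketch of the two tabular update rules (names like `Q`, `alpha`, `gamma` are my own, not the book’s pseudocode). SARSA is on-policy: it bootstraps from the action the agent actually takes next, so exploration and external interruptions show up in its value estimates. Q-learning is off-policy: it bootstraps from the greedy action regardless of what actually happens.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD target: uses the action a_next the agent will actually take,
    # so the learned values reflect the behavior policy, exploration noise and all.
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target: uses the best available action in s_next,
    # regardless of what the behavior policy (or an overseer) actually does.
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```

That one-line difference in the bootstrap target is what produces the diverging behavior in the book’s cliff-walking example, and it is the same on-policy/off-policy distinction that the interruption/overseer point above turns on.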
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow the circumstances are under which ‘reward is not the optimization target’ holds, and why they don’t apply to most AI systems right now or in the foreseeable future
Can you explain this part a bit more?
My understanding is that the situations in which ‘reward is not the optimization target’ are those where the assumptions of the policy improvement theorem don’t hold. In particular, the theorem (that iterating the policy improvement step yields strictly better policies and converges to the optimal, reward-maximizing policy) assumes that at each step we update the policy π by greedy one-step lookahead (argmaxing the action via q_π(s, a)).
And this basically doesn’t hold in real life, because realistic RL agents aren’t forced to explore all states (the classic example: “I could explore the state of doing cocaine, and I’m sure my policy would drastically change in a way that my reward circuit considers an improvement, but I don’t have to do that”). So my view that the circumstances under which ‘reward is the optimization target’ are very narrow remains unchanged, and I’m interested in why you believe otherwise.
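To spell out the step I mean, here is a minimal sketch of the greedy one-step-lookahead improvement for a finite MDP with a known model (the `P[s][a] -> list of (prob, next_state, reward)` data structure is my own illustration, not the book’s notation). The theorem’s guarantee rests on this argmax being evaluated, with accurate q_π values, in every state; an agent that is never forced into certain states never gets the data to do that there.

```python
def greedy_improvement(P, V, gamma=0.99):
    """One round of greedy policy improvement given a value estimate V (illustrative)."""
    policy = {}
    for s in P:
        # One-step lookahead: q_pi(s, a) computed from the current value estimate V.
        q = {a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in P[s]}
        policy[s] = max(q, key=q.get)   # pi(s) <- argmax_a q_pi(s, a)
    return policy
```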