Inspired by the “reward chisels cognition into the agent’s network” framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of reward function?
I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were “yep, a suitable reward/loss (and suitable data points, in the supervised case) are enough”.
I was hoping this wouldn’t be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but so it goes. I now expect that reward/loss are, in principle, enough in more complicated situations as well.
Example 1: Q-learning. You have a set S of states and a set A of actions. Given a target policy π:S→A, can you necessarily choose a reward function R:S×A→R such that, after training for long enough* with Q-learning (with positive learning rate and discount factor), the greedy action under the learned Q-values is the one given by the target policy: ∀s∈S: argmax_{a∈A} Q(s,a) = π(s)?
*and assuming we visit every state-action pair many times
The answer is yes. Simply reward the behavior you want to see: let R(s,a)=1 if a=π(s) and R(s,a)=0 otherwise.
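A quick numerical illustration of this (my own sketch, not from the post, with assumed details: a small random deterministic MDP, uniform state-action sampling, learning rate 0.1, discount factor 0.9):

```python
# Minimal sketch: tabular Q-learning on a small random deterministic MDP
# with R(s, a) = 1 if a == pi(s) else 0. The greedy action under the
# learned Q-values should end up matching the target policy pi.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
pi = rng.integers(n_actions, size=n_states)              # arbitrary target policy
T = rng.integers(n_states, size=(n_states, n_actions))   # random deterministic transitions

def R(s, a):
    return 1.0 if a == pi[s] else 0.0

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
for _ in range(200_000):
    s = rng.integers(n_states)        # sample (s, a) uniformly so every pair gets visited
    a = rng.integers(n_actions)
    s_next = T[s, a]
    Q[s, a] += alpha * (R(s, a) + gamma * Q[s_next].max() - Q[s, a])

print(np.array_equal(Q.argmax(axis=1), pi))   # expected: True
```

Here Q converges to Q*(s, π(s)) = 1/(1−γ) and Q*(s, a) = γ/(1−γ) for a ≠ π(s), so the greedy policy is exactly π.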
(In fact, one can show something stronger: for any target value function Q′:S×A→R, there is a reward function R such that the values Q(s,a) learned by Q-learning converge in the limit to Q′(s,a). So not only can you force certain behavior out of the model, you can also choose the internals.)
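One construction that I believe works (a sketch, assuming deterministic transitions s′=T(s,a) and the usual Q-learning convergence conditions): define the reward so that Q′ satisfies the Bellman optimality equation by fiat, i.e. R(s,a) = Q′(s,a) − γ max_{a′∈A} Q′(T(s,a), a′). Then Q′ is the unique fixed point of the Bellman operator, which is exactly the function Q-learning converges to, so Q(s,a) → Q′(s,a).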
Example 2: Neural network.
Say you have a neural network R^n→R with m tunable weights w=(w_1,…,w_m). Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to w′=(w′_1,…,w′_m)?
(I’m assuming here that we simply update the weights after each data point, rather than doing mini-batch SGD or something. The choice of loss function is not very important; take e.g. squared error.)
The following sketch convinces me that the answer is positive:
Choose m random input-output pairs (x_i, y_i). The corresponding loss gradients g_i with respect to the weights are almost surely linearly independent, so some linear combination c_1g_1+…+c_mg_m of them equals w′−w. Now, for small ϵ>0, running back-propagation on the pair (x_i, y_i) with learning rate ϵc_i for each i=1,…,m gives you a total update approximately proportional to w′−w (flip the signs of the c_i if needed, since gradient descent steps against the gradient). Rinse and repeat.
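Here is a numerical version of this sketch (my own illustration, with assumed details: a tiny 2-3-1 tanh network with m=13 weights, squared-error loss, m fixed random input-output pairs, and ϵ=0.1). Each round solves for the coefficients c_i and applies the combined step, which to first order is the same as running back-propagation on pair i with learning rate −ϵc_i:

```python
# Sketch: steer the weights of a tiny MLP towards an arbitrary target w'
# purely by choosing per-example learning rates.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 2, 3
m = n_hid * n_in + n_hid + n_hid + 1       # total number of weights: 13

def unpack(w):
    W1 = w[:n_hid * n_in].reshape(n_hid, n_in)
    b1 = w[n_hid * n_in : n_hid * n_in + n_hid]
    w2 = w[n_hid * n_in + n_hid : n_hid * n_in + 2 * n_hid]
    b2 = w[-1]
    return W1, b1, w2, b2

def grad(w, x, y):
    """Gradient of 0.5 * (f(x) - y)^2 with respect to the flattened weights."""
    W1, b1, w2, b2 = unpack(w)
    h = np.tanh(W1 @ x + b1)
    e = w2 @ h + b2 - y                    # prediction error
    dW1 = np.outer(e * w2 * (1 - h**2), x)
    db1 = e * w2 * (1 - h**2)
    dw2 = e * h
    db2 = e
    return np.concatenate([dW1.ravel(), db1, dw2, [db2]])

w = rng.normal(size=m)                     # current weights
w_target = rng.normal(size=m)              # the weights we want to steer towards
X = rng.normal(size=(m, n_in))             # m fixed random input-output pairs
Y = rng.normal(size=m)

eps = 0.1
for _ in range(300):
    G = np.stack([grad(w, X[i], Y[i]) for i in range(m)])    # rows are the g_i
    c, *_ = np.linalg.lstsq(G.T, w_target - w, rcond=None)   # solve sum_i c_i g_i = w' - w
    w = w + eps * (G.T @ c)                # = backprop steps with learning rates -eps * c_i

print(np.linalg.norm(w - w_target))        # should be close to zero
```

Since each round's step is approximately ϵ(w′−w), the distance to the target weights shrinks by roughly a factor of 1−ϵ per round.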