Inspired by the “reward chisels cognition into the agent’s network” framing from Reward is not the optimization target, I thought: is reward necessarily a fine enough tool? More elaborately: if you want the model to behave in a specific way or to have certain internal properties, can you achieve this simply by a suitable choice of reward function?
I looked at two toy cases, namely Q-learning and training a neural network (the latter of which is not actually reinforcement learning but supervised learning). The answers were “yep, a suitable reward/loss (and suitable data points, in the supervised case) are enough”.
I was hoping this wouldn’t be the case, as that would have been more interesting (imagine if there were fundamental limitations to the reward/loss paradigm!), but so it goes. I now expect that reward/loss are, in principle, enough in more complicated situations as well.
Example 1: Q-learning. You have a set S of states and a set A of actions. Given a target policy π:S→A, can you necessarily choose a reward function R:S×A→R such that, after training for long enough* with Q-learning (with positive learning rate and discount factor), the greedy action under the learned Q-values is the one given by the target policy: ∀s∈S: argmax_{a∈A} Q(s,a) = π(s)?
*and assuming we visit every state-action pair many times
The answer is yes. Simply reward the behavior you want to see: let R(s,a)=1 if a=π(s) and R(s,a)=0 otherwise.
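A quick numerical illustration of this (my own sketch, not from the post, with assumed details: a small random deterministic MDP, uniform state-action sampling, learning rate 0.1, discount factor 0.9):

```python
# Minimal sketch: tabular Q-learning on a small random deterministic MDP
# with R(s, a) = 1 if a == pi(s) else 0. The greedy action under the
# learned Q-values should end up matching the target policy pi.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
pi = rng.integers(n_actions, size=n_states)              # arbitrary target policy
T = rng.integers(n_states, size=(n_states, n_actions))   # random deterministic transitions

def R(s, a):
    return 1.0 if a == pi[s] else 0.0

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
for _ in range(200_000):
    s = rng.integers(n_states)        # sample (s, a) uniformly so every pair gets visited
    a = rng.integers(n_actions)
    s_next = T[s, a]
    Q[s, a] += alpha * (R(s, a) + gamma * Q[s_next].max() - Q[s, a])

print(np.array_equal(Q.argmax(axis=1), pi))   # expected: True
```

Here Q converges to Q*(s, π(s)) = 1/(1−γ) and Q*(s, a) = γ/(1−γ) for a ≠ π(s), so the greedy policy is exactly π.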
(In fact, one can show something stronger: for any target value function Q′:S×A→R, there is a reward function R such that the values Q(s,a) learned by Q-learning converge in the limit to Q′(s,a). So not only can you force certain behavior out of the model, you can also choose the internals.)
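One construction that I believe works (a sketch, assuming deterministic transitions s′=T(s,a) and the usual Q-learning convergence conditions): define the reward so that Q′ satisfies the Bellman optimality equation by fiat, i.e. R(s,a) = Q′(s,a) − γ max_{a′∈A} Q′(T(s,a), a′). Then Q′ is the unique fixed point of the Bellman operator, which is exactly the function Q-learning converges to, so Q(s,a) → Q′(s,a).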
Example 2: Neural network.
Say you have a neural network R^n→R with m tunable weights w=(w_1,…,w_m). Can you, by suitable input-output pairs and choices of the learning rate, modify the weights of the net so that they are (approximately) equal to w′=(w′_1,…,w′_m)?
(I’m assuming here that we simply update the weights after each data point, rather than doing mini-batch SGD or something. The choice of loss function is not very important; take e.g. squared error.)
The following sketch convinces me that the answer is positive:
Choose m random input-output pairs (x_i, y_i). The corresponding loss gradients g_i with respect to the weights are almost surely linearly independent, so some linear combination c_1g_1+…+c_mg_m of them equals w′−w. Now, for small ϵ>0, running back-propagation on the pair (x_i, y_i) with learning rate ϵc_i for each i=1,…,m gives you a total update approximately proportional to w′−w (flip the signs of the c_i if needed, since gradient descent steps against the gradient). Rinse and repeat.
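Here is a numerical version of this sketch (my own illustration, with assumed details: a tiny 2-3-1 tanh network with m=13 weights, squared-error loss, m fixed random input-output pairs, and ϵ=0.1). Each round solves for the coefficients c_i and applies the combined step, which to first order is the same as running back-propagation on pair i with learning rate −ϵc_i:

```python
# Sketch: steer the weights of a tiny MLP towards an arbitrary target w'
# purely by choosing per-example learning rates.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 2, 3
m = n_hid * n_in + n_hid + n_hid + 1       # total number of weights: 13

def unpack(w):
    W1 = w[:n_hid * n_in].reshape(n_hid, n_in)
    b1 = w[n_hid * n_in : n_hid * n_in + n_hid]
    w2 = w[n_hid * n_in + n_hid : n_hid * n_in + 2 * n_hid]
    b2 = w[-1]
    return W1, b1, w2, b2

def grad(w, x, y):
    """Gradient of 0.5 * (f(x) - y)^2 with respect to the flattened weights."""
    W1, b1, w2, b2 = unpack(w)
    h = np.tanh(W1 @ x + b1)
    e = w2 @ h + b2 - y                    # prediction error
    dW1 = np.outer(e * w2 * (1 - h**2), x)
    db1 = e * w2 * (1 - h**2)
    dw2 = e * h
    db2 = e
    return np.concatenate([dW1.ravel(), db1, dw2, [db2]])

w = rng.normal(size=m)                     # current weights
w_target = rng.normal(size=m)              # the weights we want to steer towards
X = rng.normal(size=(m, n_in))             # m fixed random input-output pairs
Y = rng.normal(size=m)

eps = 0.1
for _ in range(300):
    G = np.stack([grad(w, X[i], Y[i]) for i in range(m)])    # rows are the g_i
    c, *_ = np.linalg.lstsq(G.T, w_target - w, rcond=None)   # solve sum_i c_i g_i = w' - w
    w = w + eps * (G.T @ c)                # = backprop steps with learning rates -eps * c_i

print(np.linalg.norm(w - w_target))        # should be close to zero
```

Since each round's step is approximately ϵ(w′−w), the distance to the target weights shrinks by roughly a factor of 1−ϵ per round.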