On the other hand, there doesn’t seem to be a principled difference between positive reinforcement and negative reinforcement. I would assume the zero point wouldn’t affect the trade-off between two actions as long as the difference between them was fixed.
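To make that intuition concrete, here’s a minimal sketch (my own toy example; the random MDP, discount factor, and shift constant are all arbitrary choices, not anything from this thread): in a discounted MDP, adding a constant c to every reward shifts each Q-value by the same amount, c / (1 − gamma), so the greedy optimal policy is unchanged.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
# Random MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def optimal_policy(R):
    """Q-value iteration; returns the greedy policy and the Q-table."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(500):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1), Q

pi, Q = optimal_policy(R)
pi_shift, Q_shift = optimal_policy(R + 5.0)   # add c = 5 to every reward
assert (pi == pi_shift).all()                 # same optimal policy
print(Q_shift - Q)                            # every entry ~ c / (1 - gamma) = 50
```

So at the level of optimal policies, only reward differences matter, which is the sense in which the zero point drops out.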
This is only true for optimal policies, no? For learned policies, positive reward will upweight and generalize certain circuits (like “approach juice”), while negative reward will downweight and generally discourage those same circuits. This can then lead to path-dependent differences in generalization (e.g. whether the person pursues juice in general); see the toy sketch below.
(In general, I think reward is not best understood as an optimization target like “utility.”)
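A toy illustration of that path dependence (again my own sketch; the one-state softmax policy, learning rate, and reward numbers are invented for the example): two reward schemes with the same difference between actions but different zero points touch the parameters differently, because REINFORCE scales the gradient of log pi by the reward of the action actually sampled.

```python
import numpy as np

def reinforce(rewards, seed, steps=200, lr=0.1):
    """One-state REINFORCE with a softmax policy over [juice, other]."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(2)
    for _ in range(steps):
        p = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(2, p=p)
        grad = -p
        grad[a] += 1.0                    # gradient of log pi(a) w.r.t. logits
        logits += lr * rewards[a] * grad  # positive R reinforces a; negative R suppresses it
    return logits

# Same difference between actions (juice beats other by 1), different zero points:
print(reinforce(rewards=[1.0, 0.0], seed=0))    # updates fire when juice is taken
print(reinforce(rewards=[0.0, -1.0], seed=0))   # updates fire only when other is taken
```

Both runs end up preferring juice, but via different updates: under the second scheme the “juice” logit only moves as a side effect of punishing “other,” so once “other” becomes rare the parameters nearly stop changing. That’s the sense in which which circuits get strengthened versus suppressed depends on the zero point, even when the optimal policy doesn’t.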
Good point.
(That said, it seems like a useful check to see what the optimal policy would do. And if someone believes training won’t reach the optimal policy, it seems useful to try to understand the barrier that stops it. I don’t feel quite clear on this yet.)