i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
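to spell out the "only so much probability mass to go around" part: for any normalized policy $\pi_\theta$, a uniform push towards every action cancels out in expectation, since

$$\mathbb{E}_{a\sim\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a)\right] = \sum_a \pi_\theta(a)\,\nabla_\theta \log \pi_\theta(a) = \nabla_\theta \sum_a \pi_\theta(a) = \nabla_\theta 1 = 0,$$

so only the relative differences between rewards show up in the expected update.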
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although the expectation of the policy gradient doesn’t change if you translate the reward, the translation hugely affects the variance of the gradient estimator.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
my hypothesis is that humans have some hard-coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in the reward distribution (e.g. hedonic adaptation), but the adjustment isn’t perfect, and so in the current world, our baseline may be suboptimal.
[1] Quick proof sketch (this is a very standard result in RL and is the motivation for advantage estimation, but still good practice to check things).
The REINFORCE estimator is $\hat{g}(\tau) = R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)$, with $\tau \sim \pi_\theta$.
WLOG, suppose we define a new reward $R'(\tau) = R(\tau) + c$ (and assume that $c$ has the same sign as $\mathbb{E}\left[R(\tau)\,\|\nabla_\theta \log \pi_\theta(\tau)\|^2\right]$, so the shift is moving the rewards away from the mean rather than towards it).
Then we can verify the expectation of the gradient is still the same: $\mathbb{E}\left[(R(\tau)+c)\,\nabla_\theta \log \pi_\theta(\tau)\right] = \mathbb{E}\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right] + c\,\mathbb{E}\left[\nabla_\theta \log \pi_\theta(\tau)\right] = \mathbb{E}\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]$, since the score function has mean zero.
But the variance increases (writing $\mathrm{Var}[X]$ for the trace of the covariance, $\mathbb{E}[\|X\|^2] - \|\mathbb{E}[X]\|^2$):
$$\mathrm{Var}\left[(R(\tau)+c)\,\nabla_\theta \log \pi_\theta(\tau)\right] = \mathbb{E}\left[(R(\tau)+c)^2\,\|\nabla_\theta \log \pi_\theta(\tau)\|^2\right] - \left\|\mathbb{E}\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right]\right\|^2$$
So:
$$\mathrm{Var}\left[(R(\tau)+c)\,\nabla_\theta \log \pi_\theta(\tau)\right] - \mathrm{Var}\left[R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\right] = 2c\,\mathbb{E}\left[R(\tau)\,\|\nabla_\theta \log \pi_\theta(\tau)\|^2\right] + c^2\,\mathbb{E}\left[\|\nabla_\theta \log \pi_\theta(\tau)\|^2\right]$$
Obviously, both terms on the right have to be non-negative (the first one because of the sign assumption above). More generally, if $|c|$ is large compared to the typical reward, the variance increases with $c^2$. So having your rewards be uncentered hurts a ton.
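As a sanity check, here is a minimal simulation of the above (the two-armed softmax bandit, the `sample_grads` helper, and all the specific numbers are just made up for illustration): the sample mean of the REINFORCE gradient is essentially unchanged by the shift $c$, while its variance grows rapidly with $c$.

```python
import numpy as np

rng = np.random.default_rng(0)

# two-armed bandit with a softmax policy over logits theta
theta = np.array([0.3, -0.2])
rewards = np.array([1.0, 0.0])   # arm 0 is the better arm

def sample_grads(c, n=200_000):
    """Per-sample REINFORCE gradients with the reward shifted by c."""
    probs = np.exp(theta) / np.exp(theta).sum()
    arms = rng.choice(2, size=n, p=probs)
    # gradient of log softmax: one_hot(a) - probs
    grad_logp = np.eye(2)[arms] - probs
    return (rewards[arms] + c)[:, None] * grad_logp

for c in [0.0, 1.0, 10.0]:
    g = sample_grads(c)
    print(f"c={c:5.1f}  mean={g.mean(axis=0).round(3)}  var={g.var(axis=0).round(3)}")
```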
is this for a reason other than the variance thing I mention?
I think the thing I mention is still important because it means there is no fundamental difference between positive and negative motivation. I agree that if everything were different degrees of extreme bliss, then the variance would be so high that you would never learn anything in practice. but if you shift everything slightly such that some mildly unpleasant things are now mildly pleasant, I claim this will make learning a bit faster or slower but still converge to the same thing.
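For a rough illustration of this claim (the bandit setup, the `final_p_best` helper, and all the numbers here are just made up for the sketch): vanilla REINFORCE on a two-armed bandit still settles on the better arm under modest reward shifts $c$, just somewhat more noisily when the rewards are badly uncentered.

```python
import numpy as np

def final_p_best(c, steps=2000, lr=0.1, seed=0):
    """Train a two-armed softmax bandit with vanilla REINFORCE,
    rewards shifted by c; return the final probability of the better arm."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    rewards = np.array([1.0, 0.0])        # arm 0 is the better arm
    for _ in range(steps):
        probs = np.exp(theta) / np.exp(theta).sum()
        a = rng.choice(2, p=probs)
        theta += lr * (rewards[a] + c) * (np.eye(2)[a] - probs)
    return (np.exp(theta) / np.exp(theta).sum())[0]

for c in [-0.5, 0.0, 0.5, 2.0]:
    p = np.mean([final_p_best(c, seed=s) for s in range(20)])
    print(f"shift c={c:+.1f}  ->  mean p(best arm) after training ≈ {p:.3f}")
```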