So, (I claim that) machine learning models provide a pretty good point of comparison for the dopamine-moving-earlier thing: e.g., this is what you'd expect from a system that does a local reinforce-positive update on the policy net as soon as the value net starts predicting a higher future expected value. See something on actor-critic methods, e.g. section 3.2.1 of this pdf. Because we're starting from the prior that the brain is well enough designed to get pretty damn close to working, seeing that the reward signal to the policy moves earlier is not evidence that should update us away from models where the brain is doing correct temporal difference learning (section 2.3.3 in that pdf).
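For concreteness, here's a minimal tabular actor-critic sketch (my own illustration in Python; the state/action counts, learning rates, and variable names are made up, not taken from that pdf). The thing to notice is that the actor gets its reinforce-style nudge from the TD error, which goes positive as soon as the critic's value estimate for the next state rises, before any primary reward arrives:

```python
import numpy as np

n_states, n_actions = 5, 2
V = np.zeros(n_states)                    # critic: state-value estimates
theta = np.zeros((n_states, n_actions))   # actor: action preferences
alpha_v, alpha_pi, gamma = 0.1, 0.1, 0.9

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic_step(s, a, r, s_next):
    # TD error: reward plus discounted predicted value, minus current prediction.
    # A jump in V[s_next] alone (with r = 0) already makes delta positive.
    delta = r + gamma * V[s_next] - V[s]
    # Critic update: ordinary temporal-difference learning on the value estimate.
    V[s] += alpha_v * delta
    # Actor update: local reinforce-style nudge on the policy preferences,
    # gated by the same TD error (the "reinforce-positive" signal).
    pi = softmax(theta[s])
    grad = -pi
    grad[a] += 1.0
    theta[s] += alpha_pi * delta * grad
    return delta
```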
The social thing I’m suggesting is that the expected value the value function predicts on seeing “oh, I gained weight” is a correct representation of future reward, even though it’s a very simple approximation. I don’t mean to say that I think a complicated, multi-step model is being run, just that the usual approximation is approximating a reasoning process that, if done in full using the verbal loop, would look something like the following (there’s a toy sketch of this after the list):
I have higher weight
I now know that I have higher weight
I now have less justified ability to claim high status
When I next interact with someone, I will have less claim to be valuable in their eyes
I will therefore expect them to express slightly less approval toward me, because I won’t be able to hide that I know I feel I have less justified ability to claim status
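To make the “very simple approximation” point concrete, here’s a toy sketch (my own illustration; the numbers and the `predicted_social_approval` function are made up): the learned value function collapses that whole verbal chain into a cached mapping from a cue like “gained weight” to a slightly lower expected future approval, so the prediction error shows up the moment the cue is noticed, not when the social interaction happens.

```python
# Toy illustration only: the value function as a cheap cache of the
# status-reasoning chain above. All numbers and names are invented.

def predicted_social_approval(state):
    # Stands in for the full multi-step verbal reasoning: one learned feature
    # ("gained_weight") maps directly onto slightly lower expected approval
    # in future interactions.
    return 0.95 if state.get("gained_weight") else 1.00

def prediction_error(reward, state, next_state):
    # Undiscounted TD-style error, for simplicity. No social interaction has
    # happened yet (reward = 0), but the prediction for the next state has
    # already dropped, so the error goes negative as soon as the cue appears.
    return reward + predicted_social_approval(next_state) - predicted_social_approval(state)

print(prediction_error(0.0, {"gained_weight": False}, {"gained_weight": True}))
# negative: the prediction drops the moment the weight gain is registered
```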
I am saying that I don’t think the implementation of TD-learning is the problem here.
Got it. That makes sense. I think I still disagree, but if I’ve understood you right I can agree that that hypothesis also clearly deserves to be in the mix.