I claim that what’s going on is that the monkey’s brain, separate from the monkey/the monkey’s S2/any sapient or strategic awareness that the monkey has, is conditioning the monkey.
I think this claim is confusing at best and false at worst. The shifting dopamine response is well-recognized in the neuroscience literature, and explained by Sutton and Barto’s Temporal-Difference model.
First, it should be emphasized that midbrain dopamine does not signal reward. The monkey can experience a ton of pleasure without any dopamine reaction. Midbrain dopamine signals reward prediction error, the difference between actual and expected reward. It signals a kind of surprise.
Now the TD model is quite Bayesian. Whereas the Rescorla-Wagner model (the previously dominant theory of reinforcement) viewed the prediction error as the difference between actual and expected current reward, the TD model instead views it as the difference between actual and expected future rewards, summed over time and properly discounted.
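To make the contrast concrete, here is one common way to write the two error terms. The notation is my own shorthand rather than anything from OP: r_t is the reward at time t, V is the learned value estimate, and γ is a discount factor between 0 and 1. The Rescorla-Wagner line is simplified to a single stimulus.

```latex
\delta^{\mathrm{RW}}_t = r_t - V(s_t)
\qquad
\delta^{\mathrm{TD}}_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

In the TD version, V(s_t) stands for the expected discounted sum of all future rewards, so the error compares "reward now plus what I still expect afterwards" against "what I expected a moment ago".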
So when the dopamine signal shifts, the monkey is just conserving expected evidence. Initially, it is positively surprised to receive juice. But eventually, it learns that the screen perfectly predicts the juice, and so it is the appearance of the screen itself that becomes the positive surprise. On a classical model of reinforcement, these events are different, as OP seems to recognize. But on the TD model, these are just instances of the very same kind of conditioning event.
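The shift described above can be reproduced with a toy simulation. This is only a minimal sketch under my own assumptions, not a model of the actual experiment: a tabular TD(0) learner, a fixed delay of `n_steps` between screen and juice, a pre-cue state whose value is held at zero because the screen appears at unpredictable times, and no discounting (`gamma=1.0`). The function name and all parameters are hypothetical.

```python
def simulate_td(n_trials=500, n_steps=5, alpha=0.1, gamma=1.0, reward=1.0):
    """Tabular TD(0) on a cue -> delay -> juice trial.

    Returns one list of prediction errors per trial; index 0 is the error
    at cue onset, the last index is the error at juice delivery.
    """
    # V[0]: pre-cue state, held at 0 (assumption: cue timing is unpredictable).
    # V[1..n_steps]: time steps from cue onset up to juice delivery.
    # V[n_steps + 1]: terminal state after the juice, value 0.
    V = [0.0] * (n_steps + 2)
    history = []
    for _ in range(n_trials):
        deltas = []
        for s in range(n_steps + 1):
            # Juice arrives only on the transition out of the last delay step.
            r = reward if s == n_steps else 0.0
            # TD error: reward now plus predicted future, minus prior prediction.
            delta = r + gamma * V[s + 1] - V[s]
            if s > 0:
                V[s] += alpha * delta  # the pre-cue state is never updated
            deltas.append(delta)
        history.append(deltas)
    return history


history = simulate_td()
for trial in (0, len(history) - 1):
    cue_error, juice_error = history[trial][0], history[trial][-1]
    print(f"trial {trial + 1}: cue error {cue_error:.3f}, juice error {juice_error:.3f}")
```

On the first trial the whole error sits at the juice; after training it sits at the cue and the juice itself is fully predicted. The same update rule runs throughout, which is the sense in which these are the same kind of event.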
For further reference, see the section “Two Dopamine Responses and One Theory” of Glimcher PW (2011) Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.
OP seems to recognize all this, but these observations seem to be complemented with somewhat unfounded interpretations and elaborations.
[Epistemic status: confident OP will be confusing to those without RL background knowledge, but still non-negligible credence that OP is explaining exactly the above but from a different perspective]
Thanks for the info! I think the diff between my explanation and yours largely falls out “true” in your favor, and I’m glad you have additional clarification (correction?) here.