Hmm, that’s interesting! I think I mostly agree with you in spirit here.
My starting point for Sharpe 2017 would be: the topic of discussion is really cortical learning, via editing within-cortex connections. The cortex can learn new sequences, or it can learn new categories, etc.
For sensory prediction areas, cortical learning doesn’t really need dopamine, I don’t think. You can just have a self-supervised learning rule, i.e. “if you have a sensory prediction error, then improve your models”. (Leaving aside some performance tweaks.) (Footnote—Yeah I know, there is in fact dopamine in primary sensory cortex, at least controversially and maybe only in layers 1&6, I’m still kinda confused about what’s going on with that.)
Decisionmaking areas are kinda different. Take sequence learning as an example. If I try the sequence “I see a tree branch and then I jump and grab it”, and then the branch breaks off and I fall down and everyone laughs at me, then that wasn’t a good sequence to learn, and it’s all for the better if that particular idea doesn’t pop into my head next time I see a tree branch.
So in decisionmaking areas, you could have the following rule: “run the sequence-learning algorithm (or category-learning algorithm or whatever), but only when RPE-DA is present”.
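To make that contrast concrete, here's a minimal toy sketch (entirely my own illustration, nothing from Sharpe 2017; the function names, shapes, and numbers are made up): a sensory-prediction model that updates whenever there's a prediction error, next to a decisionmaking-style update that only happens to the extent that an RPE-dopamine signal is present.

```python
import numpy as np

# --- Sensory prediction area: self-supervised rule, no dopamine required ---
def self_supervised_update(W, x_now, x_next, lr=0.1):
    """If there is a sensory prediction error, improve the model."""
    prediction_error = x_next - W @ x_now        # vector error, one component per output dim
    return W + lr * np.outer(prediction_error, x_now)

# --- Decisionmaking area: a learning step that only runs when RPE-dopamine is present ---
def dopamine_gated_update(w_plan, context, rpe, lr=0.1):
    """The scalar RPE gates the update and sets its sign and size."""
    return w_plan + lr * rpe * context           # rpe == 0  ->  no learning

# Tiny usage example with made-up numbers:
rng = np.random.default_rng(0)
W = np.zeros((3, 3))
x_now, x_next = rng.normal(size=3), rng.normal(size=3)
W = self_supervised_update(W, x_now, x_next)

w_plan = np.zeros(3)
w_plan = dopamine_gated_update(w_plan, context=x_now, rpe=+0.8)   # burst: reinforce this plan
```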
Then I pretty much agree with you on 1,2,3,4. In particular, I would guess that the learning takes place in the sensory prediction areas in 2 & 3 (where there’s a PE), and that learning takes place in the decisionmaking areas in 4 (maybe something like: IT learns to attend to C, or maybe IT learns to lump A & C together into a new joint category, or whatever).
I’m reluctant to make any strong connection between self-supervised learning and “dopamine-supervised learning” though. The reason is: Dopamine-supervised learning would require (at least) one dopamine neuron per dimension of the output space. But for self-supervised learning, at least in mammals, I generally think of it as “predicting what will happen next, expressed in terms of some learned latent space of objects/concepts”. I think of the learned latent space as being very high-dimensional, with the number of dimensions being able to change in real time as the rat learns new things. Whereas the dimensionality of the set of dopamine neurons seems to be fixed.
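A toy way to see the dimensionality point (my framing and made-up numbers, not a claim about actual dopamine anatomy): a supervised or self-supervised teaching signal for a D-dimensional output needs D error components, while a fixed population of dopamine-like channels can at best carry a low-dimensional summary that gates or scales learning, and can't grow as the latent space grows.

```python
import numpy as np

rng = np.random.default_rng(0)

D_latent = 10_000     # dimensionality of the learned latent space (could keep growing)
N_dopamine = 20       # made-up, fixed number of distinct dopamine error channels

# A full teaching signal has one error component per output dimension:
target = rng.normal(size=D_latent)
prediction = rng.normal(size=D_latent)
full_error = target - prediction                  # shape (10000,)

# A fixed dopamine population can only carry some low-dimensional summary of it:
dopamine_summary = full_error[:N_dopamine]        # stand-in for any fixed 20-number code

print(full_error.shape, dopamine_summary.shape)   # (10000,) vs. (20,)
# 20 numbers can gate or scale an update, but can't tell 10,000 dimensions
# individually which way to move.
```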
Is it also correct that DA for global/local RPEs and supervised/self-supervised learning in the completely naive brain should go in different directions?
Hmm, I think “not necessarily”. You can perform really crappily (by adult standards) and still get positive RPEs half the time because your baseline expectations were even worse. Like, the brainstem probably wouldn’t hold the infant cortex to adult-cortex standards. And the reward predictions should also converge to the actual rewards, which would give average RPE of 0, to a first approximation.
And for supervisory signals, they could be signed, which means DA pauses half the time and bursts half the time. I’m not sure that’s necessary—another approach is to have a pair of opponent-process learning algorithms with unsigned errors, maybe. I don’t know what the learning rules etc. are in detail.
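Here's a tiny toy simulation of those two points (my own invented numbers): with a pessimistic baseline prediction you get mostly positive RPEs at first even though performance hasn't changed, and once the prediction converges to the actual rewards the average RPE goes to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mean_reward = 1.0
V = -2.0              # start with a pessimistic ("even worse baseline") reward prediction
lr = 0.1

early, late = [], []
for t in range(2000):
    r = true_mean_reward + rng.normal(scale=0.5)   # noisy reward
    rpe = r - V                                    # reward prediction error
    V += lr * rpe                                  # prediction drifts toward the true mean
    (early if t < 100 else late).append(rpe)

print(f"mean RPE over first 100 steps: {np.mean(early):+.2f}")   # clearly positive
print(f"mean RPE over the rest:        {np.mean(late):+.2f}")    # ~0 once converged
```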
I’m reluctant to make any strong connection between self-supervised learning and “dopamine-supervised learning” though. The reason is: Dopamine-supervised learning would require (at least) one dopamine neuron per dimension of the output space
I totally agree that the dopamine signals don't have enough dimensionality to provide teaching feedback for self-supervised learning with the same specificity as in supervised learning.
What I was rather trying to say is that maybe dopamine is involved in self-supervised learning only by providing a permissive signal to update the model. And I was trying to understand how sensory PE is related to dopamine release.
For sensory prediction areas, cortical learning doesn’t really need dopamine, I don’t think
That's what I always assumed before Sharpe 2017. But in their experiments, inhibiting dopamine release blocks learning of the association between the two stimuli: the PE is still there, there's little dopamine release, and no model is learned. By "PE is still there" I mean that the PE gets registered by neurons (not that the mouse becomes inattentive or blind upon dopamine inhibition), yet the model is still not learned despite (pyramidal?) neurons signaling the presence of the PE; this is the most interesting case, compared to the "mouse just goes blind" case. If by learning in sensory prediction areas you mean modifying synapses in V1, I agree: you might not need synaptic changes or dopamine there; sensory learning (and the need for dopamine) can happen somewhere else (hippocampus / entorhinal cortex? no clue) that sends predictions to V1. The model is learned at the level of knowing when to fire predictions from entorhinal-ish cortex to V1.
Even if this "dopamine permits updating the sensory model" story is true, I also don't get why you would need dopamine as an intermediate node between the PE and updating the model. Why not just update the model once you get a cortical PE (signaled by pyramidal neurons)? But there is an interesting connection to schizophrenia: there is abnormal dopamine release in schizophrenic patients; maybe they needlessly update their models because upregulated dopamine says so (I found this in Sharpe 2020).
And the reward predictions should also converge to the actual rewards, which would give average RPE of 0, to a first approximation.
I guess I misunderstood your model. I assumed that, for a given environment, the ideal policy would lead to a big dopamine release saying "this was a really good plan, repeat it next time". After rereading your decision-making post, it seems that the assessors predict the reward, so there will be no dopamine since RPE = 0?
Side question: when you talk about plan assessors, do you think there should be some mechanism in the brainstem that corrects the RPE signals going to the cortex based on the signals sent to the supervised-learning plan assessors? For example, if the plan is "go eat" and your untrained amygdala says "we don't need to salivate", and you don't salivate, then you get a much smaller reward (especially with crunchy chips) than if you had salivated. Sure, the amygdala and other assessors will get their supervisory signal, but it also seems to me that the plan "go eat" is not that bad, and it shouldn't be penalized that much just because the amygdala screwed up and you didn't salivate, so the reward signal should be corrected somehow?
If by learning in sensory prediction areas you mean modifying synapses in V1, I agree: you might not need synaptic changes or dopamine there; sensory learning (and the need for dopamine) can happen somewhere else (hippocampus / entorhinal cortex? no clue) that sends predictions to V1. The model is learned at the level of knowing when to fire predictions from entorhinal-ish cortex to V1.
Yup, that’s what I meant.
Even if this "dopamine permits updating the sensory model" story is true, I also don't get why you would need dopamine as an intermediate node between the PE and updating the model. Why not just update the model once you get a cortical PE (signaled by pyramidal neurons)?
For example, I’m sitting on my porch mindlessly watching cars drive by. There’s a red car and then a green car. After seeing the red car, I wasn’t expecting the next car to be green … but I also wasn’t expecting the next car not to be green. I just didn’t have any particular expectations about the next car’s color. So I would say that the “green” is unexpected but not a prediction error. There was no prediction; my models were not wrong but merely agnostic.
In other words:
The question of “what prediction to make” has a right and wrong answer, and can therefore be trained by prediction errors (self-supervised learning).
The question of “whether to make a prediction in the first place, as opposed to ignoring the thing and attending to something else (or zoning out altogether)” is a decision, and therefore cannot be trained by prediction errors. If it’s learned at all, it has to be trained by RL, I think.
(In reality, I don’t think it’s a binary “make a prediction about X / don’t make a prediction about X”, instead I think you can make stronger or weaker predictions about things.)
And I think the decision of “whether or not to make a strong prediction about what color car will come after the red car” is not being made in V1, but rather in, I dunno, maybe IT or FEF or dlPFC or HC/EC (like you suggest) or something.
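Here's one toy way to cash out that split (purely my sketch of the idea, with invented names and numbers): the content of the prediction is trained by prediction error, while a separate "how strongly to predict this" knob is treated as a decision and trained by an RL-style signal, such as whether attending to it ever pays off.

```python
import numpy as np

rng = np.random.default_rng(0)

W = np.zeros((3, 3))    # *what* to predict: trained by self-supervised prediction error
gate = 0.6              # *how strongly* to predict this thing: a decision, trained RL-style

def step(x_now, x_next, reward_for_attending, lr_ssl=0.1, lr_rl=0.05):
    global W, gate
    if gate > 0.5:                               # (binary here for simplicity; in reality
        err = x_next - W @ x_now                 #  it would be graded)
        W += lr_ssl * np.outer(err, x_now)       # content improved by prediction error
    gate += lr_rl * reward_for_attending         # the decision itself is shaped by RL
    gate = float(np.clip(gate, 0.0, 1.0))

# If predicting car colors never pays off, the gate drifts to ~0 and you stop bothering:
for _ in range(200):
    step(rng.normal(size=3), rng.normal(size=3), reward_for_attending=-0.1)
print(round(gate, 2))    # 0.0
```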
I guess I misunderstood your model. I assumed that, for a given environment, the ideal policy would lead to a big dopamine release saying "this was a really good plan, repeat it next time". After rereading your decision-making post, it seems that the assessors predict the reward, so there will be no dopamine since RPE = 0?
To be clear, in regards to “no dopamine”, sometimes I leave out “(compared to baseline)”, so “positive dopamine (compared to baseline)” is a burst and “negative dopamine (compared to baseline)” is a pause. (I should stop doing that!) Anyway, my impression right now is that when things are going exactly as expected, even if that’s very good in some objective sense, it’s baseline dopamine, neither burst nor pause—e.g. in the classic Wolfram Schultz experiment, there was baseline dopamine at the fully-expected juice, even though drinking juice is really great compared to what monkeys might generically expect in their evolutionary environment.
(Exception: if things are going exactly as expected, but it’s really awful and painful and dangerous, there’s apparently still a dopamine pause—it never gets fully predicted away—see here, the part that says “Punishments create dopamine pauses even when they’re fully expected”.)
I’m not sure if “baseline dopamine” corresponds to “slightly strengthen the connections for what you’re doing”, or “don’t change the connection strengths at all”.
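For the Schultz-style picture, here's a bare-bones TD(0) toy of a cue-then-juice trial (the states, learning rate, and the "pre-cue value is zero because the cue arrives unpredictably" assumption are all mine): early on, the RPE (dopamine relative to baseline) shows up at the juice; after training, the fully-expected juice gives roughly zero RPE, i.e. baseline dopamine, and the burst has moved to the cue.

```python
import numpy as np

gamma, lr = 1.0, 0.2
V = np.zeros(2)      # V[0]: value right after the cue; V[1]: value just before the juice

def trial(V, update=True):
    """One cue -> delay -> juice trial. Returns (RPE at cue onset, RPE at juice)."""
    rpe_cue = gamma * V[0] - 0.0          # cue arrives unpredictably; pre-cue value = 0
    rpe_mid = 0.0 + gamma * V[1] - V[0]   # cue -> delay step, no reward yet
    rpe_juice = 1.0 + 0.0 - V[1]          # juice delivered, trial ends
    if update:
        V[0] += lr * rpe_mid
        V[1] += lr * rpe_juice
    return rpe_cue, rpe_juice

print("first trial:   ", trial(V))        # (0.0, 1.0): burst at the juice, nothing at the cue
for _ in range(500):
    trial(V)
print("after training:", tuple(round(x, 2) for x in trial(V, update=False)))
# -> about (1.0, 0.0): burst at the cue; the fully-expected juice is ~baseline dopamine
```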
Side question: when you talk about plan assessors, do you think there should be some mechanism in the brainstem that corrects the RPE signals going to the cortex based on the signals sent to the supervised-learning plan assessors? For example, if the plan is "go eat" and your untrained amygdala says "we don't need to salivate", and you don't salivate, then you get a much smaller reward (especially with crunchy chips) than if you had salivated. Sure, the amygdala and other assessors will get their supervisory signal, but it also seems to me that the plan "go eat" is not that bad, and it shouldn't be penalized that much just because the amygdala screwed up and you didn't salivate, so the reward signal should be corrected somehow?
OK, let’s say I happen to really need salt right now. I grab what I thought was an unsalted peanut and put it in my mouth, but actually it’s really salty. Awesome!
From a design perspective, whatever my decisionmaking circuits were doing just now was a good thing to do, and they ought to receive an RL-type dopamine burst ensuring that it happens again.
My introspective experience matches that: I’m surprised and delighted that the peanut is salty.
Your comment suggests that this is not the default, but requires some correction mechanism. I’m kinda confused by what you wrote; I’m not exactly sure where you’re coming from.
Maybe you’re thinking: it’s aversive to put something salty in your mouth without salivating first. Well, why should it be aversive? It’s not harmful for the organism, it just takes a bit longer to swallow. Anyway, the decisionmaking circuit didn’t do anything wrong. So I would expect “putting salty food into a dry mouth” to be wired up as not inherently aversive. That seems to match my introspective experience.
Or maybe you’re thinking: My hypothalamus & brainstem are tricked by the amygdala / AIC / whatever to treat the peanut as not-salty? Well, the brainstem has ground truth (there’s a direct line from the taste buds to the medulla I think) so whatever they were guessing before doesn’t matter; now that the peanut is in the mouth, they know it’s definitely salty, and will issue appropriate signals.
Exception: if things are going exactly as expected, but it’s really awful and painful and dangerous, there’s apparently still a dopamine pause—it never gets fully predicted away
Interestingly, the same goes for serotonin: Fig. 7B in Matias 2017. But it's also not clear which raphe neurons do this; it seems there is a similar picture as with dopamine, with projections to different areas responding differently to aversive stimuli.
Maybe you’re thinking: it’s aversive to put something salty in your mouth without salivating first.
Closer to this. Well, it wasn't a fully formed thought; I just came up with the salt example and thought there might be this problem. What I meant is a sort of credit-assignment problem: if the dopamine in the midbrain depends on both the cortical action/thought and the assessor action, then how does the midbrain assign dopamine to both the cortex plan-proposers and the assessors? I guess for this you need a situation where reward(plan1, assessor_action1) > 0 but reward(plan1, assessor_action2) < 0, and the salt example is bad here because in both the salivating and not-salivating cases reward > 0. Maybe something like inappropriately laughing after you've been told about some tragedy: you got a negative reward, but that doesn't mean this topic has to be avoided altogether in the future (reinforced by the decrease of dopamine); rather, you should just change your assessor reaction, and the reward will become positive. And my point was that it's not clear how this can happen if the only thing the cortex plan-proposer sees is the negative dopamine (without additionally knowing that the assessors also got negative dopamine, so that the overall negative dopamine can be explained by the wrong assessor action, and the plan-proposer doesn't actually need to change anything).
Oh I gotcha. Well one thing is, I figure the whole system doesn’t come crashing down if the plan-proposer gets an “incorrect” reward sometimes. I mean, that’s inevitable—the plan-assessors keep getting adjusted over the course of your life as you have new experiences etc., and the plan-proposer has to keep playing catch-up.
But I think it’s better than that.
Here’s an alternate example that I find a bit cleaner (sorry if it’s missing your point). You put something in your mouth expecting it to be yummy (thus release certain hormones), but it’s actually gross (thus make a disgust face and release different hormones etc.). So reward(plan1, assessor_action1)>0 but reward(plan1, assessor_action2)<0. I think as you bring the food towards your mouth, you’re getting assessor_action1 and hence the “wrong” reward, but once it’s in your mouth, your hypothalamus / brainstem immediately pivots to assessor_action2 and hence the “right reward”. And the “right reward” is stronger than the “wrong reward”, because it’s driven by a direct ground-truth experience not just an uncertain expectation. So in the end the plan proposer would get the right training signal overall, I think.
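A toy version of that two-phase story (my own sketch of how I read the argument; the numbers, and the assumption that the ground-truth-driven reward is weighted more heavily, are mine): the plan-proposer first gets a small "wrong" reward driven by the assessor's expectation, then a larger opposite-sign reward once ground truth arrives, so the net update still has the right sign.

```python
# Plan: put the thing in your mouth. The assessor first predicts "yummy"; then ground
# truth says "gross". The reward values are made up; ground truth is the larger signal.
anticipatory_reward = +0.3     # driven by assessor_action1 (the "yummy" expectation)
ground_truth_reward = -1.0     # driven by assessor_action2 once it's actually in the mouth

lr = 0.1
plan_value = 0.0               # plan-proposer's learned value for "eat this thing"

plan_value += lr * anticipatory_reward   # the temporarily "wrong" training signal
plan_value += lr * ground_truth_reward   # the stronger "right" signal arrives a moment later

print(round(plan_value, 3))    # -0.07: the net effect has the correct (negative) sign
```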