Thanks for the answers!
I’m reluctant to make any strong connection between self-supervised learning and “dopamine-supervised learning” though. The reason is: Dopamine-supervised learning would require (at least) one dopamine neuron per dimension of the output space
I totally agree that dopamine signals don’t have enough dimensionality to provide teaching feedback for self-supervised learning with the same specificity as in supervised learning.
What I was rather trying to say is that maybe dopamine is involved in self-supervised learning by providing only a permissive signal to update the model. I was also trying to understand how sensory PE is related to dopamine release.
For sensory prediction areas, cortical learning doesn’t really need dopamine, I don’t think
That’s what I always assumed before Sharpe 2017. But in their experiments, inhibiting dopamine release blocks learning of the association between two stimuli: the PE is still there, there is little dopamine release, and no model is learned. By “PE is still there” I mean I assume the PE still gets registered by neurons (it’s not that the animal becomes inattentive or blind under dopamine inhibition), yet the model is not learned even though (pyramidal?) neurons signal the PE. That is the most interesting case, compared with the “just goes blind” case. If by learning in sensory prediction areas you mean modifying synapses in V1, I agree: you might not need synaptic changes or dopamine there. The sensory learning (and the need for dopamine) could happen somewhere else (hippocampus/entorhinal cortex? no clue) that sends predictions to V1. The model would then be learned at the level of knowing when to fire predictions from the entorhinal-ish cortex to V1.
Even if this “dopamine permits updating the sensory model” idea is true, I also don’t get why you would need dopamine as an intermediate node between the PE and the model update. Why not just update the model once you get a cortical PE (signaled by pyramidal neurons)? But there is an interesting connection to schizophrenia: schizophrenic patients show abnormal dopamine release, so maybe they needlessly update their models because upregulated dopamine says so (found it in Sharpe 2020).
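To make the “permissive signal” reading concrete, here is a minimal sketch of a gated (three-factor style) update. It is only my toy illustration, not the model from Sharpe 2017 or from the post: `learn_association`, `dopamine_gate`, and the scalar `weight` are hypothetical names, and the two stimuli are reduced to constants. The point is that the prediction error is computed on every trial, but the weight only changes when the gate is open.

```python
def learn_association(dopamine_gate, n_trials=100, learning_rate=0.1):
    """Toy stimulus A -> stimulus B association with a permissive gate.

    dopamine_gate is an assumed scalar in [0, 1]: 1.0 means normal dopamine
    release, 0.0 means release is fully inhibited.
    """
    weight = 0.0  # how strongly A predicts B
    errors = []
    for _ in range(n_trials):
        stimulus_a = 1.0
        stimulus_b = 1.0
        # The sensory prediction error is computed regardless of dopamine.
        prediction_error = stimulus_b - weight * stimulus_a
        errors.append(prediction_error)
        # The update is only applied to the extent the gate permits it.
        weight += learning_rate * dopamine_gate * prediction_error * stimulus_a
    return weight, errors


weight_normal, _ = learn_association(dopamine_gate=1.0)
weight_inhibited, errors_inhibited = learn_association(dopamine_gate=0.0)
print(weight_normal, weight_inhibited, errors_inhibited[-1])
```

With the gate closed, the error stays at its initial value on every trial and nothing is learned, which is the “PE is still there, no model is learned” pattern. The sketch does not answer why the brain would route the update through such a gate instead of using the PE directly.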
And the reward predictions should also converge to the actual rewards, which would give average RPE of 0, to a first approximation.
I guess I misunderstood your model. I assumed that, for a given environment, the ideal policy would lead to a big dopamine release, saying “this was a really good plan, repeat it next time”. After rereading your decision-making post, it seems that the assessors predict the reward, so there will be no dopamine because RPE = 0?
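As a sanity check on the “average RPE of 0” claim, here is a minimal sketch using a simple delta-rule predictor. It assumes a fixed, deterministic reward, and the names (`simulate_rpe`, `learning_rate`) are mine, not from the post.

```python
def simulate_rpe(reward=1.0, learning_rate=0.2, n_trials=50):
    """Return the reward prediction error on each trial of a delta-rule learner."""
    prediction = 0.0
    rpes = []
    for _ in range(n_trials):
        rpe = reward - prediction           # reward prediction error
        prediction += learning_rate * rpe   # move the prediction toward the reward
        rpes.append(rpe)
    return rpes


rpes = simulate_rpe()
print(rpes[0], rpes[-1])
```

The first trial has a large RPE, and it shrinks geometrically as the prediction converges, so a well-learned good plan produces almost no RPE even though the reward itself is unchanged.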
Side question: when you talk about plan assessors, do you think there should be some mechanism in the brainstem that corrects the RPE signals going to the cortex based on the signals sent to the supervised-learning plan assessors? For example, if the plan is to “go eat” and your untrained amygdala says “we don’t need to salivate”, and you don’t salivate, then you get a much smaller reward (especially after crunchy chips) than if you had salivated. Sure, the amygdala and other assessors will get their supervisory signal, but it also seems to me that the plan “go eat” is not that bad, and it shouldn’t be penalized that much just because the amygdala screwed up and you didn’t salivate. So shouldn’t the reward signal be corrected somehow?
Interestingly, the same goes for serotonin: see Fig. 7B in Matias 2017. But it’s also not clear which subset of raphe neurons does this. The picture seems similar to dopamine: projections to different areas respond differently to aversive stimuli.
Closer to this. Well, it wasn’t a fully formed thought; I just came up with the salt example and thought there might be this problem. What I meant is a sort of credit-assignment problem: if dopamine in the midbrain depends on both the cortical action/thought and the assessor’s action, then how does the midbrain assign dopamine to both the cortical plan proposers and the assessors? I guess for this you need a situation where reward(plan1, assessor_action1) > 0 but reward(plan1, assessor_action2) < 0, and the salt example is a bad one here because reward > 0 in both the salivating and non-salivating cases. Maybe something like inappropriately laughing after you’ve been told about some tragedy: you got a negative reward, but that doesn’t mean the topic has to be avoided altogether in the future (which is what a decrease in dopamine would reinforce). Rather, you should just change your assessor’s reaction, and the reward will become positive. My point was that it is not clear how this can happen if the only thing the cortical plan proposer sees is the negative dopamine, without additionally knowing that the assessors also got negative dopamine, so that the overall negative dopamine could be explained entirely by the wrong assessor action and the plan proposer doesn’t actually need to change anything.
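Here is a toy sketch of the problem I have in mind, under my own simplifying assumptions: a single scalar reward is broadcast to both the plan proposer and the assessor, and both use the same delta-rule update. The reward table, the names (`REWARD`, `run_trials`), and the starting values are all hypothetical.

```python
# Hypothetical reward table: the same plan is bad or good depending on the
# assessor's reaction (laughing vs. staying solemn).
REWARD = {
    ("discuss_topic", "laugh"): -1.0,
    ("discuss_topic", "solemn"): 1.0,
}


def run_trials(n_trials=5, learning_rate=0.5):
    """Both learners receive the same scalar reward on every trial."""
    plan_value = 0.0                                  # proposer's value for the plan
    reaction_values = {"laugh": 0.5, "solemn": 0.0}   # assessor starts out miscalibrated
    plan_history = []
    for _ in range(n_trials):
        # The assessor greedily picks its currently preferred reaction.
        reaction = max(reaction_values, key=reaction_values.get)
        reward = REWARD[("discuss_topic", reaction)]
        # The same scalar updates the assessor and the proposer.
        reaction_values[reaction] += learning_rate * (reward - reaction_values[reaction])
        plan_value += learning_rate * (reward - plan_value)
        plan_history.append(plan_value)
    return plan_history, reaction_values


plan_history, reaction_values = run_trials()
print(plan_history)
print(reaction_values)
```

After the first trial the plan’s value is negative, even though only the reaction was wrong. In this sketch the plan is forced to repeat, so its value recovers once the assessor switches to the solemn reaction. A proposer that instead drops negatively valued plans would never find out that the plan itself was fine, which is the case I’m worried about.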