Steven Byrnes comments on Dopamine-supervised learning in mammals & fruit flies

Steven Byrnes 9 Sep 2021 13:58 UTC
2 points
Thanks for your thoughtful & helpful comments!
If by learning for sensory predictions areas you mean modifying synapses in V1, I agree, you might not need synaptic changes or dopamine there, sensory learning (and need for dopamine) can happen somewhere else (hippocampus-entorhinal cortex? no clue) that are sending predictions to V1. The model is learned on the level of knowing when to fire predictions from entorhinish cortex to V1.
Yup, that’s what I meant.
Even if this “dopamine permits to update sensory model” is true, I also don’t get why would you need the intermediate node dopamine between PE and updating the model, why not just update the model after you get cortical (signaled by pyramidal neurons) PE?
For example, I’m sitting on my porch mindlessly watching cars drive by. There’s a red car and then a green car. After seeing the red car, I wasn’t expecting the next car to be green … but I also wasn’t expecting the next car not to be green. I just didn’t have any particular expectations about the next car’s color. So I would say that the “green” is unexpected but not a prediction error. There was no prediction; my models were not wrong but merely agnostic.
In other words:
- The question of “what prediction to make” has a right and wrong answer, and can therefore be trained by prediction errors (self-supervised learning).
- The question of “whether to make a prediction in the first place, as opposed to ignoring the thing and attending to something else (or zoning out altogether)” is a decision, and therefore cannot be trained by prediction errors. If it’s learned at all, it has to be trained by RL, I think.
(In reality, I don’t think it’s a binary “make a prediction about X / don’t make a prediction about X”, instead I think you can make stronger or weaker predictions about things.)
And I think the decision of “whether or not to make a strong prediction about what color car will come after the red car” is not being made in V1, but rather in, I dunno, maybe IT or FEF or dlPFC or HC/EC (like you suggest) or something.
I guess I incorrectly understood your model. I assumed that for the given environment the ideal policy will lead to the big dopamine release, saying “this was a really good plan, repeat it the next time”, after rereading your decision making post it seems that assessors predict the reward, and there will be no dopamine as RPE=0?
To be clear, in regards to “no dopamine”, sometimes I leave out “(compared to baseline)”, so “positive dopamine (compared to baseline)” is a burst and “negative dopamine (compared to baseline)” is a pause. (I should stop doing that!) Anyway, my impression right now is that when things are going exactly as expected, even if that’s very good in some objective sense, it’s baseline dopamine, neither burst nor pause—e.g. in the classic Wolfram Schultz experiment, there was baseline dopamine at the fully-expected juice, even though drinking juice is really great compared to what monkeys might generically expect in their evolutionary environment.
(Exception: if things are going exactly as expected, but it’s really awful and painful and dangerous, there’s apparently still a dopamine pause—it never gets fully predicted away—see here, the part that says “Punishments create dopamine pauses even when they’re fully expected”.)
I’m not sure if “baseline dopamine” corresponds to “slightly strengthen the connections for what you’re doing”, or “don’t change the connection strengths at all”.
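To make the "fully expected reward gives baseline dopamine" point concrete, here's a toy Rescorla-Wagner-style sketch of a cue-then-juice task. Everything in it is an illustrative assumption on my part (the function name `simulate_rpe`, the learning rate, the reward size, and treating RPE as "dopamine relative to baseline"). It doesn't model the punishment exception, and it doesn't settle whether baseline dopamine means "slightly strengthen" or "don't change".

```python
def simulate_rpe(reward=1.0, learning_rate=0.2, trials=30):
    """Toy cue -> juice task. Returns (per-trial RPEs, final predicted reward).

    Hypothetical sketch: `predicted` stands in for an assessor's reward
    prediction, and RPE stands in for dopamine relative to baseline.
    """
    predicted = 0.0  # the naive assessor initially predicts no reward
    rpes = []
    for _ in range(trials):
        rpe = reward - predicted          # burst if > 0, pause if < 0, baseline if ~0
        rpes.append(rpe)
        predicted += learning_rate * rpe  # nudge the prediction toward what happened
    return rpes, predicted


rpes, predicted = simulate_rpe()
omission_rpe = 0.0 - predicted  # the expected juice is withheld after training

print(f"RPE on first trial:       {rpes[0]:+.3f}")
print(f"RPE on last trial:        {rpes[-1]:+.3f}")
print(f"RPE when juice is omitted: {omission_rpe:+.3f}")
```

In this sketch the RPE starts as a full-sized burst, decays toward zero as the juice becomes expected, and goes negative (a pause) if the expected juice is withheld.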
Side question: when you talk about plan assessors, do you think there should be some mechanism in the brainstem that corrects RPE signals going to the cortex based on the signals sent to supervised learning plan assessors? For example, if the plan is to “go eat” and your untrained amygdala says “we don’t need to salivate”, and you don’t salivate, then you get a way smaller reward (especially after crunchy chips) than if you had salivated. Sure, amygdala/other assessors will get their supervisory signal, but it also seems to me that the plan “go eat” is not that bad and it shouldn’t be disrewarded that much just because amygdala screwed up and you didn’t salivate, so the reward signal should be corrected somehow?
OK, let’s say I happen to really need salt right now. I grab what I thought was an unsalted peanut and put it in my mouth, but actually it’s really salty. Awesome!
From a design perspective, whatever my decisionmaking circuits were doing just now was a good thing to do, and they ought to receive an RL-type dopamine burst ensuring that it happens again.
My introspective experience matches that: I’m surprised and delighted that the peanut is salty.
Your comment suggests that this is not the default, but requires some correction mechanism. I’m kinda confused by what you wrote; I’m not exactly sure where you’re coming from.
Maybe you’re thinking: it’s aversive to put something salty in your mouth without salivating first. Well, why should it be aversive? It’s not harmful for the organism, it just takes a bit longer to swallow. Anyway, the decisionmaking circuit didn’t do anything wrong. So I would expect “putting salty food into a dry mouth” to be wired up as not inherently aversive. That seems to match my introspective experience.
Or maybe you’re thinking: My hypothalamus & brainstem are tricked by the amygdala / AIC / whatever to treat the peanut as not-salty? Well, the brainstem has ground truth (there’s a direct line from the taste buds to the medulla I think) so whatever they were guessing before doesn’t matter; now that the peanut is in the mouth, they know it’s definitely salty, and will issue appropriate signals.
You can try again if I didn’t get it. :)
- bastak 19 Sep 2021 8:16 UTC
  3 points
  Parent
  Exception: if things are going exactly as expected, but it’s really awful and painful and dangerous, there’s apparently still a dopamine pause—it never gets fully predicted away
  Interestingly, the same goes for serotonin: see Fig. 7B in Matias 2017. But it’s also not clear which subset of raphe neurons does this. The picture seems similar to dopamine: projections to different areas respond differently to aversive stimuli.
  Maybe you’re thinking: it’s aversive to put something salty in your mouth without salivating first.
  Closer to this. Well, it wasn’t a fully formed thought; I just came up with the salt example and thought there might be this problem. What I meant is a sort of credit assignment problem: if your midbrain dopamine depends on both the cortical action/thought and the assessor action, then how does the midbrain assign dopamine to both the cortical plan proposers and the assessors? I guess for this you need a situation where reward(plan1, assessor_action1) > 0 but reward(plan1, assessor_action2) < 0, and the salt example is bad here because reward > 0 in both the salivating and not-salivating cases. Maybe something like inappropriately laughing after you’ve been told about some tragedy: you get a negative reward, but that doesn’t mean the topic has to be avoided altogether in the future (which is what the decrease in dopamine would reinforce); rather, you should just change your assessor reaction, and the reward will become positive. My point was that it’s not clear how this can happen if the only thing the cortical plan proposer sees is the negative dopamine, without additionally knowing that the assessors also got negative dopamine, so that the overall negative dopamine can be explained by the wrong assessor action alone and the plan proposer doesn’t actually need to change anything.
  - Steven Byrnes 19 Sep 2021 21:33 UTC
    2 points
    Parent
    Oh I gotcha. Well one thing is, I figure the whole system doesn’t come crashing down if the plan-proposer gets an “incorrect” reward sometimes. I mean, that’s inevitable—the plan-assessors keep getting adjusted over the course of your life as you have new experiences etc., and the plan-proposer has to keep playing catch-up.
    But I think it’s better than that.
    Here’s an alternate example that I find a bit cleaner (sorry if it’s missing your point). You put something in your mouth expecting it to be yummy (and thus release certain hormones), but it’s actually gross (so you make a disgust face, release different hormones, etc.). So reward(plan1, assessor_action1) > 0 but reward(plan1, assessor_action2) < 0. I think as you bring the food towards your mouth, you’re getting assessor_action1 and hence the “wrong” reward, but once it’s in your mouth, your hypothalamus / brainstem immediately pivots to assessor_action2 and hence the “right” reward. And the “right” reward is stronger than the “wrong” reward, because it’s driven by a direct ground-truth experience, not just an uncertain expectation. So in the end the plan proposer would get the right training signal overall, I think.