This might just be me not grokking predictive processing, but...
I feel like I do a version of the rat’s task all the time to decide what to have for dinner—I imagine different food options, feel which one seems most appetizing, and then push the button (on Seamless) that will make that food appear.
Introspectively, it feels to me like there’s such a thing as ‘hypothetical reward’. When I imagine a particular food, I feel like I get a signal from… somewhere… that tells me whether I would feel reward if I ate that food, but does not itself constitute reward. I don’t generally feel any desire to spend time fantasizing about the food I’m waiting for.
To turn this into a brain model, this seems like the neocortex calling an API the subcortex exposes. Roughly, the neocortex can give the subcortex hypothetical sensory data and get a hypothetical reward in exchange. I suppose this is basically hypothesis two with a modification to avoid the pitfall you identify, although that’s not how I arrived at the idea.
This does require a second dimension of subcortex-to-neocortex signal alongside the reward. Is there a reason to think there isn’t one?
Haha, yeah, there’s a song about that.

So anyway, I think you’re onto something, and I think that something is that “reward” and “reward prediction” are two distinct concepts, but they’re all jumbled up in my mind, and therefore presumably also jumbled up in my writings. I’ve been vaguely aware of this for a while, but thanks for calling me out on it; I should clean up my act. So I’m thinking out loud here, bear with me, and I’m happy for any help. :-)
The TD learning algorithm is:
RPE = Reward Prediction Error = (r + V(s_new)) − V(s_prev)
V(s_prev) += (learning rate) ⋅ (RPE)
where s_prev is the previous state, s_new is the new state, V is the value function a.k.a. reward prediction, and r is the reward from this step.
(I’m ignoring discounting. BTW, I don’t think the brain literally does TD learning in the exact form that computer scientists do, but I think it’s close enough to get the right idea.)
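(If it helps, here’s that update rule as a few lines of Python: a minimal tabular sketch, with an arbitrary learning rate.)

```python
# Minimal tabular sketch of the two lines above. V is a dict mapping
# states to reward predictions; the 0.1 learning rate is arbitrary.

def td_step(V, s_prev, s_new, r, learning_rate=0.1):
    rpe = (r + V[s_new]) - V[s_prev]   # RPE = (r + V(s_new)) - V(s_prev)
    V[s_prev] += learning_rate * rpe   # V(s_prev) += (learning rate) * RPE
    return rpe
```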
So let’s go through two scenarios.
Scenario A: I’m going to eat candy, anticipating a large reward (high V(s_prev)). I eat the candy (high r) then don’t anticipate any reward after that (low V(s_new)). RPE = 0 here. That’s just what I expected. V went down, but it went down in lock-step with the arrival of the reward r.
Scenario B: I’m going to eat candy, anticipating a large reward (high V(s_prev)). Then I see that we’re out of candy! So I get no reward and have nothing to look forward to (r = 0, low V(s_new)). Now this is a negative (bad) RPE! Subjectively, this feels like crushing disappointment. The TD learning rule kicks in here, so that next time when I go to eat candy, I won’t be expecting as much reward as I did this time (lower V(s_prev) than before), because I will be preemptively braced for the possibility that we’ll be out of candy.
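Plugging made-up numbers into the RPE formula makes the two scenarios concrete:

```python
# Scenario A: the candy arrives just as anticipated.
V_prev, r, V_new = 10.0, 10.0, 0.0
print((r + V_new) - V_prev)   # RPE = 0.0: no surprise, even though V dropped

# Scenario B: we're out of candy.
V_prev, r, V_new = 10.0, 0.0, 0.0
print((r + V_new) - V_prev)   # RPE = -10.0: crushing disappointment
```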
OK, makes sense so far.
Interestingly, the reward r, as such, barely matters here! It’s not decision-relevant, right? Good actions can be determined entirely by the following rule:
Each step, do whatever maximizes RPE.
(right?)
Or subjectively, thoughts and sensory inputs with positive RPE are attractive, while thoughts and sensory inputs with negative RPE are aversive.
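Concretely, as a sketch (the candidate actions and all the numbers here are invented):

```python
# "Each step, do whatever maximizes RPE." Each candidate is a tuple of
# (action, expected reward r, expected next state s_new).

def best_action(V, s_prev, candidates):
    return max(candidates, key=lambda c: (c[1] + V[c[2]]) - V[s_prev])[0]

V = {"start": 5.0, "ate salt": 0.0, "did nothing": 0.0}
print(best_action(V, "start", [("eat salt", 9.0, "ate salt"),
                               ("walk away", 0.0, "did nothing")]))  # "eat salt"
```

In other words, maximizing one-step RPE ranks actions the same as maximizing r + V(s_new), since V(s_prev) is fixed at decision time.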
OK, so when the rat first considers the possibility that it’s going to eat salt, it gets a big injection of positive RPE. It now (implicitly) expects a large upcoming reward. Let’s say for the sake of argument that it decides not to eat the salt and go do something else. Well, now we’re not expecting to eat the salt, whereas previously we were, so that’s a big injection of negative RPE. So basically, once it gets the idea that it can eat salt, it’s very aversive (negative RPE) to drop that idea without actually consummating it (by eating the salt and getting the anticipated reward r).
Back to your food example: you go in with some baseline expectation for what dinner’s going to be like. Then you invoke the idea “I’m going to eat yam”. You get a negative RPE in response. OK, go back to the baseline plan then. You get a compensatory positive RPE. Then you invoke the idea “I’m going to eat beans”. You get a positive RPE. Alright! You think about it some more. Oh, I can’t have beans tonight, I don’t have any. You drop the idea and suffer a negative RPE. That’s aversive, but you’re stuck. Then you invoke another idea: “I’m going to eat porridge”. Positive RPE! As you flesh out the plan, you become more confident, which activates the model more strongly; the idea in your head of having porridge becomes more vivid, so to speak. Each increment of increasing confidence that you’re going to eat porridge is rewarded by a corresponding spurt of positive RPE. Then you eat the porridge. V drops back down, but the reward r arrives at the same time, so that’s fine; there’s no RPE.
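Here’s that whole deliberation as a toy trace, treating each change of mind as a change in the overall reward prediction (all the numbers are invented):

```python
# Each event is (label, new overall reward prediction V, reward r received).
# Momentary RPE = (r + V_new) - V_prev: a spurt for every change of mind.
events = [
    ("imagine yam",              2.0,  0.0),   # negative RPE
    ("back to baseline plan",    5.0,  0.0),   # compensatory positive RPE
    ("imagine beans",            9.0,  0.0),   # positive RPE
    ("realize: no beans",        5.0,  0.0),   # negative RPE, but you're stuck
    ("imagine porridge",         8.0,  0.0),   # positive RPE
    ("plan gets more definite", 10.0,  0.0),   # another spurt as confidence grows
    ("eat the porridge",         0.0, 10.0),   # V collapses but r arrives: RPE = 0
]
V_prev = 5.0   # the baseline expectation you go in with
for label, V_new, r in events:
    print(f"{label:25s} RPE = {(r + V_new) - V_prev:+.1f}")
    V_prev = V_new
```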
Let’s go to fantasizing in general. Let’s say you get the idea that a wad of cash has magically appeared in your wallet. That idea is attractive (positive RPE). But sooner or later you’re going to actually look in the wallet and find that there’s no wad of cash (negative RPE). The negative RPE triggers the TD learning rule such that next time “the idea that a wad of cash has magically appeared in your wallet” will not be such an attractive idea; it will be tinged with the negative memory of it failing to happen. Of course, you could go the other way and try to avoid the negative RPE by clinging to the original story—like, don’t look in your wallet, or if you see that the cash isn’t there, you think “guess I must have deposited it in the bank already”, etc. This is unhealthy but certainly a known human foible. For example, as of this writing, in the USA, each of the two major presidential candidates has millions of followers who believe that their preferred candidate will be president for the next four years. It’s painful to let go of an idea that something good is going to happen, so you resist if at all possible. Luckily the brain has some defense systems against wishful thinking. For example, you can’t not expect something to happen that you’ve directly experienced multiple times. See here. Another is: if you do eventually come back to earth, and the negative RPE finally does happen, then TD learning kicks in, and all the ideas and strategies that contributed to your resisting the truth until now get tarred with a reduction in their associated reward predictions, which makes them less likely to be used next time.
Hmm, so maybe I had it right in the diagram here: I had the neocortex sending reward predictions to the subcortex, and the subcortex sending back RPEs to the neocortex. So if the neocortex sends a high reward prediction, then a low reward prediction, that might or might not be an RPE, depending on whether you just ate candy in between. Here, the subcortex sends a positive RPE when the neocortex starts imagining tasting salt, and sends a negative RPE when it stops imagining salt (unless it actually ate the salt at that moment). And if the salt imagination / expectation signal gets suddenly stronger, it sends a positive RPE for the difference, and so on.
(I could make a better diagram by pulling a “basal ganglia” box out of the neocortex subsystem into a separate box in the diagram. My understanding, definitely oversimplified, is that the basal ganglia has a dense web of connections across the (frontal lobe of the) neocortex, and just memorizes reward predictions associated with different arbitrary neocortical patterns. And it also suppresses patterns that lead to lower reward predictions and amplifies patterns that lead to higher reward predictions. So in the diagram, the neocortex would send “information” to the basal ganglia, the basal ganglia calculates a reward prediction and sends it to the subcortex, and the subcortex sends the RPE to the basal ganglia (to alter the reward predictions) and to the neocortex (to reinforce or weaken the associated patterns). Something like that...).
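(In code-cartoon form, under the same oversimplifications, that loop might look like this; all three boxes are stubs and the numbers are invented:)

```python
# Cartoon of the amended diagram. The basal ganglia memorizes reward
# predictions for neocortical patterns; the subcortex turns successive
# predictions (plus actual reward) into an RPE that flows back to both.

bg_memory = {"doing nothing": 0.0, "imagining tasting salt": 8.0}

def basal_ganglia(pattern):
    return bg_memory.get(pattern, 0.0)   # stored reward prediction

def subcortex(r, v_new, v_prev):
    return (r + v_new) - v_prev          # the RPE it sends back up

v_prev = basal_ganglia("doing nothing")
v_new = basal_ganglia("imagining tasting salt")
print(subcortex(r=0.0, v_new=v_new, v_prev=v_prev))   # +8.0 as the salt thought switches on
# That RPE then goes to the basal ganglia (to adjust stored predictions)
# and to the neocortex (to reinforce or weaken the active pattern).
```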
Does that make sense? Sorry this is so long. Happy for any thoughts if you’ve read this far.
Another update: Actually maybe it’s simpler (and equivalent) to say the subcortex gives a reward proportional to the time-derivative of how strongly the salt-expectation signal is activated.
Thanks for the reply; I’ve thought it over a bunch, and I think my understanding is getting clearer.
I think one source of confusion for me is that to get any mileage out of this model, I have to treat the neocortex as a black box trying to maximize something, but it seems like we also need to rely on the fact that it executes a particular algorithm with certain constraints.
For instance, if we think of the ‘reward predictions’ sent to the subcortex as outputs the neocortex chooses, the neocortex has no reason to keep them in sync with the rewards it actually expects to receive—instead, it should just increase the reward predictions to the maximum for some free one-time RPE and then leave it there, while engaging in an unrelated effort to maximize actual reward.
(The equation V(s_prev) += (learning rate) ⋅ (RPE) explains why the neocortex can’t do that, but adding a mathematical constraint to my intuitive model is not really a supported operation. If I say “the neocortex is a black box that does whatever will maximize RPE, subject to the constraint that it has to update its reward predictions according to that equation,” then I have no idea what the neocortex can and can’t do)
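(Though, to be fair, plugging numbers into the equation does show what the constraint buys: per-step RPEs telescope across an episode, so an unearned jump in V gets repaid in full later. A toy check, with arbitrary numbers:)

```python
# Per-step RPE_t = (r_t + V[t+1]) - V[t]; summed over an episode these
# telescope to sum(r) + V[-1] - V[0]. Inflating a middle prediction buys
# positive RPE now at the cost of an equal negative RPE later.

rewards    = [0.0, 0.0, 1.0]
honest_V   = [1.0, 1.0, 1.0, 0.0]
inflated_V = [1.0, 9.0, 9.0, 0.0]   # claim a huge upcoming reward mid-episode

for V in (honest_V, inflated_V):
    rpes = [(rewards[t] + V[t + 1]) - V[t] for t in range(len(rewards))]
    print(rpes, "-> total:", sum(rpes))
# [0.0, 0.0, 0.0] -> total: 0.0
# [8.0, 0.0, -8.0] -> total: 0.0   (the exploit nets to nothing)
```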
Adding in the basal ganglia as an ‘independent’ reward predictor seems to work. My first thought was that this would lead to an adversarial situation where the neocortex is constantly incentivized to fool the basal ganglia into predicting higher rewards, but I guess that isn’t a problem if the basal ganglia is good at its job.
Still, I feel like I’m missing a piece to be able to understand imagination as a form of prediction. Imagining eating beans to decide how rewarding they would be doesn’t seem to get any harder if I already know I don’t have any beans. And it doesn’t feel like “thoughts of eating beans” are reinforced, it feels like I gain abstract knowledge that eating beans would be rewarded.
Meanwhile, it’s quite possible to trigger physiological responses by imagining things. Certainly the response tends to be stronger if there’s an actual possibility of the imagined thing coming to pass, but it seems like there’s a floor on the effect size, where arbitrarily low probability eventually stops weakening the effect. This doesn’t seem like it stops working if you keep doing it—AIUI, not all hungry people are happier when they imagine glorious food, but they all salivate. So that’s a feedback channel separate from reward. I don’t see why there couldn’t also be similar loops entirely within the brain, but that’s harder to prove.
So when our rat thinks about salt, the amygdala detects that and alerts… idk, the hypothalamus? The part that knows it needs salt… and the rat starts salivating and feels something in its stomach that it previously learned means “my body wants the food” and concludes eating salt would be a good idea.
Strong agree that I have lots of detailed thoughts about the neocortex’s algorithms and am probably implicitly leaning on them in ways that I’m not entirely aware of and not communicating well. I appreciate your working with me. :-)
I do want to walk back the reward prediction error stuff a bit. I think the following is equivalent but simpler:
I propose that the subcortex sends a reward related to the time-derivative of how strongly the neocortex is imagining / expecting to taste salt. So the neocortex gets a reward for first entertaining the idea of tasting salt, and another incremental reward for growing that idea into a definite plan. But then it would get a negative reward for dropping that idea.

(I think this is maybe related to the Russell-Ng potential-based reward shaping thing.)
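To spell out that connection: rewarding the time-derivative of an expectation signal is potential-based shaping in the Ng-Harada-Russell sense, F(s, s') = γΦ(s') − Φ(s), with the expectation strength playing the role of the potential Φ. With γ = 1, any entertain-then-drop trajectory nets to zero. A toy check (numbers invented):

```python
# Reward = change in the salt-expectation signal, i.e. potential-based
# shaping F(s, s') = gamma * phi(s') - phi(s), with phi = expectation strength.

gamma = 1.0
expectation = [0.0, 0.25, 0.75, 1.0, 0.0]   # idea ramps up, then gets dropped
shaping = [gamma * expectation[t + 1] - expectation[t]
           for t in range(len(expectation) - 1)]
print(shaping)        # [0.25, 0.5, 0.25, -1.0]
print(sum(shaping))   # 0.0: dropping the idea claws it all back
```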
the neocortex is constantly incentivized to fool the basal ganglia into predicting higher rewards
Well, there’s a couple things, I think.
First, the neocortex can’t just expect arbitrary things. It’s constrained by self-supervised learning, which throws out models that have, in the past, made predictions refuted by experience. Like, let’s say that every time you open the door, the handle makes a click. You’re going to start expecting the click to happen. You have no choice, you can’t not expect it! There are also constraints around self-consistency and other things, like you can’t visualize something that is simultaneously stationary and dancing; those two models are just inconsistent, and the message-passing algorithm will simply not allow both to be active at the same time.
Second, I think that one neocortex “thought” is made up of a large number of different components, and all of them carry separate reward predictions, which are combined (somehow) to get the attractiveness of the overall thought. Like, when you decide to step outside, you might expect to feel cold and sore muscles and wind and you’ll say goodbye to the people inside … all those different components could have different attractiveness. And an RPE changes the reward predictions of all of the ingredients of the thought, I think.
So like, if you’re very hungry but have no food, you can say to yourself “I’m going to open my cupboard and find that food has magically appeared”, and it seems like that should be a positive-RPE thought. But actually, the thought doesn’t carry a net positive reward prediction. The “I will find food” part by itself does, but meanwhile you’re also activating the thought “I am fooling myself”, and the previous 10 times that thought was active, it carried a negative RPE, so that thought carries a very negative reward prediction whenever it’s invoked. But you can’t get rid of that thought, because it previously made correct sensory predictions in this kind of situation—that’s the previous paragraph.
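A toy version of that composition, with invented components and numbers (and a plain sum standing in for the “(somehow)” above):

```python
# One "thought" as a bundle of components, each carrying its own learned
# reward prediction; the bundle's attractiveness combines them (here, a sum).

magic_cupboard_thought = {
    "I will find food":     +6.0,   # attractive in itself
    "opening the cupboard":  0.0,
    "I am fooling myself":  -9.0,   # tarred by the last 10 disappointments
}
print(sum(magic_cupboard_thought.values()))   # -3.0: net aversive overall
```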
Imagining eating beans to decide how rewarding they would be doesn’t seem to get any harder if I already know I don’t have any beans. And it doesn’t feel like “thoughts of eating beans” are reinforced, it feels like I gain abstract knowledge that eating beans would be rewarded.
I would posit that it’s a subtle effect in this particular example, because you don’t actually care that much about beans. I would say “You get a subtle positive reward for entertaining the idea of eating beans, and then if you realize that you’re out of beans and put the idea aside, you get a subtle negative reward upon going back to baseline.” I think if you come up with less subtle examples it might be easier to think about, perhaps.
My general feeling is that if you just abstractly think about something for no reason in particular, it activates the models weakly (and ditto if you hear that someone else is thinking about that thing, or remember that thing in the past, etc.) If you start to think of it as “something that will happen to me”, that activates the models more strongly. If you are directly experiencing the thing right now, it activates the model most strongly of all. I acknowledge that this is vague and unjustified, I wrote this but it’s all pretty half-baked.
An additional complication is that, as above, one thought consists of a bunch of component sub-thoughts, which all impact the reward prediction. If you imagine eating beans knowing that you’re not actually going to, the “knowing that I’m not actually going to” part of the thought can have its own reward prediction, I suppose.
Oh, yet another thing is that I think maybe we have no subjective awareness of “reward”, just RPE. (Reward does not feel rewarding!) So if we (1) decide “I will imagine yummy food”, then (2) imagine yummy food, then (3) stop imagining yummy food, we get a positive reward from the second step and a negative reward from the third step, but both of those rewards were already predicted by the first step, so there’s no RPE in either the second or third step, and therefore they don’t feel positive or negative. Unless we’re hungrier than we thought, I guess...
it seems like there’s a floor on the effect size, where arbitrarily low probability eventually stops weakening the effect
Yeah sure, if a model is active at all, it’s active above some threshold, I think. Like, if the neuron fires once every 10 minutes, then, well, the model is not actually turned on and affecting the brain. This is probably related to our inability to deal with small probabilities.
Meanwhile, it’s quite possible to trigger physiological responses by imagining things.
Yes, I would say the “neocortex is imagining / expecting to taste salt” signal has many downstream effects, one of which is affecting the reward signal, one of which is causing salivation.
This doesn’t seem like it stops working if you keep doing it
Really? I think that if some thought causes you to salivate, but doesn’t actually ever lead to eating for hours afterwards, and this happens over and over again for weeks, your systems would learn to stop salivating. I guess I don’t know for sure. Didn’t Pavlov do that experiment? See also my “scary movie” example here.
the rat starts salivating and feels something in its stomach that it previously learned means “my body wants the food” and concludes eating salt would be a good idea
Basically, there could be a non-reward signal that indicates “whatever you’re thinking of, eat it and you’ll feel rewarded”. And that could be learned from eating other food over the course of life. Yeah, sure, that could work. I think it would sorta amount to the same thing, because the neocortex would just turn that signal into a reward prediction, and register a positive RPE when it sees it. So why not just cut out the middleman and create a positive RPE by sending a reward? I guess you would argue that if it’s not at all rewarding to imagine food that you know you’re not going to eat, your theory fits that better.
Still thinking about it.

Thanks again, you’re being very helpful :-)

Glad to hear this is helpful for you too :)

I didn’t really follow the time-derivative idea before, and since you said it was equivalent I didn’t worry about it :p. But either it’s not really equivalent or I misunderstood the previous formulation, because I think everything works for me now.
So if we (1) decide “I will imagine yummy food”, then (2) imagine yummy food, then (3) stop imagining yummy food, we get a positive reward from the second step and a negative reward from the third step, but both of those rewards were already predicted by the first step, so there’s no RPE in either the second or third step, and therefore they don’t feel positive or negative. Unless we’re hungrier than we thought, I guess...
Well, what exactly happens if we’re hungrier than we thought?
(1) “I will imagine food”: No reward yet, expecting moderate positive reward followed by moderate negative reward.
(2) [Imagining food]: Large positive reward, but now expecting large negative reward when we stop imagining, so no RPE on previous step.
(3) [Stops imagining food]: Large negative reward as expected, no RPE for previous step.
The size of the reward can then be informative, but not actually rewarding (since it predictably nets to zero over time). The neocortex obtains hypothetical reward information from the subcortex without actually extracting a reward—which is the thing I’ve been insisting had to be possible. Turns out we don’t need to use a separate channel! And the subcortex doesn’t have to know or care whether it’s receiving a genuine prediction or an exploratory imagining from the neocortex—the incentives are right either way.
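Running made-up magnitudes through those three steps shows the property: every RPE is zero, while the size of r at step (2) still carries the information:

```python
# The three steps as (label, new reward prediction V, reward r). If step (1)
# correctly predicts what follows, every RPE is zero, yet the size of r at
# step (2) still tells the neocortex how hungry it is.
steps = [
    ("decide to imagine food",  0.0,  0.0),  # expects +8 then -8, netting to 0
    ("imagining food",         -8.0, +8.0),  # big reward, matching debt now expected
    ("stop imagining",          0.0, -8.0),  # the debt comes due, as predicted
]
V_prev = 0.0
for label, V_new, r in steps:
    print(f"{label:24s} r = {r:+.1f}  RPE = {(r + V_new) - V_prev:+.1f}")
    V_prev = V_new
```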
(We do still need some explanation of why the neocortex can imagine (predict?) food momentarily but can’t keep imagining it forever, avoiding step (3) and pocketing a positive RPE after step (2). Common sense suggests one: keeping such a thing up is effortful, so you’d be paying ongoing costs for a one-time gain, and unless you can keep it up forever, the reward still nets to zero in the end)