[Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning

5.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

In the previous post, I discussed the “short-term predictor”—a circuit which, thanks to a learning algorithm, emits an output that predicts a ground-truth supervisory signal arriving a short time (e.g. a fraction of a second) later.

In this post, I propose that if we take a short-term predictor and wrap it into a closed loop with a bit more circuitry, we wind up with a new module that I call a “long-term predictor”. Just like it sounds, this circuit can make longer-term predictions, e.g. “I’m likely to eat in the next 10 minutes”. This circuit is closely related to Temporal Difference (TD) learning, as we’ll see.

I will argue that there is a large collection of side-by-side long-term predictors in the brain, each comprising a short-term predictor in the telencephalon (involving specific areas such as ventral striatum, medial prefrontal cortex, and amygdala) that loops down to the Steering Subsystem (hypothalamus and brainstem) and then back via a dopamine neuron. These long-term predictors make predictions about biologically-relevant inputs and outputs—for example, one long-term predictor might predict whether I’ll feel pain in my arm, another whether I’ll get goosebumps, another whether I’ll release cortisol, another whether I’ll eat, and so on. Moreover, one of these long-term predictors is essentially a value function for reinforcement learning.

All these predictors will play a major role in motivation—a story which I will finish in the next post.

Table of contents:

  • Section 5.2 starts with a toy model of a “long-term predictor” circuit, consisting of the “short-term predictor” of the previous post, plus some extra components, wrapped into a closed loop. Getting a good intuitive understanding of this model will be important going forward, and I will walk through how that model would behave under different circumstances.

  • Section 5.3 relates that model to Temporal Difference (TD) learning, which is closely related to a “long-term predictor”. I’ll show two variants of the long-term predictor circuit, a “summation” version (which leads to a value function that approximates the sum of future rewards), and a “switch” version (which leads to a value function that approximates the next reward, whenever it should arrive, which may not be for a long time). The “summation” version is universal in AI literature, but I’ll suggest that the “switch” version is probably closer to what happens in the brain. Incidentally, these two models are equivalent in cases like AlphaGo, wherein reward arrives in a lump sum right at the end of each episode (= game of Go).

  • Section 5.4 will relate long-term predictors to the neuroanatomy of (part of) the telencephalon and brainstem.

    • For the “vertical” neuroanatomy,[1] I’ll describe how the brain houses a huge number of parallel “cortico-basal ganglia-thalamo-cortical loops”, and I’ll suggest that some of these loops function as short-term predictors, with a dopamine signal as supervisor.

    • For the “horizontal” neuroanatomy, I’ll propose that the supervised learning I’m talking about involves (for example) the medial prefrontal cortex, ventral striatum, anterior insular cortex, and amygdala.

  • Section 5.5 will offer six lines of evidence that lead me to believe this story: (1) It’s a sensible way to implement a biologically-useful capability; (2) It’s introspectively plausible; (3) It’s evolutionarily plausible; (4) It offers a reconciliation between the “visceromotor” and “motivational” ways to describe the medial prefrontal cortex; (5) It explains the Dead Sea Salt experiment; and (6) It offers a nice explanation of the diversity of dopamine neuron activity.

5.2 Toy model of a “long-term predictor” circuit

A “long-term predictor” is ultimately nothing more than a short-term predictor whose output signal helps determine its own supervisory signal. Here’s a toy model of what that can look like:

Toy model of a long-term prediction circuit. I’ll spend the next couple subsections walking through how this works. Edited to add: For this and all similar diagrams in this post, every block at every moment is running in parallel, and likewise every arrow at every moment is carrying a numerical value. So this is NOT a control-flow diagram for serial code; rather, it’s the kind of diagram you might see describing an FPGA, for example.
  • The blue box is the short-term predictor of the previous post. It optimizes its output signal such that it approximates what the supervisor signal will be in 0.3 seconds (as an example).

  • The purple box is a 2-way switch. The toggle on the switch is controlled by genetically-hardwired circuitry (gray oval), according to the following rules:

    1. By and large, the switch is in the bottom setting (“defer-to-predictor mode”). This setting is akin to the genetically-hardwired circuitry “trusting” that the short-term predictor’s output is sensible, and in particular producing the suggested amount of digestive enzymes.

    2. If the genetically-hardwired circuitry gets a signal that I’m eating something right now, and that I don’t have adequate digestive enzymes, it flips the switch to “override mode”. Regardless of what the short-term predictor says, it sends the signal to manufacture digestive enzymes.

    3. If the genetically-hardwired circuitry has been asking for digestive enzyme production for an extended period, and there’s still no food being eaten, then it again flips the switch to “override mode”. Regardless of what the short-term predictor says, it sends the signal to stop manufacturing digestive enzymes.

Note: You can assume that all the signals in the diagram can vary continuously across a range of values (as opposed to being discrete on/off signals), with the exception of the signal that toggles the 2-way switch.[2] In the brain, smoothly-adjustable signals might be created by, for example, rate-coding—i.e., encoding information as the frequency with which a neuron is firing.
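To make the wiring concrete, here’s a minimal Python sketch of one tick of this closed loop. Everything about it is an illustrative assumption on my part (the linear predictor, the delta-rule update, the specific override conditions, and the trick of modeling the 0.3-second delay as a single timestep), so treat it as pseudocode for the diagram, not a claim about how the brain implements it.

```python
import numpy as np

class ShortTermPredictor:
    """Stand-in for the short-term predictor of the previous post: a linear model
    trained so that its output approximates the supervisor signal ~0.3 s later."""
    def __init__(self, n_context, lr=0.05):
        self.w = np.zeros(n_context)
        self.lr = lr

    def predict(self, context):
        return float(self.w @ context)

    def update(self, old_context, old_output, supervisor):
        # Error = what the supervisor actually turned out to be, minus the earlier prediction.
        error = supervisor - old_output
        self.w += self.lr * error * old_context  # simple delta-rule weight update


def steering_switch(predictor_output, eating_now, enzymes_adequate, request_is_stale):
    """Hypothetical genetically-hardwired 2-way switch from the toy model."""
    if eating_now and not enzymes_adequate:
        return 1.0                # override mode: make digestive enzymes, no matter what
    if request_is_stale and not eating_now:
        return 0.0                # override mode: stop making enzymes, nothing is happening
    return predictor_output       # defer-to-predictor mode: pass the prediction straight through


def one_timestep(predictor, context, prev_context, prev_output, prev_switch_output, **status):
    """One tick of the closed loop, with the 0.3-second delay modeled as 'one timestep'."""
    output = predictor.predict(context)
    switch_output = steering_switch(output, **status)  # this is what actually drives enzyme production
    # The switch output from the *previous* tick is the supervisor for the *previous* prediction:
    predictor.update(prev_context, prev_output, prev_switch_output)
    return output, switch_output
```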

5.2.1 Toy model walkthrough part 1: static context

Let’s walk through what would happen in this toy model.[3] To start with, assume that the “context” is static for some extended period of time. For example, imagine a situation where some ancient worm-like creature is digging in the sandy ocean bed for many consecutive minutes. Plausibly, its sensory environment would stay pretty much constant as long as it keeps digging, as would its thoughts and plans (insofar as this ancient worm-like creature has “thoughts and plans” in the first place). Or if you want another example of (approximately) static context—this one involving a human rather than a worm—hang on until the next subsection.

In the static-context case, let’s first consider what happens when the switch is sitting in “defer-to-predictor mode”: Since the output is looping right back to the supervisor, there is no error in the supervised learning module. The predictions are correct. The synapses aren’t changing. Even if this situation is very common, it has no bearing on how the short-term predictor eventually winds up behaving.

The times that do matter for the eventual behavior of the short-term predictor are those rare times that we go into “override mode”. Think of the overrides as like a sporadic “injection of ground truth”. They produce an error signal in the short-term predictor’s learning algorithm, changing its adjustable parameters (e.g. synapse strengths).

After enough life experience (a.k.a. “training” in ML terminology), the short-term predictor should have the property that the overrides balance out. There may still be occasional overrides that increase digestive-enzyme production, and there may still be occasional overrides that decrease digestive-enzyme production, but those two types of overrides should happen with similar frequency. After all, if they didn’t balance out, the short-term predictor’s internal learning algorithm would gradually change its parameters so that they did balance out.
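To illustrate that “balance out” claim numerically, here’s a tiny simulation under deliberately toy assumptions: one fixed context (so the predictor collapses to a single adjustable number), rare overrides, and made-up ground-truth values.

```python
import random

# Hypothetical numbers, purely for illustration: with one fixed context, the predictor
# reduces to a single adjustable number `prediction`. Occasionally the Steering Subsystem
# overrides with some ground-truth enzyme level (here drawn uniformly from [0.4, 1.0]).
random.seed(0)
prediction, lr = 0.0, 0.1

for step in range(20_000):
    if random.random() < 0.05:                          # rare "injection of ground truth"
        ground_truth = random.uniform(0.4, 1.0)
        prediction += lr * (ground_truth - prediction)  # error signal -> parameter update
    # otherwise: defer-to-predictor mode, supervisor == prediction, error == 0, no learning

print(f"{prediction:.2f}")  # settles near 0.70, where upward and downward overrides balance out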

And that’s just what we want! We’ll wind up with appropriate digestive enzyme production at appropriate times, in a way that properly accounts for any information available in the context data—what the animal is doing right now, what it’s planning to do in the future, what its current sensory inputs are, etc.

5.2.1.1 David-Burns-style exposure therapy—a possible real-life example of the toy model with static context?

As it happens, I recently read David Burns’s book Feeling Great (my review). David Burns has a very interesting approach to exposure therapy—an approach that happens to serve as an excellent example of how my toy model works in the static-context situation!

Here’s the short version. (Warning: If you’re thinking of doing exposure therapy on yourself at home, at least read the whole book first!) Excerpt from the book:

For example, when I was in high school, I wanted to be on the stage crew of Brigadoon, a play my school was putting on, but it required overcoming my fear of heights since the stage crew had to climb ladders and work near the ceiling to adjust the lights and curtains. My drama teacher, Mr. Krishak, helped me overcome this fear with the very type of exposure techniques I’m talking about. He led me to the theater and put a tall ladder in the middle of the stage, where there was nothing nearby to grab or hold on to. He told me all I had to do was stand on the top of the ladder until my fear disappeared. He reassured me that he’d stand on the floor next to me and wait.

I began climbing the ladder, step by step, and became more and more frightened. When I got to the top, I was terrified. My eyes were almost 18 feet from the floor, since the ladder was 12 feet tall, and I was just over 6 feet tall. I told Mr. Krishak I was in a panic and asked what I should do. Was there something I should say, do, or think about to make my anxiety go away? He shook his head and told me to just stand there until I was cured.

I continued to stand there in terror for about ten more minutes. When I told Mr. Krishak I was still in a panic, he assured me that I was doing great and that I should just stand there a few more minutes until my anxiety went away. A few minutes later, my anxiety suddenly disappeared. I couldn’t believe it!

I told him, “Hey, Mr. Krishak, I’m cured now!”

He said, “Great, you can come on down from the ladder now, and you can be on the stage crew of Brigadoon!”

I had a blast working on the stage crew. I absolutely loved climbing ladders and adjusting the lights and curtains near the ceiling, and I couldn’t even remember why or how I’d been so afraid of heights.

This story seems to be beautifully consistent with my toy model here. David started the day in a state where his short-term-predictors output “extremely strong fear reactions” when he was up high. As long as David stayed up on the ladder, those fear-reaction short-term-predictors kept on getting the same context data, and therefore they kept on firing their outputs at full strength. And David just kept feeling terrified.

Then, after 15 boring-yet-terrifying minutes on the ladder, some innate circuit in David’s brainstem issued an override—as if to say, “C’mon, nothing is changing, nothing is happening, we can’t just keep burning all these calories all day. It’s time to calm down now.” The short-term-predictors continued sending the same outputs as before, but the brainstem exercised its veto power, and forcibly reset David’s cortisol, heart-rate, etc., back to baseline. This “override” state immediately created error signals in the relevant short-term-predictors in David’s amygdala! And the error signals, in turn, led to model updates! The short-term predictors were all edited, and from then on, David was no longer afraid of heights.

This story kinda feels like speculation piled on top of speculation, but whatever, I happen to think it’s right. If nothing else, it’s good pedagogy! Here’s the diagram for this situation; make sure you can follow all the steps.

5.2.2 Toy model walkthrough, assuming changing context

The previous subsections assumed static context lines (constant sensory environment, constant behaviors, constant thoughts and plans, etc.). What happens if the context is not static?

If the context lines are changing, then it’s no longer true that learning happens only at “overrides”. When the context changes in the absence of an “override”, the output changes too, and the new output gets treated as ground truth for what the old output should have been. Again, this seems to be just what we want: if we learned something new and relevant in the last second, then our current expectation should be more accurate than our previous expectation, and thus we have a sound basis for updating our models.
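In ML terms, this is “bootstrapping”: the newer, better-informed prediction acts as the training target for the older one. Here’s a minimal sketch, again with my own illustrative assumptions (a linear predictor, and the loop delay modeled as one timestep).

```python
import numpy as np

def bootstrap_update(w, old_context, new_context, lr=0.05):
    """One learning step with no override: the prediction from the newer context is
    looped back and treated as ground truth for the prediction from the older context."""
    old_output = w @ old_context
    new_output = w @ new_context         # this came back around the loop as the "supervisor"
    error = new_output - old_output      # nonzero only because the context changed
    return w + lr * error * old_context  # nudge the *old* prediction toward the new one
```

This is the same semi-gradient trick used in TD learning: the target itself depends on the current weights, but the update only differentiates through the old prediction.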

5.3 Value function calculation (TD learning) as a special case of long-term prediction

At this point, ML experts will recognize a resemblance to Temporal Difference (TD) learning. It’s not quite the same, though. The differences are:

First, TD learning is usually used in reinforcement learning (RL) as a method for going from a reward function to a value function. By contrast, I was talking about things like “digestive enzyme production”, which are neither rewards nor values.

In other words, there is a generally-useful motif that involves going from some immediate quantity X to “long term expectation of X”. The calculation of a value function from a reward function is an example of that motif, but it’s not the only useful example.

(As a matter of terminology, it seems to be generally accepted that the term “TD learning” can in fact apply to things that are not RL value functions.[4] However, empirically in my own experience, as soon as I mention “TD learning”, the people I’m talking to immediately assume I must be talking about RL value functions. So I want to be clear here.)

Second, to get something closer to traditional TD learning, we’d need to replace the 2-way switch with a 2-way summation—and then the “overrides” would be analogous to rewards. Much more on “switch vs summation” in the next subsection.

Here’s a TD learning circuit that would behave similarly to what you’d see in an AI textbook. Note the purple box on the right: compared to the previous figure, I replaced the 2-way switch with a 2-way summation. More on “switch vs summation” in the next subsection.

Third, there are many additional ways to tweak the circuit which are frequently used in AI textbooks, and some of those may be involved in the brain circuits too. For example, we can put in time-discounting, or different emphases on false-positives vs false-negatives (see my discussion of distributional learning in Section 5.5.6.1 below), etc.

To keep things simple, I will be ignoring all these possibilities (including time-discounting) in the discussion below.

5.3.1 Switch (i.e., value = expected next reward) vs summation (i.e., value = expected sum of future rewards)?

The figures above show two variants of our toy model. In one, the purple box is a two-way switch between “defer to the short-term predictor” and some independent “ground truth”. In the other, the purple box is a two-way summation instead.

The switch version trains the short-term-predictor to predict the next ground truth, whenever it should arrive.

The summation version trains the short-term-predictor to predict the sum of future ground truth signals.
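In code terms, the only difference between the two variants is how the supervisor (training target) gets assembled at each moment. Here’s a sketch, ignoring time-discounting as elsewhere in this section, and using `None` to represent the “null” case where the Steering Subsystem has no ground truth to inject; the function names and signature are my own shorthand, not anything standard.

```python
def supervisor_switch(override, looped_back_prediction):
    # Switch: when the Steering Subsystem has ground truth, that IS the target;
    # otherwise it defers, and the predictor's own looped-back output is the target.
    return looped_back_prediction if override is None else override

def supervisor_summation(override, looped_back_prediction):
    # Summation: the target is reward-right-now plus the predictor's own looped-back output,
    # so training pushes the prediction toward the (undiscounted) sum of future rewards.
    reward = 0.0 if override is None else override
    return reward + looped_back_prediction
```

Note that `supervisor_summation` is just the undiscounted TD(0) target; the two functions differ only at the moments when an override actually arrives.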

The correct answer could also be “something in between switch and summation”. Or it could even be “none of the above”.

RL papers universally use the summation version—i.e., “value is the expected sum of future rewards”. What about biology? And which is actually better?

It doesn’t always matter! Consider AlphaGo. Like every RL paper today, AlphaGo was originally formulated in the summation paradigm. But it happens to have one and only one nonzero reward signal per game, namely +1 at the end of the game if it wins, or −1 if it loses. In that case, switch vs summation makes no difference. The only difference is one of terminology:

  • In the summation case, we would say “each non-terminal move in the Go game has reward=0”.

  • In the switch case, we would say “each non-terminal move in the Go game has a reward of (null)”.

(Do you see why?)

But in other cases, it does matter. So back to the question: should it be switch or summation?

Let’s step back. What are we trying to do here?

One thing that a brain needs to do is make decisions that weigh cross-domain tradeoffs. If you’re a human, you need to decide whether to watch TV or go to the gym. If you’re some ancient worm-like creature, you need to “decide” whether to dig or to swim. Either way, this “decision” impacts energy balance, salt balance, probability of injury, probability of mating—you name it. The design goal in the decision-making algorithm is that you make the decision that maximizes inclusive genetic fitness. How might that goal be best realized?

One method involves building a value function that estimates the organism’s inclusive genetic fitness (compared to some arbitrary—indeed, possibly time-varying—baseline), conditional on continuing to execute a given course of action. Of course it won’t be a perfect estimate—real inclusive genetic fitness can only be calculated in hindsight, many generations after the fact. But once we have such a value function, however imperfect, we can plug it into an algorithm that makes decisions to maximize value (more on this in the next post), and thus we get approximately-fitness-maximizing behavior.

So having a value function is key for making good decisions that weigh cross-domain tradeoffs. But nowhere in this story is the claim “value is the expectation of a sum of future rewards”! That’s a particular way of setting up the value-approximating algorithm, a method which might or might not be well suited to the situation at hand.

I happen to think that brains use something closer to the switch circuit, not the summation circuit, not only for homeostatic-type predictions (like the digestive enzymes example above), but also for value functions, contrary to mainstream RL papers. Again, I think the reality is “none of the above” in all cases; it’s just that it’s closer to switch than to summation.

Why do I favor “switch” over “summation”?

An example: sometimes I stub my toe and it hurts for 20 seconds; other times I stub my toe and it hurts for 40 seconds. But I don’t think of the latter as twice as bad as the former. In fact, even five minutes later, I wouldn’t remember which is which. (See the peak-end rule.) This is the kind of thing I would naturally expect from switch, but is an awkward fit for summation. It’s not strictly incompatible with summation; it just requires a more complicated, value-dependent reward function. As a matter of fact, if we allow the reward function to depend on value, then switch and summation can imitate each other.

Anyway, in upcoming posts, I’ll be assuming switch, not summation. I don’t think it matters very much for the big picture. I definitely don’t think it’s part of the “secret sauce” of animal intelligence, or anything like that. But it does affect some of the detailed descriptions.

The next post will include more details of reinforcement learning in the brain, including how “reward prediction error” works and so on. I am bracing for lots of confused readers, who will be disoriented by the fact that I’m assuming a different relationship between value and reward than what everyone is used to. For example, in my picture, “reward” is a synonym for “ground truth for what the value function should be right now”—both should account for not only the organism’s current circumstances but also its future prospects. Sorry in advance for any confusion! I will do my best to be clear.

5.4 An array of long-term predictors involving the telencephalon & brainstem

Here’s the long-term-predictor circuit from above:

Copied from above.

I can lump together the 2-way switch with the rest of the genetically-hardwired circuitry, and then rearrange the boxes a bit, and I get the following:

Same as above, but drawn differently.

Now, obviously digestive enzymes are just one example. Let’s draw in some more examples, add some hypothesized neuroanatomy, and include other terminology. Here’s the result:

I claim that there is a bank of long-term-predictors, consisting of an array of short-term-predictors in the telencephalon, each with a closed-loop connection to a corresponding Steering Subsystem circuit. I’m calling the former (telencephalon) part by the name “Thought Assessors”, for reasons explained in Section 5.5.4 below.

Excellent! We’re halfway to my big picture of decision-making and motivation. The rest of the picture—including the “actor” part of actor-critic reinforcement learning—will come in the next post, and will fill in the hole in the top-left side of that diagram. (The term “Steering Subsystem” comes from Post #3.)

Here’s one more diagram and caption for pedagogical purposes.

Reminder: a “short-term predictor” is one component of a “long-term predictor”. Here’s where both those things fit into that diagram above. The only thing that makes it a long-term predictor is the possibility of “defer-to-predictor mode”—i.e., the Steering Subsystem might send a “ground truth in hindsight” signal that is not really “ground truth” in the normal sense, but is rather a copy of the corresponding entry on the scorecard. In other words, “defer-to-predictor mode” is like the Steering Subsystem saying to the short-term predictor: “OK sure, whatever, I’ll take your word for it”. If the Steering Subsystem regularly keeps a signal in “defer-to-predictor mode” for 10 minutes straight, then we can get predictions that anticipate the future by up to 10 minutes. Conversely, if the Steering Subsystem never uses “defer-to-predictor mode” for a certain signal, then we shouldn’t really be calling it a “long-term predictor” in the first place.

In the next two subsections, I will elaborate on the neuroanatomy which I’m hinting at in this diagram, and then I’ll talk about why you should believe me.

5.4.1 “Vertical” neuroanatomy:[1] cortico-basal ganglia-thalamo-cortical loops

In my post Big Picture of Phasic Dopamine, I talked about the theory (due originally to Larry Swanson) that the whole telencephalon is nicely organized into three layers (cortex, striatum, pallidum):

Cortex-like part of the loops: Hippocampus | Amygdala [basolateral part] | Piriform cortex | Medial prefrontal cortex | Motor & “planning” cortex
Striatum-like part of the loops: Lateral septum | Amygdala [central part] | Olfactory tubercle | Ventral striatum | Dorsal striatum
Pallidum-like part of the loops: Medial septum | BNST | Substantia innominata | Ventral pallidum | Globus pallidus

The entire telencephalon—neocortex, hippocampus, amygdala, everything—can be divided into cortex-like structures, striatum-like structures, and pallidum-like structures. If two structures are in the same column in this table, that means they’re wired together into cortico-basal ganglia-thalamo-cortical loops (see next paragraph). This table is incomplete and oversimplified; for a better version see Fig. 4 here.

This idea then connects to the earlier (and now widely accepted) theory, dating to Alexander 1986, that these three layers of the telencephalon are interconnected into a large number of parallel “cortico-basal ganglia-thalamo-cortical loops”, which can be found in almost every part of the telencephalon.

Here’s a little illustration:

Simplified cartoon illustration of how the brain has many parallel cortico-basal ganglia-thalamo-cortical loops. Source: Matthieu Thiboust.

Given all that, here is a possible rough model for how this loop architecture relates to the short-term predictor learning algorithm that I’ve been talking about:

WARNING: DON’T TAKE THIS DIAGRAM TOO LITERALLY. See Big Picture of Phasic Dopamine for slightly more details, but mostly I haven’t looked into it much, and in particular the “Layer 1, Layer 2, Final (pooling) layer” labels are kinda just spitballing. (The “pooling” is based on there being 2000× more neurons in the striatum than the pallidum—see here.) Acronyms: BLA=basolateral amygdala, BNST=bed nucleus of the stria terminalis, CEA=central amygdala, mPFC=medial prefrontal cortex, VP=ventral pallidum, VS=ventral striatum.

5.4.2 “Horizontal” neuroanatomy—cortical specialization

The previous subsection was about the “vertical” three-layer structure of the telencephalon. Now let’s switch to the “horizontal” structure, i.e. the fact that different parts of the cortex do different things (in cooperation with the corresponding parts of the striatum and pallidum).

This is oversimplified, but here’s my latest attempt at (part of) the cortex in a nutshell:

  • The extended motor cortex (and corresponding striatum) is the cortex’s main output region for behaviors involving skeletal muscles, like reaching and walking.

  • The medial prefrontal cortex (mPFC—which also includes anterior cingulate cortex) (and corresponding (ventral) striatum) is the cortex’s main output region for behaviors involving autonomic /​ visceromotor /​ hormonal actions, like releasing cortisol, vasoconstriction, goosebumps, and so on.

  • The amygdala (which has both cortex-like and striatum-like parts) is the cortex’s main output region for certain behaviors that involve both skeletal muscle actions and autonomic actions, like flinching-reactions, or freezing-reactions (when frightened), and so on.

  • The insular cortex (and corresponding (ventral) striatum) is the cortex’s main input region for autonomic /​ homeostatic /​ body status information, like blood sugar levels, pain, cold, taste, muscle strain, etc.

I won’t talk about the motor cortex in this series, but I think the other three are all involved in these long-term prediction circuits. For example:

  • I claim that if you look at a little subregion in the medial prefrontal cortex, you might find that it’s being trained to fire in proportion to the probability of upcoming cortisol release;

  • I claim that if you look at a little subregion in the amygdala, you might find that it’s being trained to fire in proportion to the probability of upcoming freezing-reactions;

  • I claim that if you look at a little subregion of the (anterior) insular cortex, you might find that it’s being trained to fire in proportion to the probability of upcoming cold feelings in your left arm.

5.5 Six reasons I like this “array of long-term predictors” picture

5.5.1 It’s a sensible way to implement a biologically-useful capability

If you start producing digestive enzymes before eating, you’ll digest faster. If your heart starts racing before you see the lion, then your muscles will be primed and ready to go when you do see the lion. Etc.

So these kinds of predictors seem obviously useful.

Moreover, as discussed in the previous post (Section 4.5.2), the technique I’m proposing here (based on supervised learning) seems either superior to or complementary with other ways to meet these needs.

5.5.2 It’s introspectively plausible

For one thing, we do in fact start salivating before we eat the cracker, start feeling nervous before we see the lion, etc.

For another thing, consider the fact that all the actions I’m talking about in this post are involuntary: you cannot salivate on command, or dilate your pupils on command, etc., at least not in quite the same way that you can wiggle your thumb on command.

(More on voluntary actions in the next post—they’re in a whole different part of the telencephalon.)

I’m glossing over a bunch of complications here, but the involuntary nature of these things seems pleasingly consistent with the idea that they are being trained by their own dedicated supervisory signals, straight from the brainstem. They’re slaves to a different master, so to speak. We can kinda trick them into behaving in certain ways, but our control is limited and indirect.

5.5.3 It’s evolutionarily plausible

As discussed in Section 4.4 of the previous post, the simplest short-term predictor is extraordinarily simple, and the simplest long-term predictor is only a bit more complicated than that. And these very simple versions are already plausibly fitness-enhancing, even in very simple animals.

Moreover, as I discussed a while back (Dopamine-supervised learning in mammals & fruit flies), there is an array of little learning modules in the fruit fly, playing a seemingly-similar role to what I’m talking about here. Those modules also use dopamine as a supervisory signal, and there is some genomic evidence of a homology between those circuits and the mammalian telencephalon.

5.5.4 It offers a reconciliation between “visceromotor” and “motivation” pictures of the medial prefrontal cortex (mPFC)

Take the mPFC (which also includes the anterior cingulate cortex—ACC), as an example. People talk about this region in two quite different ways:

  • On the one hand, as mentioned above (Section 5.4.2), mPFC is described as a visceromotor /​ homeostatic /​ autonomic motor output region—it issues commands to control hormones, to execute sympathetic and parasympathetic nervous system reactions, and so on. For example, “electrical stimulation of the infralimbic cortex has been shown to affect gastric motility and to cause hypotension”, or this paper says stimulating mPFC caused “[bristling]; pupillary dilation; and changes in blood pressure, respiratory rate, and heart rate”, or see Bud Craig’s book which characterizes ACC as a homeostatic motor output center. This way of thinking also elegantly explains the fact that the region is agranular (missing layer #4 out of the 6 neocortex layers), which implies “output region” both for theoretical reasons and by analogy with the (agranular) motor cortex.

  • On the other hand, mPFC is frequently described as being related to a host of vaguely-motivation-related activities. For example, Wikipedia mentions “attention allocation, reward anticipation, decision-making, ethics and morality, impulse control … and emotion” in regards to ACC.

I think my picture works for both:[5]

For the first (visceromotor) perspective, if you look at Section 5.2 above, you’ll see that the predictors’ outputs do in fact cause homeostatic changes—at least, they do when the genetically-hardwired circuitry of the Steering Subsystem has set that signal in “defer-to-predictor mode” (as opposed to “override mode”).

For the second (motivation) perspective, this will make a bit more sense after the next post, but note my suggestive description of a “scorecard” in the diagram of Section 5.4. The idea is: The “context” lines going into the “Thought Assessors” contain the horrific complexity of everything in your conscious mind and more—where you are, what you’re seeing and doing, what you’re thinking about, what you’re planning to do in the future and why, etc. The relatively simple, genetically-hardcoded Steering Subsystem can’t make heads or tails of any of that!

But that’s a dilemma, because the Steering Subsystem is the source of rewards /​ drives /​ motivations! How can the Steering Subsystem issue rewards for a good plan, if it can’t make heads or tails of what you’re planning??

The “scorecard” is the answer. It takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.

Thus, whenever there’s an interaction between thoughts and drives—emotions, decision-making, ethics, aversions, etc.—the “Thought Assessors” need to be involved as an intermediary.

5.5.5 It explains the Dead Sea Salt Experiment

See my discussion in my old post Inner alignment in salt-starved rats. In brief, experimenters sporadically played a sound and popped an object into a rat’s cage, and immediately thereafter sprayed super-salty water directly into the rat’s mouth. The rat found the saltwater disgusting, and started reacting with horror to the sound and object. Then later, the experimenters made the rat feel salt-deprived. When they played the sound and popped the object this time, the rat got very excited—even though the rat had never been salt-deprived before in its life.

In our setup, this is exactly what we expect: when the sound and object appear, the “I anticipate tasting salt” predictor starts firing like crazy. Meanwhile the Steering Subsystem (hypothalamus & brainstem) has hardwired circuitry that says “If I’m salt-deprived, and if the ‘scorecard’ from the Learning Subsystem suggests that I will soon taste salt, then that’s awesome, and whatever thought the Learning Subsystem is thinking, it should pursue that idea with gusto!”

5.5.6 It offers a nice explanation for (some of) the diversity of dopamine neuron activity

Recall from Section 5.4.1 above that I’m claiming that dopamine neurons carry the supervisory signals of all these supervised-learning modules.[6]

There’s a pop-science misconception that there is a (singular) dopamine signal in the brain, and it bursts when good things are happening. In reality, there are many different dopamine neurons doing many different things.

Thus we get the question: what are all these diverse dopamine signals doing? There’s no consensus; claims in the literature are all over the place. But I can throw my hat into the ring: in my picture described above, there are probably hundreds or thousands of short-term predictors in the telencephalon, predicting hundreds or thousands of different things, and they each need a different dopamine supervisory signal!

(And there are even more dopamine signals besides those! One such signal, associated with the brain’s “main” reward prediction error signal, will show up in the next post. Still others are off-topic for this series but discussed here.)

If my story is right, what would we expect to see in dopamine-measuring experiments?

Imagine a rat running through a maze. Moment by moment, its array of predictors are getting dopamine supervisory signals about its various hormone levels, its heart rate, its expectation of drinking and eating and having a sore leg and freezing and tasting salt, and on and on. In short, we expect dopamine neurons to be bouncing up and down in all kinds of different ways.

Thus, pretty much any instance where an experimenter has measured a dopamine neuron correlating with some behavioral variable is probably consistent with my picture too.

Here are a couple examples:

  • There are dopamine neurons that burst for salient stimuli like unexpected flashes of light (ref). Can I explain that? Sure, no problem! I say: they could be supervisory signals saying “this would have been a good time to orient”, or “to flinch”, or “to raise your heart rate”, etc.

  • There are dopamine neurons that correlate with the velocity with which a mouse is running on a treadmill-ball (ref). Can I explain that? Sure, no problem! I say: they could be supervisory signals saying “expect sore muscles”, or “expect cortisol”, or “expect high heart rate”, etc.

Here’s another data point which seems reassuringly consistent with my picture. A few dopamine neurons burst when aversive things happen (ref). Four of the five regions[7] in which such neurons can be found (according to the linked paper) are right where I expect that array of short-term predictors to be—namely, the cortex-like and striatum-like layers of amygdala, and medial prefrontal cortex (mPFC), and the ventromedial shell of the nucleus accumbens, which is (at least roughly?) the striatum stop of the mPFC cortico-basal ganglia-thalamo-cortical loops. This is exactly what I expect in my picture. For example, if a mouse gets shocked, then a “should-I-freeze-now” predictor would get a supervisory signal saying “Yes, you should have been freezing”.

Side note: Lammel et al. 2014 mentions so-called “‘non-conventional’ VTA [dopamine] neurons” in “medial posterior VTA (PN and medial PBP)”. These seem to project to exactly the non-value-function Thought Assessor areas, and it’s claimed that they have different firing patterns from other dopamine neurons. Maybe the firing pattern difference is reflective of the different requirements of supervised learning versus reinforcement learning? (I’m not an expert; I’m just flagging that it sounds intriguing and would be worth looking into more.)

UPDATE JAN 2023: Upon further investigation (thanks Nathaniel Daw), I think what I’m talking about here is basically the right explanation for the diverse dopamine signals on the fringes of VTA /​ SNc, or something like that, but the fine-grained dopamine diversity more typically measured has a different explanation which is at least spiritually closer to the “distributional” story next.

5.5.6.1 Aside: Distributional predictor outputs

I didn’t talk about it in the last post, but short-term predictors have hyperparameters in their learning algorithms, two of which are “how strongly to update upon a false-positive (overshoot) error”, and “how strongly to update upon a false-negative (undershoot) error”. As the ratio of these two hyperparameters varies from 0 to ∞, the resulting predictor behavior varies from “fire the output if there’s even the faintest chance that the supervisor will fire” to “never fire the output unless it’s all but certain that the supervisor will fire”.

Therefore, if we have many predictors, each with a different ratio of those hyperparameters, then we can (at least approximately) output a probability distribution for the prediction, rather than a point estimate.
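Here’s a sketch of that trick under toy assumptions (scalar predictors, one fixed context, a made-up supervisor distribution): each predictor gets its own ratio of “undershoot” to “overshoot” learning rates, so the ensemble ends up spanning the supervisor’s distribution rather than just its mean. This is essentially expectile-style distributional learning; all the specific numbers are invented for illustration.

```python
import random

random.seed(0)
# Each predictor gets a different asymmetry between undershoot and overshoot updates.
asymmetries = [0.1, 0.3, 0.5, 0.7, 0.9]  # relative weight on undershoot (false-negative) errors
predictions = [0.0 for _ in asymmetries]
lr = 0.02

def supervisor_sample():
    # Toy bimodal "ground truth": the event either doesn't happen (0) or is strong (~1).
    return 0.0 if random.random() < 0.6 else random.uniform(0.8, 1.0)

for step in range(50_000):
    target = supervisor_sample()
    for i, tau in enumerate(asymmetries):
        error = target - predictions[i]
        # Weight undershoot errors (error > 0) by tau and overshoot errors by (1 - tau).
        weight = tau if error > 0 else (1.0 - tau)
        predictions[i] += lr * weight * error

print([round(p, 2) for p in predictions])
# Low-tau predictors settle near the bottom of the distribution, high-tau ones near the top,
# so together they sketch out a probability distribution rather than a single point estimate.
```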

A recent set of experiments from DeepMind and collaborators found evidence (based on measurements of dopamine neurons) that the brain does in fact use this trick, at least for reward prediction.

I speculate that it may use the same trick for the other long-term predictors too—e.g. maybe the predictions of arm pain and cortisol and goosebumps etc. are all in the form of ensembles of long-term predictors that each sample a probability distribution.

I bring this up, first, because it’s another example where dopamine neurons are behaving in a way that seems pleasingly consistent with my worldview, and second, because it’s plausibly useful for AGI safety—and thus I was looking for an excuse to bring it up anyway!

5.6 Conclusion

Anyway, as usual I don’t pretend to have smoking-gun proof of my hypothesis (i.e. that the brain has an array of long-term predictors involving telencephalon-brainstem loops), and there are some bits that I know I’m still confused about. But considering the evidence in the previous subsection (and rest of the post), I wind up feeling strongly that I’m broadly on the right track. I’m happy to discuss more in the comments. Otherwise, onward to the next post, where we will finally put everything together into a big picture of how I think motivation and decision-making work in the brain!

  1. ^

    ‘Horizontal’ neuroanatomy versus ‘vertical’ neuroanatomy is my idiosyncratic terminology, but I’m hoping it’s intuitive. If you imagine stretching out the cortex into a sheet, oriented horizontally, then the ‘vertical’ neuroanatomy would include e.g. the interconnections between cortical and subcortical structures, and the ‘horizontal’ neuroanatomy would include e.g. the different roles played by different parts of the cortex. See also the table in Section 5.4.1.

  2. ^

    To be clear, in reality, there probably isn’t a discrete all-or-nothing 2-way switch here. There could be a “weighted average” setting, for example. Remember, this whole discussion is just a pedagogical “toy model”; I expect that reality is more complicated in various respects.

  3. ^

    I note that I’m just running through this algorithm in my head; I haven’t simulated it. I’m optimistic that I didn’t majorly screw up, i.e. that everything I’m saying about the algorithm is qualitatively true, or at least can be qualitatively true with appropriate parameter settings and perhaps other minor tweaks.

  4. ^

    Examples of using the terminology “TD learning” for something which is not related to RL reward functions include “TD networks”, and the Successor Representations literature (example), or this paper, etc.

  5. ^

    The classic attempt to reconcile “visceromotor” and “motivation” pictures of mPFC is Antonio Damasio’s “somatic marker hypothesis”. My discussion here has some similarities and some differences from the somatic marker hypothesis. I won’t get into that; it’s off-topic.

  6. ^

    As in the previous post, when I say that “dopamine carries the supervisory signal”, I’m open to the possibility that dopamine is actually a closely-related signal like the error signal, or the negative error signal, or the negative supervisory signal. It really doesn’t matter for present purposes.

  7. ^

    The fifth area where that paper found dopamine neurons bursting under aversive circumstances, namely the tail of the striatum, has a different explanation I think—see here.