Inner Alignment in Salt-Starved Rats
(This post is deprecated. It has some kernels of truth but also lots of mistakes and confusions. You should instead read Incentive Learning vs Dead Sea Salt Experiment (2024), which covers many of the same topics. —Steve, 2024)
(See comment here for some corrections and retractions. —Steve, 2022)
Introduction: The Dead Sea Salt Experiment
In this 2014 paper by Mike Robinson and Kent Berridge at the University of Michigan (see also this more theoretical follow-up discussion by Berridge and Peter Dayan), rats were raised in an environment where they were well-nourished, and in particular, where they were never salt-deprived—not once in their life. The rats were sometimes put into a test cage with a lever; each time the lever appeared, a device immediately sprayed ridiculously salty water directly into their mouths. The rats were disgusted and repulsed by the extreme salt taste, and quickly learned to hate the lever—which from their perspective would seem to be somehow causing the saltwater spray. One of the rats went so far as to stay tight against the opposite wall—as far from the lever as possible!
Then the experimenters made the rats feel severely salt-deprived, by depriving them of salt. Haha, just kidding! They made the rats feel severely salt-deprived by injecting the rats with a pair of chemicals that are known to induce the sensation of severe salt-deprivation. Ah, the wonders of modern science!
...And wouldn’t you know it, almost instantly upon injection, the rats changed their behavior! When shown the lever (this time without the salt-water spray), they now went right over to that lever and jumped on it and gnawed at it, obviously desperate for that super-salty water.
The end.
Aren’t you impressed? Aren’t you floored? You should be!!! I don’t think any standard ML algorithm would be able to do what these rats just did!
Think about it:
Is this Reinforcement Learning? No. RL would look like the rats randomly stumbling upon the behavior of “nibbling the lever when salt-deprived”, finding it rewarding, and then adopting it as a goal via “credit assignment”. That’s not what happened. When the rats were nibbling at the lever, they had never in their life had an experience where the lever had brought forth anything other than an utterly repulsive experience. And they had never in their life had an experience where they were salt-deprived, tasted something extremely salty, and found it gratifying. I mean, they were clearly trying to interact with the lever—this is a foresighted plan we’re talking about—but that plan does not seem to have been reinforced by any experience in their life.
Update for clarification: Specifically, it’s not any version of RL where you learn about the reward function only by observing past rewards. This category includes all model-free RL and some model-based RL (e.g. MuZero). If, by contrast, you have a version of model-based RL where the agent can submit arbitrary hypothetical queries to the true reward function, then OK, sure, now you can get the rats’ behavior. I don’t think that’s what’s going on here, for reasons I’ll mention at the bottom. (There’s a toy code sketch of this distinction just after this list.)
Is this Imitation Learning? Obviously not; the rats had never seen any other rat around any lever for any reason.
Is this an innate, hardwired, stimulus-response behavior? No, the connection between a lever and saltwater was an arbitrary, learned connection. (I didn’t mention it, but the researchers also played a distinctive sound each time the lever appeared. Not sure how important that is. But anyway, that connection is arbitrary and learned, too.)
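To make the distinction in the “Update for clarification” above concrete, here’s a toy sketch in code. Everything in it (the outcome names, the ±10 rewards, the two-line “learned reward model”) is invented purely for illustration; neither agent is meant as a model of an actual brain.

```python
def true_reward(outcome, salt_deprived):
    # Ground-truth reward function (normally hidden from the agent).
    if outcome == "salty_taste":
        return +10 if salt_deprived else -10
    return 0

class LearnedRewardAgent:
    """Learns about reward only from past observations (all model-free RL,
    plus model-based RL in the MuZero style)."""
    def __init__(self):
        # Its entire life experience: the lever led to salty taste while NOT salt-deprived.
        self.learned_reward = {"salty_taste": -10}

    def evaluate_plan(self, predicted_outcome, salt_deprived):
        # The learned reward model has never seen salt-deprivation, so the plan still looks bad.
        return self.learned_reward.get(predicted_outcome, 0)

class OracleQueryAgent:
    """Model-based RL that can submit hypothetical queries to the true reward
    function (AlphaZero-style). This WOULD reproduce the rats' behavior."""
    def evaluate_plan(self, predicted_outcome, salt_deprived):
        return true_reward(predicted_outcome, salt_deprived)

print(LearnedRewardAgent().evaluate_plan("salty_taste", salt_deprived=True))  # -10: avoid the lever
print(OracleQueryAgent().evaluate_plan("salty_taste", salt_deprived=True))    # +10: go for the lever
```

The first agent has no way to know that salty water would now be great, because nothing in its history says so; the second agent can simply ask. The rats behaved like the second agent, even though (as argued at the bottom of this post) I doubt the brain literally contains a queryable ground-truth reward calculator.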
So what’s the algorithm here? How did their brains know that this was a good plan? That’s the subject of this post.
What does this have to do with inner alignment? What is inner alignment anyway? Why should we care about any of this?
With apologies to the regulars on this forum who already know all this, the so-called “inner alignment problem” occurs when you, a programmer, build an intelligent, foresighted, goal-seeking agent. You want it to be trying to achieve a certain goal, like maybe “do whatever I, the programmer, want you to do” or something. The inner alignment problem is: how do you ensure that the agent you programmed is actually trying to pursue that goal? (Meanwhile, the “outer alignment problem” is about choosing a good goal in the first place.) The inner alignment problem is obviously an important safety issue, and will become increasingly important as our AI systems get more powerful in the future.
(See my earlier post mesa-optimizers vs “steered optimizers” for specifics about how I frame the inner alignment problem in the context of brain-like algorithms.)
Now, for the rats, there’s an evolutionarily-adaptive goal of “when in a salt-deprived state, try to eat salt”. The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! And remarkably, that goal was installed even before that situation was ever encountered! So it’s worth studying this example—perhaps we can learn from it!
Before we get going on that, one more boring but necessary thing:
Aside: Obligatory post-replication-crisis discussion
The Dead Sea Salt experiment strikes me as trustworthy. Pretty much all the rats—and for key aspects literally every tested rat—displayed an obvious qualitative behavioral change almost instantaneously upon injection. There were sensible tests with control levers and with control rats. The authors seem to have tested exactly one hypothesis, and it’s a hypothesis that was a priori plausible and interesting. And so on. I can’t assess every aspect of the experiment, but from what I see, I believe this experiment, and I’m taking its results at face value. Please do comment if you see anything questionable.
Outline of the rest of the post
Next I’ll go through my hypothesis for how the rat brain works its magic here. Actually, I’ve come up with three variants of this hypothesis over the past year or so, and I’ll talk through all of them, in chronological order. Then I’ll speculate briefly on other possible explanations.
My hypothesis for how the rat brain did what it did
The overall story
As I discussed in My Computational Framework for the Brain, my starting-point assumption is that the rat brain has a “neocortex subsystem” (really the neocortex, hippocampus, parts of thalamus and basal ganglia, maybe other things too). The neocortex subsystem takes sensory inputs and reward inputs, builds a predictive model from scratch, and then chooses thoughts and actions that maximize reward. The reward, in turn, is issued by a different subsystem of the brain that I’ll call “subcortex”.
To grossly oversimplify the “neocortex builds a predictive model” part of that, let’s just say for present purposes that the neocortex subsystem memorizes patterns in the inputs, and then patterns in the patterns, and so on.
To grossly oversimplify the “neocortex chooses thoughts and actions that maximize reward” part, let’s just say for present purposes that different parts of the predictive model are associated with different reward predictions, the reward predictions are updated by a TD learning system that has something to do with dopamine and the basal ganglia, and parts of the model that predict higher reward are favored while parts of the model that predict lower reward are pushed out of mind.
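To make “reward predictions updated by a TD learning system” a bit more concrete, here is a minimal TD(0)-style sketch. The states, the -1 reward, and the learning rate are all made up for illustration; nothing here is a claim about actual dopamine circuitry.

```python
# Minimal TD(0) sketch: parts of the predictive model carry reward predictions
# ("values"), which get nudged by a prediction-error signal.

alpha, gamma = 0.1, 0.9                         # learning rate, discount factor
value = {"see_lever": 0.0, "taste_salt": 0.0}   # reward predictions for two model states

def td_update(state, next_state, reward):
    # Prediction error: (reward received + discounted next prediction) - (current prediction)
    delta = reward + gamma * value[next_state] - value[state]
    value[state] += alpha * delta
    return delta

# One trial in the original (non-salt-deprived) condition: the lever leads to an
# aversive salt spray, so "see_lever" drifts negative, and thoughts/plans that
# involve the lever get disfavored.
td_update("see_lever", "taste_salt", reward=-1.0)
print(value)   # {'see_lever': -0.1, 'taste_salt': 0.0}
```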
Since the “predictive model” part is invoked for the “reward-maximization” part, we can say that the neocortex does model-based RL.
(Aside: It’s sometimes claimed in the literature that brains do both model-based and model-free RL. I disagree that this is a fundamental distinction; I think “model-free” = “model-based with a dead-simple model”. See my old comment here.)
Why is this important? Because that brings us to imagination! The neocortex can activate parts of the predictive model not just to anticipate what is about to happen, but also to imagine what may happen, and (relatedly) to remember what has happened.
Now we get a crucial ingredient: I hypothesize that the subcortex somehow knows when the neocortex is imagining the taste of salt. How? This is the part where I have three versions of the story, which I’ll go through shortly. For now, let’s just assume that there is a wire going into the subcortex, and when it’s firing, that means the neocortex is activating the parts of the predictive model that correspond (semantically) to tasting salt.
And once we have that, the last ingredient is simple: The subcortex has an innate, hardwired circuit that says “If the neocortex is imagining tasting salt, and I am currently salt-deprived, then send a reward to the neocortex.”
OK! So now the experiment begins. The rat is salt-deprived, and it sees the lever appear. That naturally evokes its previous memory of tasting salt, and that thought is rewarded! When the rat imagines walking over and nibbling the lever, it finds that to be a very pleasing (high-reward-prediction) thought indeed! So it goes and does it!
(UPDATE: Commenters point out that this description isn’t quite right—it doesn’t make sense to say that the idea of tasting salt is rewarding per se. Rather, I propose that the subcortex sends a reward related to the time-derivative of how strongly the neocortex is imagining / expecting to taste salt. So the neocortex gets a reward for first entertaining the idea of tasting salt, and another incremental reward for growing that idea into a definite plan. But then it would get a negative reward for dropping that idea. Sorry for the mistake / confusion. Thanks commenters!)
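In code, the corrected proposal might look something like the following minimal sketch, where the “imagined_salt” trace, the gating by salt-deprivation, and the scaling constant are all invented for illustration.

```python
# Sketch of the time-derivative reward-shaping idea: reward is proportional to the
# CHANGE in how strongly the neocortex is imagining / expecting to taste salt,
# gated by whether the body is currently salt-deprived.

def shaping_reward(imagined_salt_now, imagined_salt_prev, salt_deprived, k=1.0):
    if not salt_deprived:
        return 0.0
    return k * (imagined_salt_now - imagined_salt_prev)

# The rat first entertains the idea of salt (0.0 -> 0.3), then firms it up into a
# definite plan (0.3 -> 0.9): positive rewards. Dropping the idea (0.9 -> 0.0)
# would be penalized.
trace = [0.0, 0.3, 0.9, 0.0]
rewards = [shaping_reward(now, prev, salt_deprived=True)
           for prev, now in zip(trace, trace[1:])]
print(rewards)   # roughly [0.3, 0.6, -0.9]
```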
Now let’s fill in that missing ingredient: How does the subcortex get its hands on a signal flagging that the neocortex is imagining the taste of salt? I have three hypotheses.
Hypothesis 1 for the “imagining taste of salt” signal: The neocortex API enables outputting a prediction for any given input channel
This was my first theory, I guess from last year. As argued by the “predictive coding” people, Jeff Hawkins, Yann LeCun, and many others, the neocortex is constantly predicting what input signals it will receive next, and updating its models when the predictions are wrong. This suggests that it should be possible to stick an arbitrary input line into the neocortex, and then pull out a signal carrying the neocortex’s predictions for that input line. (It would look like a slightly-earlier copy of the input line, with sporadic errors for when the neocortex is surprised.) I can imagine, for example, that if you put an input signal into cortical mini-column #592843 layer 4, then by looking at a certain neuron in that same mini-column, you can read out those predictions.
If this is the case, then the rest is pretty straightforward. The genome wires the salt taste bud signal to wherever in the neocortex, pulls out the corresponding prediction, and we’re done! For the reason described above, that line will also fire when merely imagining salt taste.
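Computationally, that wiring would amount to something like the following sketch. This is purely my illustration of the hypothesis, with a made-up threshold and toy “neocortex”; it isn’t a claim about real circuitry.

```python
# Toy version of Hypothesis 1: the genome wires the salt taste-bud signal into the
# neocortex, reads back the corresponding prediction line, and gates it with the
# body's salt-deprivation state.

class ToyPredictiveNeocortex:
    """Stand-in for the neocortex: it maintains a prediction for each input channel."""
    def __init__(self):
        self.predictions = {"salt_taste": 0.0}

    def predict(self, channel):
        # The genome doesn't need to know what the channel "means"; it just wires
        # the input in and reads the corresponding prediction out.
        return self.predictions[channel]

def innate_circuit(neocortex, salt_deprived):
    # Hardwired rule: reward when the neocortex is predicting/imagining salt
    # while the body is salt-deprived. (0.5 is an arbitrary toy threshold.)
    expecting_salt = neocortex.predict("salt_taste") > 0.5
    return 1.0 if (expecting_salt and salt_deprived) else 0.0

cortex = ToyPredictiveNeocortex()
cortex.predictions["salt_taste"] = 0.9               # imagining / expecting salty water
print(innate_circuit(cortex, salt_deprived=True))    # 1.0
print(innate_circuit(cortex, salt_deprived=False))   # 0.0
```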
Commentary on hypothesis 1: I have mixed feelings.
On the one hand, I haven’t really come across any independent evidence that this mechanism exists. And, having learned more about the nitty-gritty of neocortex algorithms (the outputs come from layer 5, blah blah blah), I don’t think the neocortex outputs carry this type of data.
On the other hand, I have a strong prior belief that if there are ten ways for the brain to do a certain calculation, and each is biologically and computationally plausible without dramatic architectural change, the brain will do all ten! (Probably in ten different areas of the brain.) After all, evolution doesn’t care much about keeping things elegant and simple. I mean, there is a predictive signal for each input—it has to be there somewhere! And I don’t currently see any reason that this signal couldn’t be extracted from the neocortex. So I feel sorta obligated to believe that this mechanism probably exists.
So anyway, all things considered, I don’t put much weight on this hypothesis, but I also won’t strongly reject it.
With that, let’s move on to the later ideas that I like better.
Hypothesis 2 for the “neocortex is imagining the taste of salt” signal: The neocortex is rewarded for “communicating its thoughts”
This was my second guess, dating from several months ago.
The neocortex subsystem has a bunch of output lines for motor control and whatever else, and it has a special output line S (S for salt).
Meanwhile, the subcortex sends rewards under various circumstances, and one of those circumstances involves salt: the neocortex is rewarded for sending a signal into S whenever salt is tasted. (The subcortex knows when salt is tasted, because it gets a copy of that same input.)
So now, as the rat lives its life, it stumbles upon the behavior of outputting a signal into S when eating a bite of saltier-than-usual food. This is reinforced, and gradually becomes routine.
The rest is as before: when the rat imagines a salty taste, it reuses the same model. We did it!
Commentary on hypothesis 2: A minor problem (from the point-of-view of evolution) is that it would take a while for the neocortex to learn to send a signal into S when eating salt. Maybe that’s OK.
A much bigger potential problem is that the neocortex could learn a pattern where it sends a signal into S when tasting salt, and also learns a different pattern where it sends a signal into S whenever salt-deprived, whether thinking about salt or not. This pattern would, after all, be rewarded, and I can’t immediately see how to stop it from developing.
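Here’s a toy illustration of that failure mode. The exact reward rule and the three episodes are made up; the only point is that the “gaming” policy collects at least as much reward as the intended one.

```python
# The subcortex rewards (a) firing S while salt is tasted, and (b) firing S while
# salt-deprived (the innate craving circuit). A policy that fires S whenever it's
# salt-deprived, without thinking about salt at all, exploits (b).

def subcortex_reward(s_fired, salt_tasted, salt_deprived):
    reward = 0.0
    if s_fired and salt_tasted:     # reward for "communicating" the salt taste
        reward += 1.0
    if s_fired and salt_deprived:   # the innate "craving salt" circuit
        reward += 1.0
    return reward

# Episodes: (salt_tasted, salt_deprived, thinking_about_salt)
episodes = [(True, False, True), (False, True, False), (False, True, False)]

# Intended policy: fire S only when tasting / vividly imagining salt.
intended = sum(subcortex_reward(thinking, tasted, deprived)
               for tasted, deprived, thinking in episodes)
# Gaming policy: also fire S whenever salt-deprived, regardless of its thoughts.
gaming = sum(subcortex_reward(thinking or deprived, tasted, deprived)
             for tasted, deprived, thinking in episodes)

print(intended, gaming)   # 1.0 3.0 -- the gaming policy does strictly better
```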
So I’m pretty skeptical about this hypothesis now.
Hypothesis 3 for the “neocortex is imagining the taste of salt” signal (my favorite!): Sorta an “interpretability” approach, probably involving the amygdala
This one comes out of my last post, Supervised Learning of Outputs in the Brain. Now we have a separate brain module that I labeled “supervised learning algorithm”, and which I suspect is primarily located in the amygdala. This module does supervised learning: the salt signal (from the taste buds) functions as the supervisory signal, and a random assortment of neurons in the neocortex subsystem (describing latent variables in the neocortex’s predictive model) function as the inputs to the learned model. Then the supervised learning module learns which patterns in those latent variables tend to reliably predict that salt is about to be tasted. Having done that, when it sees those patterns recur, that’s our signal that the neocortex is probably expecting the taste of salt … and as described above, it will also see those same patterns when the neocortex is merely imagining or remembering the taste of salt. So we have our signal!
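Here’s a toy version of that story in code. The “latent variables”, the crude correlational learning rule, and all the numbers are stand-ins I invented for illustration; none of it is a claim about how the amygdala actually learns.

```python
import numpy as np

# A supervised-learning module learns which patterns of neocortical latent variables
# reliably precede the taste of salt, using the taste-bud signal as supervision.

rng = np.random.default_rng(0)
n_latents = 50
salt_pattern = rng.standard_normal(n_latents)   # latent pattern active when expecting salt

def latents(expecting_salt):
    """Toy neocortex latent variables: the salt pattern (if present) plus unrelated activity."""
    background = rng.standard_normal(n_latents)
    return salt_pattern + 0.3 * background if expecting_salt else background

# Training: correlate latent activity with whether salt was actually tasted just afterward.
w = np.zeros(n_latents)
for _ in range(500):
    expecting = rng.random() < 0.5
    salt_tasted = 1.0 if expecting else 0.0     # during training, expectation tracks reality
    w += 0.01 * (salt_tasted - 0.5) * latents(expecting)

# Test: the learned detector also fires when salt is merely imagined or remembered,
# because the same latent pattern is active -- and that firing is the wanted signal.
print(w @ latents(True))    # large positive: "the neocortex is thinking about salt"
print(w @ latents(False))   # much smaller in magnitude: it isn't
```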
Commentary on Hypothesis 3: There’s a lot I really like about this. It seems to at-least-vaguely match various things I’ve seen in the literature about the functionality and connectivity of the amygdala. It makes a lot of sense from a design perspective—the patterns would be learned quickly and reliably, etc., as far as I can tell. I find it satisfyingly obvious and natural (in retrospect). So I would put this forward as my favorite hypothesis by far.
It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated “interpretability” module that tries to make sense of the AGI’s latent variables by correlating them with some other labeled properties of the AGI’s inputs, and then rewarding the AGI for “thinking about the right things” (according to the interpretability module’s output), which in turn helps turn those thoughts into the AGI’s goals, using the time-derivative reward-shaping trick as described above.
(Is this a good design idea that AGI programmers should adopt? I don’t know, but I find it interesting, and at least worthy of further thought. I don’t recall coming across this idea before in the context of the inner alignment problem.)
(Update 6 months later: I’m now more confident that this hypothesis is basically right, except maybe I should have said “medial prefrontal cortex and ventral striatum” where I said “amygdala”. Or maybe it’s all of the above. Anyway, see my later post Big Picture Of Phasic Dopamine.)
What would other possible explanations for the rat experiment look like?
The theoretical follow-up by Dayan & Berridge is worth reading, but I don’t think they propose any real answers, just lots of literature and interesting ideas at a somewhat-more-vague level.
(Update to add this paragraph) Next: At the top I mentioned “a version of model-based RL where the agent can submit arbitrary hypothetical queries to the true reward function” (this category includes AlphaZero). If the neocortex had a black-box ground-truth reward calculator (not a learned-from-observations model of the reward) and a way to query it, that would seem to resolve the mystery of how the rats knew to get the salt. But I can’t see how this would work. First, the ground-truth reward is super complicated. There are thousands of pain receptors, there are hormones sloshing around, there are multiple subcortical brain regions doing huge complicated calculations involving millions of neurons that provide input to the reward calculation (I believe), and so on. You can learn to model this reward-calculating system by observing it, of course, but actually running this system (or a copy of it) on hypotheticals seems unrealistic to me. Second, how exactly would you query the ground-truth reward calculator? Third, there seems to be good evidence that the neocortex subsystem chooses thoughts and actions based on reward predictions that are updated by TD learning, and I can’t immediately see how you can simultaneously have that system and a mechanism that chooses thoughts and actions by querying a ground-truth reward calculator. I think my preferred mechanism, “reward depends in part on what you’re thinking” (which we know is true anyway), is more plausible and flexible than “your imagination has special access to the reward function”.
Next: What would Steven Pinker say? He is my representative advocate of a certain branch of cognitive neuroscience—a branch to which I do not subscribe. Of course I don’t know what he would say, but maybe it’s a worthwhile exercise for me to at least try. Well, first, I think he would reject the idea that there’s a “neocortex subsystem”. And I think he would more generally reject the idea that there is any interesting question along the lines of “how does the reward system know that the rat is thinking about salt?”. Of course I want to pose that question, because I come from a perspective of “things like this need to be learned from scratch” (again see My Computational Framework for the Brain). But Pinker would not be coming from that perspective. I think he wants to assume that a comparatively elaborate world-modeling infrastructure is already in place, having been hardcoded by the genome. So maybe he would say there’s a built-in “diet module” which can model and understand food, taste, satiety, etc., and he would say there’s a built-in “navigation module” which can plan a route to walk over to the lever, and he would say there’s a built-in “3D modeling module” which can make sense of the room and lever, etc. etc.
OK, now that possibly-strawman-Steven-Pinker has had his say in the previous paragraph, I can respond. I don’t think this is so far off as a description of the calculations done by an adult brain. In ML we talk about “how the learning algorithm works” (SGD, BatchNorm, etc.), and separately (and much less frequently!) we talk about “how the trained model works” (OpenAI Microscope, etc.). I want to put all that infrastructure in the previous paragraph at the “trained model” level, not the “learning algorithm” level. Why? First, because I think there’s pretty good evidence for cortical uniformity. Second—and I know this sounds stupid—because I personally am unable to imagine how this setup would work in detail. How exactly do you insert learned content into the innate framework? How exactly do you interface the different modules with each other? And so on. Obviously, yes I know, it’s possible that answers exist, even if I can’t figure them out. But that’s where I’m at right now.
Self-review: corrections and retractions
I still think this post is correct in spirit, and it was part of my journey toward a good understanding of neuroscience and toward promising ideas in AGI alignment / safety.
But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?
First, my “neocortex vs subcortex” division eventually developed into “learning subsystem vs steering subsystem”, with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the “learning subsystem” does “learning-from-scratch” in the sense here.
Second, whenever I said “amygdala”, I probably should have said “anterior insula”, or better yet “some cortico-basal ganglia-thalamocortical loop involving anterior insula and ventral striatum”. Back when I wrote this, I thought that supervised learning was the unique realm of the cerebellum and amygdala, but now I think that it’s one aspect of the functioning of (parts of) the neocortex too. See here.
Third, I kinda mangled the description of what happens when the rat’s brainstem is craving salt and then learns that saltwater is expected. Keep in mind that nibbling the lever is pointless. The lever doesn’t do anything. It never did! (This experiment is in the “Pavlovian” paradigm not the “instrumental” paradigm.) So why does the rat run to it and nibble at it?
It seems to me that these Pavlovian experiments are just really weird. Under normal circumstances, saltwater winds up in a rat’s mouth because the rat was drinking it. Here, the rat is just doing whatever, and magically (from the rat’s perspective), saltwater appears in the rat’s mouth, thanks of course to the scientists’ crazy saltwater-squirting backpack contraption.
I think that when the brainstem thinks “oh wow I’m expecting very good things to happen imminently”, that turns into something like “hey cortex, whatever thought you happen to be thinking right now, do it right now, do it with great vigor!!!” Because, in the normal ecological situation, the thought that the rat is thinking is the cause of “expecting very good things to happen”.
But in these Pavlovian experiments, the good thing happens out of nowhere, so the behavior comes down to “whatever the rat happens to be thinking about at the key moment”. And this is actually underdetermined! Different rats’ minds tend to go to different places, and hence they wind up doing different things in these Pavlovian experiments. Thus, in the lingo, some rats are “sign-tracking rats”, other rats are “goal-tracking rats”.
Anyway, in this case, we wind up with the rats looking at and attending to the lever (because it appeared at the right time), and the brainstem says “Yes whatever you’re thinking about, it’s awesome, do it with vigor”. Attending to the lever isn’t exactly an action proposal per se, i.e. it’s not something the rat can “do”, but it happens to overlap with the beginning stage of the salient action plan “go to the lever and nibble the lever”. So that’s what happens. And I just think maybe we shouldn’t think too hard about the details here.
By contrast, in the article, I told a story involving a time-derivative. I think that story is right in other contexts—see here. Just probably not here.
Fourth, my discussion of Hypotheses 2 & 3 wasn’t quite hitting the nail on the head. There are a few issues in play:
(A) Supervised learning vs RL.
Supervised learning is something like learning an N-dimensional output with an N-dimensional ground truth, so you get an error gradient “for free” each query. Reinforcement learning is something like learning an N-dimensional output with a 1-dimensional “reward” ground truth, and tends to require trial-and-error. This is an important distinction in many contexts, but in retrospect it’s not so important for this post.
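A toy numerical illustration of that distinction (this is just my framing; nothing here is brain-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.standard_normal(8)     # the "correct" 8-dimensional output
output = np.zeros(8)                # the learner's current output

# Supervised learning: the ground truth hands back a full 8-dimensional error vector,
# which says how to adjust every output dimension at once.
error_vector = target - output

# Reinforcement learning: the ground truth hands back one scalar. To figure out which
# of the 8 dimensions was wrong, you have to perturb and compare (trial and error).
reward = -np.sum((output - target) ** 2)
probe = output.copy()
probe[3] += 0.1                     # try nudging one dimension
reward_after_probe = -np.sum((probe - target) ** 2)

print(error_vector)                 # 8 numbers of guidance per query
print(reward_after_probe - reward)  # 1 number of guidance per extra trial
```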
(B) One “system” vs two “systems”.
Let’s say I want to salivate profusely right now. I can’t just consciously decide to do that. It doesn’t work. I can try to vividly imagine eating a salty cracker. That works a little bit. Or I can go to the pantry and get an actual cracker. That works better.
What we’re seeing here is two systems, one that we associate with free will etc., and another that “decides” whether to salivate. The second system is not under control of the first system. Both systems learn, but with different training signals. See Reward is Not Enough.
(C) Adversarial dynamics
…And thus, we can think of this as a kind of adversarial ML type thing. Every time I (the first system) trick the second system into salivating, without later eating salt, then there’s a training signal that helps the second system learn not to be fooled. That’s not to say they’re evenly matched; it’s also possible that, in equilibrium, the first system winds up consistently calling the shots.
Thanks to adversarial dynamics, by the way, my story about why Hypothesis 2 is wrong isn’t as compelling as I had thought.
Also, the difference between Hypotheses 2 and 3 is less profound than it seems, because “two systems” maximizing A and B respectively is fundamentally not so different from “one system” maximizing A+B, for example. The implementation is still different, and the learning speed is different, and the corresponding bundle of intuitions is kinda different. So I still think Hypothesis 3 is the right way to think about it.
Fifth, having learned more about the neocortex, I’m more confidently opposed to Hypothesis 1.
Sixth, I didn’t know anything about this at the time, but there’s an interesting connection to the “incentive learning” literature, which involves various other rat experiments that seem to contradict the Dead Sea Salt experiment—rats need to learn from experience in situations where (on a naive reading of this post) one might expect them to do the task optimally the first time, without learning. This is a fun topic and I have a draft about it that I’ll post at some point.