Now, for the rats, there’s an evolutionarily-adaptive goal of “when in a salt-deprived state, try to eat salt”. The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed!
This is importantly technically false in a way that should not be forgotten on pain of planetary extinction:
The outer loss function training the rat genome was strictly inclusive genetic fitness. The rats ended up with zero internal concept of inclusive genetic fitness, and indeed, no coherent utility function; and instead ended up with complicated internal machinery running off of millions of humanly illegible neural activations; whose properties included attaching positive motivational valence to imagined states that the rat had never experienced before, but which shared a regularity with states experienced by past rats during the “training” phase.
A human, who works quite similarly to the rat due to common ancestry, may find it natural to think of this as a very simple ‘goal’; because things similar to us appear to have falsely low algorithmic complexity when we model them by empathy; because the empathy can model them using short codes. A human may imagine that natural selection successfully created rats with a simple salt-balance term in their simple generalization of a utility function, simply by natural-selection-training them on environmental scenarios with salt deficits and simple loss-function penalties for not balancing the salt deficits, which were then straightforwardly encoded into equally simple concepts in the rat.
This isn’t what actually happened. Natural selection applied a very simple loss function of ‘inclusive genetic fitness’. It ended up as much more complicated internal machinery in the rat that made zero mention of the far more compact concept behind the original loss function. You share the complicated machinery so it looks simpler to you than it should be, and you find the results sympathetic so they seem like natural outcomes to you. But from the standpoint of natural-selection-the-programmer the results were bizarre, and involved huge inner divergences and huge inner novel complexity relative to the outer optimization pressures.
Thanks for your comment! I think that you’re implicitly relying on a different flavor of “inner alignment” than the one I have in mind.
(And confusingly, the brain can be described using either version of “inner alignment”! With different resulting mental pictures in the two cases!!)
See my post Mesa-Optimizers vs “Steered Optimizers” for details on those two flavors of inner alignment.
I’ll summarize here for convenience.
I think you’re imagining that the AGI programmer will set up SGD (or something equivalent), and that what SGD does is analogous to evolution acting on the entire brain. In that case, I would agree with your perspective.
I’m imagining something different:
The first background step is: I argue (for example here) that one part of the brain (I’m calling it the “neocortex subsystem”) is effectively implementing a relatively simple, quasi-general-purpose learning-and-acting algorithm. This algorithm is capable of foresighted goal seeking, predictive-world-model-building, inventing new concepts, etc. etc. I don’t think this algorithm looks much like a deep neural net trained by SGD; I think it looks like a learning algorithm that no human has invented yet, one which is more closely related to learning probabilistic graphical models than to deep neural nets. So that’s one part of the brain, comprising maybe 80% of the weight of a human brain. Then there are other parts of the brain (brainstem, amygdala, etc.) that are not part of this subsystem. Instead, one thing they do is run calculations and interact with the “neocortex subsystem” in a way that tends to “steer” that “neocortex subsystem” towards behaving in ways that are evolutionarily adaptive. I think there are many different “steering” brain circuits, and they are designed to steer the neocortex subsystem towards seeking a variety of goals using a variety of mechanisms.
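To pin down that division of labor, here is a minimal sketch in Python. Every name in it is something I’m making up purely for illustration; it specifies interfaces only, not an actual algorithm.

```python
# Hypothetical sketch of the two-subsystem picture above; interfaces only.

class NeocortexLikeSubsystem:
    """The quasi-general-purpose learning-and-acting algorithm: builds a
    predictive world model, invents concepts, and seeks goals with foresight.
    It takes a reward signal as input, and that signal is involved in
    creating and modifying its internal goals."""

    def step(self, observation, reward: float):
        """Update the world model and internal goals from the latest
        observation and reward, then choose an action by foresighted
        planning. (The actual algorithm is the hard, not-yet-invented part.)"""
        raise NotImplementedError


class SteeringCircuit:
    """One of many hard-coded circuits (brainstem, amygdala, ...) sitting
    outside the learning subsystem, idealized here as something that watches
    the world and emits a reward signal to steer the learner."""

    def reward(self, observation) -> float:
        raise NotImplementedError
```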
So that’s the background picture in my head—and if you don’t buy into that, nothing else I say will make sense.
Then the second step is: I’m imagining that the AGI programmers will build a learning-and-acting algorithm that resembles the learning-and-acting algorithm of the “neocortex subsystem”—not by some blind search over algorithm space, but by directly programming it, just as people have directly programmed AlphaGo and many other learning algorithms in the past. (They will do this either by studying how the neocortex works, or by reinventing the same ideas.) Once those programmers succeed, they will have in their hands a powerful quasi-general-purpose learning-and-acting (and foresighted goal-seeking, concept-inventing, etc.) algorithm. And then they will be in a position analogous to that of the genes wiring up those other brain modules (brainstem, etc.): they will be writing code that tries to get this neocortex-like algorithm to do the things they want it to do. Let’s call the code they write a “steering subsystem”.
The simplest possible “steering subsystem” is ridiculously simple and obvious: just a reward calculator that sends rewards for the exact thing that the programmer wants it to do. (Note: the “neocortex subsystem” algorithm has an input for reward signals, and these signals are involved in creating and modifying its internal goals.) And if the programmer unthinkingly builds that kind of simple “steering subsystem”, it would kinda work, but not reliably, for the usual reasons like ambiguity in extrapolating out-of-distribution, the neocortex-like algorithm sabotaging the steering subsystem, etc. But we can hope that there are more complicated possible “steering subsystems” that would work better.
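To make that concrete, here is roughly what the “ridiculously simple” version might look like, continuing the hypothetical sketch above. Again, every name is made up for illustration (`looks_like_desired_behavior` stands in for whatever check the programmer writes, and `env` is a stand-in environment with assumed `reset`/`step` methods); this is a sketch, not a real implementation.

```python
# Hypothetical sketch, continuing the interfaces above: the simplest
# possible steering subsystem is just a reward calculator for "the exact
# thing the programmer wants".

class NaiveRewardCalculator(SteeringCircuit):
    def __init__(self, looks_like_desired_behavior):
        # A programmer-supplied check, e.g. "did the robot tidy the room
        # this step?" -- this stand-in is exactly where the trouble hides.
        self.check = looks_like_desired_behavior

    def reward(self, observation) -> float:
        return 1.0 if self.check(observation) else 0.0


def run(agent, steering, env, n_steps):
    """The steering subsystem influences the agent only through the reward
    channel -- which is why out-of-distribution ambiguity and the agent
    sabotaging the reward calculator are the expected failure modes."""
    observation, reward = env.reset(), 0.0
    for _ in range(n_steps):
        action = agent.step(observation, reward)
        observation = env.step(action)
        reward = steering.reward(observation)
```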
So then this article is part of a research program of trying to understand the space of possibilities for the “steering subsystem”, and figuring out which of them (if any!!) would work well enough to keep arbitrarily powerful AGIs (of this basic architecture) doing what we want them to do.
Finally, if you can load that whole perspective into your head, I think from that perspective it’s appropriate to say “the genome is ‘trying’ to install that goal in the rat’s brain”, just as you can say that a particular gene is “trying” to do a certain step of assembling a white blood cell or whatever. (The “trying” is metaphorical but sometimes helpful.) I suppose I should have said “rat’s neocortex subsystem” instead of “rat’s brain”. Sorry about that.
Does that help? Sorry if I’m misunderstanding you. :-)
Having read Steven’s post on why humans will not create AGI through a process analogous to evolution, I found his metaphor of the gene “trying” to do something quite appropriate.
If the “genome = code” analogy is the better one for thinking about the relationship between AGIs and brains, then the fact that the genome can steer the neocortex towards proxy goals such as salt homeostasis is very noteworthy: a similar mechanism may give us some tools, even if limited, to steer a brain-like AGI towards goals that we would like it to have.
I think Eliezer’s comment is also important in that it explains quite eloquently how complex these goals really are, even though they seem simple to us. In particular, the positive motivational valence that such brain-like systems attribute to internal mental states makes them very different from other types of world-optimizing agents, which may care about themselves only for instrumental reasons.
Also, the fact that we don’t have genetic fitness as a direct goal is evidence not only that evolution-like algorithms don’t do inner alignment well, but also that simple yet abstract goals such as inclusive genetic fitness may be hard to install in a brain-like system. This is especially so if you agree that, in the case of humans, having genetic fitness as a direct goal, at least alongside the proxies, would probably have helped fitness even in the ancestral environment.
I don’t really know how big of a problem this is. Given that our own goals are very complex and that outer alignment is hard, maybe we shouldn’t be trying to put a simple goal into an AGI to begin with.
Maybe there is a path for using these brain-like mechanisms (including positive motivational valence for imagined states and so on) to create a secure aligned AGI. Getting this answer right seems extremely important to me, and if I understand correctly, this is a key part of Steven’s research.
Of course, it is also possible that this is fundamentally unsafe and we shouldn’t do it, but my intuition is that this is unlikely. It should be possible to build such systems at a smaller scale (and therefore not superintelligent) so that we can investigate their motivations: what their internal goals are, and whether the system is being treacherous or merely pursuing proxies. If it turns out that such a path is indeed fundamentally unsafe, I would expect this to be related to ontological crises or to profound motivational changes that are expected to occur as capability increases.