Having read Steven’s post on why humans will not create AGI through a process analogous to evolution, I found his metaphor of the gene trying to do something quite apt.
If the “genome = code” analogy is the better one for thinking about the relationship between AGIs and brains, then the fact that the genome can steer the neocortex toward proxy goals such as salt homeostasis is very noteworthy: a similar mechanism may give us some tools, even if limited ones, to steer a brain-like AGI toward goals that we would like it to have.
I think Eliezer’s comment is also important in that it explains quite eloquently how complex these goals really are, even though they seem simple to us. In particular, the positive motivational valence that such brain-like systems attribute to internal mental states makes them very different from other types of world-optimizing agents, which may care about themselves only for instrumental reasons.
Also, the fact that we don’t have genetic fitness as a direct goal is evidence not only that evolution-like algorithms don’t do inner alignment well, but also that simple yet abstract goals such as inclusive genetic fitness may be hard to install in a brain-like system. This is especially so if you agree that, in the case of humans, having genetic fitness as a direct goal, at least alongside the proxies, would probably have helped fitness even in the ancestral environment.
I don’t really know how big a problem this is. Given that our own goals are very complex and that outer alignment is hard, maybe we shouldn’t be trying to put a simple goal into an AGI to begin with.
Maybe there is a path to using these brain-like mechanisms (including positive motivational valence for imagined states, and so on) to create a safe, aligned AGI. Getting this answer right seems extremely important to me, and, if I understand correctly, it is a key part of Steven’s research.
Of course, it is also possible that this approach is fundamentally unsafe and we shouldn’t pursue it, but my sense is that this is unlikely. It should be possible to build such systems at a smaller scale (and therefore not superintelligent) so that we can investigate their motivations: what the internal goals are, and whether the system is treacherous or merely latching onto proxies. If it turns out that such a path is indeed fundamentally unsafe, I would expect this to be related to ontological crises or to the profound motivational changes that are expected to occur as capability increases.