I think that analogies to evolution tell us very little about how an AGI's value formation process would work. Biological evolution is a very different sort of optimizer from SGD, and there are evolution-specific details that entirely explain our misalignment wrt inclusive genetic fitness (IGF). See point 5 in the post linked below for details, but tl;dr:
Evolution can only optimize over our learning process and reward circuitry, not directly over our values or cognition. Moreover, robust alignment to IGF requires that you even have a concept of IGF in the first place. Ancestral humans never developed such a concept, so it was never useful for evolution to select for reward circuitry that would cause humans to form values around the IGF concept.
It would be an enormous coincidence if the reward circuitry that led us to form values around those IGF-promoting concepts that were learnable in the ancestral environment were to also lead us to form values around IGF itself once it became learnable in the modern environment, despite the reward circuitry not having been optimized for that purpose at all. That would be like successfully directing a plane to land at a particular airport while only being able to influence the geometry of the plane's fuselage at takeoff, without even knowing where to find the airport in question.
SGD is different in that it directly optimizes over values / cognition, and in that AIs will presumably have a conception of human values during training.
Additionally, on most dimensions of comparison, humans seem like the more relevant analogy, even ignoring the fact that we will literally train our AIs to imitate humans.
I agree that the processes are different, but I think the analogy still holds well.
SGD doesn't get to optimize directly over a conveniently factored-out values module. It's as blind to the details of how it gets results as evolution is, since it can only care about which local twiddles get locally better results.
So it seems to me that SGD should basically build up a cognitive mess that doesn’t get refactored in nice ways when you do further training. Which looks a lot like evolution in the analogy.
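To make the "blind local twiddles" point concrete, here's a minimal sketch (toy network, random stand-in data, nothing taken from any actual system under discussion): SGD computes a local gradient of one scalar loss and nudges every parameter by it, and nothing in the update rule singles out a "values module" or any other functional subcomponent.

```python
import torch
import torch.nn as nn

# Toy stand-in network and data (illustrative only); nothing in the
# architecture or the update rule marks out a "values module".
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(32, 16)          # random stand-in inputs
y = torch.randint(0, 4, (32,))   # random stand-in labels

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()  # every parameter gets a local gradient from the same scalar loss
opt.step()       # and each one is nudged in whatever direction locally reduces the loss
```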
Maybe there’s some evidence for this in the difficulty of retraining a language model to generate text in the middle, even though this is apparently easy to do if you train the model to do infilling from the get-go? https://arxiv.org/abs/2207.14255
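For context on what "train the model to do infilling from the get-go" involves, here's a rough sketch of the kind of data transformation the linked paper describes (the sentinel token names below are placeholders I made up, not the paper's exact tokens):

```python
import random

def fim_transform(doc: str, sentinels=("<PRE>", "<SUF>", "<MID>")) -> str:
    """Sketch of a fill-in-the-middle training example: split a document into
    prefix / middle / suffix, then move the middle to the end so a purely
    left-to-right model learns to infill it."""
    pre_tok, suf_tok, mid_tok = sentinels
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{pre_tok}{prefix}{suf_tok}{suffix}{mid_tok}{middle}"

print(fim_transform("def add(a, b):\n    return a + b\n"))
```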
(I also disagree about ancestral humans not having a concept or sense tracking “multitude of my descendants” / “power of my family” / etc. And indeed some of these are in my values.)
The key difference between evolution and SGD isn't about locality or efficiency (though I disagree with your characterization of SGD / deep learning as inefficient or inelegant). The key difference is that human evolution involved a two-level optimization process, with evolution optimizing over the learning process + initial reward system of the brain, and the brain learning (optimizing) within a lifetime.
Values form within lifetimes, and evolution does not operate on that timescale. Thus, the mechanisms available to evolution for influencing learned values are limited and roundabout.
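To make that two-level structure concrete, here's a toy sketch (the "genome" fields, the reward rule, and the fitness function are all made-up illustrations, not claims about actual biology). The outer loop can only mutate and select the parameters of the learning setup; learned values only ever appear inside the inner, within-lifetime loop.

```python
import random

def lifetime_learning(genome, n_experiences=200):
    """Inner loop (one lifetime): values are learned from the rewards that the
    genome's reward circuitry assigns to experiences. The genome never sets
    the learned values directly."""
    values = 0.0
    for _ in range(n_experiences):
        experience = random.gauss(0.0, 1.0)
        reward = genome["reward_weight"] * experience
        values += genome["learning_rate"] * reward
    return values

def fitness(genome):
    """Fitness depends only on what the grown agent ends up doing (here, how
    close its learned values land to a toy target), never on the genome directly."""
    return -abs(lifetime_learning(genome) - 1.0)

def evolve(generations=30, pop_size=20):
    """Outer loop: mutate and select genomes, i.e. reward circuitry and learning rates."""
    pop = [{"reward_weight": random.uniform(-1.0, 1.0),
            "learning_rate": random.uniform(0.0, 0.1)} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [{k: v + random.gauss(0.0, 0.02) for k, v in random.choice(parents).items()}
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())
```

The point is purely structural: the outer loop never touches `values` directly; it can only reshape the circuitry and learning process that produce them.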
Ancestral humans had concepts somewhat related to IGF, but they didn't have the concept of IGF itself. That matters a lot for determining whether the sorts of learning process / reward circuit tweaks that evolution applied in the ancestral environment will lead modern humans to form IGF values that generalize to situations such as maximally donating to sperm banks. Not coincidentally, humans are more likely to value those notions that were accessible in the ancestral environment than IGF itself.
There's also the further difficulty of aligning any RL-esque learning process to valuing IGF specifically: the long time horizons (relative to within-lifetime learning) over which differences in IGF become apparent mean that any possible reward for increasing IGF will be very sparse and will rarely influence an organism's cognition. Additionally, learning to act coherently over longer time horizons is just generally difficult.
What you're saying is that evolution optimized over changes to a kind of blueprint-for-a-human (DNA) that does not directly "do" anything like cognition with concepts and values, but which grows, through cell division and later through cognitive learning, into a human that does do things like cognition with concepts and values. This grown human then goes on to exhibit behavior and have an impact on the world. So, roughly, there is a pipeline with three levels:
(1) blueprint → (2) agent → (3) behavior
In contrast, when we optimize over policies in ML, we optimize directly at the level of a kind of cognition-machine (e.g. some neural net architecture) that directly acts in the world, and could, quite plausibly, have concepts and values.
So evolution optimizes at (1), whereas in today’s ML we optimize at (2) and there is nothing really corresponding to (1) in most of today’s ML.
Did I understand you correctly?
That's the key mechanistic difference between evolution and SGD. There's an additional layer here that comes from how that mechanistic difference interacts with the circumstances of the ancestral environment (i.e., that ancestral humans never had an IGF abstraction), which means evolutionary optimization over the human mind blueprint in the ancestral environment would never have produced a blueprint that led to value formation around IGF in the modern environment. This fully explains modern humanity's misalignment wrt IGF, which would have happened even in worlds where inner alignment is never a problem for ML systems. Thus, evolutionary analogies tell us ~nothing about whether we should be worried about inner alignment.
(This is even ignoring the fact that IGF seems like a very hard concept to align minds to at all, due to the sparseness of IGF reward signals.)