So, if we were to imagine a world where there is a human-friendly attractor of sorts that a self-improving AI would settle into, how would that world look?
Honestly, I think it looks pretty much like our own world.
There’s a widespread assumption in the alignment community that the processes by which humans learn values are complex, hard to replicate, and rely on “weird quirks” of our cognition left to us by the evolutionary pressures of the ancestral environment. I think this assumption is very, very wrong.
The alignment community’s beliefs about the complexity of human value learning mostly formed prior to the deep learning era. At that time, it was easy to think that the brain’s learning process had to be complex, that evolution had extensively tuned and tweaked the brain’s core learning algorithm, and that our cognitive architecture was extensively specialized to the ancestral environment.
It seemed reasonable to anchor our expectations about the complexity and specialization of the brain’s learning algorithm to the complexity and specialization of other biological systems. If the brain’s learning algorithm were as complex as, say, the immune system, that would indicate that the mechanisms by which we acquire and generalize values are similarly complex. Reproducing such a delicate and complex process in an AI would be incredibly difficult.
We can see a telling example of such assumptions in Eliezer Yudkowsky’s post My Childhood Role Model:
Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.
It’s amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics. This deserves applause. It deserves more than applause, it deserves a place in the Guinness Book of Records. Like successfully building the fastest car ever to be made entirely out of Jello.
How poorly did the blind idiot god (evolution) really design the human brain?
This is something that can only be grasped through much study of cognitive science, until the full horror begins to dawn upon you.
All the biases we have discussed here should at least be a hint.
Likewise the fact that the human brain must use its full power and concentration, with trillions of synapses firing, to multiply out two three-digit numbers without a paper and pencil.
Yudkowsky says that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct, and since humans plainly do generalize to such domains, it represents a significant misprediction by any view that assigns a high degree of specialization to the brain’s learning algorithm[1]. In reality, humans are by far the most general learning system currently known.
In fact, large language models arguably implement social instincts more adroitly than many humans do, while original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard-coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication, and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
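As a concrete illustration of the kind of check this paragraph refers to, here is a minimal sketch (not the original evaluation; it assumes a hypothetical query_model(prompt) helper that wraps whatever text-completion API you have access to, and the prompt format and trial count are arbitrary choices):

```python
import random
import re

def multiplication_accuracy(query_model, n_trials=100):
    """Estimate how often a language model multiplies two random 3-digit numbers correctly.

    query_model is assumed to be a caller-supplied function that sends a prompt
    string to the model and returns its text completion.
    """
    correct = 0
    for _ in range(n_trials):
        a, b = random.randint(100, 999), random.randint(100, 999)
        completion = query_model(f"Q: What is {a} * {b}?\nA:")
        match = re.search(r"-?\d+", completion.replace(",", ""))
        if match and int(match.group()) == a * b:
            correct += 1
    return correct / n_trials
```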
How remarkable that such a wildly different learning process should so precisely reproduce our own particular flavor of cognitive inadequacy. To anyone who thinks the human learning process represents some tiny, parochial niche in the space of possible learning processes, this should look like quite a coincidence.
But this is no coincidence. It is also no coincidence that adversarial examples optimized to fool a model of one architecture often transfer to fooling models with different architectures. It is again no coincidence that models of different architectures often converge to similar internal representations when trained on similar data. There are deep symmetries in the behaviors of general learning algorithms, for they all share a common trait: simplicity.
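To make the adversarial-transfer point concrete, here is a minimal sketch of such an experiment (not drawn from any particular transfer study): it assumes torchvision’s pretrained ImageNet classifiers are available, and the random placeholder batch would need to be replaced with real, correctly preprocessed images and labels for the numbers to mean anything.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Two quite different architectures, both pretrained on ImageNet.
source = models.resnet18(weights="IMAGENET1K_V1").eval()
target = models.vgg16(weights="IMAGENET1K_V1").eval()

# Placeholder batch; substitute real preprocessed ImageNet images and labels.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

# FGSM: a single signed-gradient step computed against the *source* model only.
images.requires_grad_(True)
loss = F.cross_entropy(source(images), labels)
loss.backward()
adversarial = (images + 0.03 * images.grad.sign()).clamp(0, 1).detach()

with torch.no_grad():
    err_source = (source(adversarial).argmax(1) != labels).float().mean().item()
    err_target = (target(adversarial).argmax(1) != labels).float().mean().item()

# Transfer shows up as err_target rising well above the target model's clean error,
# even though the perturbation never saw the target model's gradients.
print(f"source error: {err_source:.2f}, target error: {err_target:.2f}")
```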
Since the advent of deep learning, we’ve learned a lot about what general and capable learning architectures actually look like. Time and time again, we’ve seen that “simple architectures scaled up” beats out complex tuning. This pattern of empirical evidence in favor of simple architectures is called, somewhat dramatically, the “bitter lesson” by ML researchers. (Personally, I view the “bitter lesson” as the straightforward consequence of applying a simplicity prior to the space of possible learning algorithms.)
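To give a sense of how little there is to the “simple” part: the core learning procedure behind essentially all of modern deep learning fits in a handful of lines, and scaling it up is mostly engineering around this loop rather than new learning machinery. A minimal sketch (the model, data loader, optimizer choice, and learning rate are all placeholders):

```python
import torch
import torch.nn.functional as F

def train(model, data_loader, lr=1e-4):
    """The core learning procedure: repeatedly nudge parameters downhill on the loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for inputs, targets in data_loader:
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()   # backpropagate the loss through the model
        optimizer.step()  # take a gradient step on every parameter
    return model
```

GPT-3’s training signal is essentially this loop with a transformer as the model and next-token prediction as the loss, run at enormous scale.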
We now have abundant evidence showing that general and capable learning systems tend to be simple. This pattern should also hold for evolution[2] with regard to the complexity of the brain’s learning procedure[3]. I think we should greatly decrease the amount of complexity we assume to be behind our own value learning mechanisms.
I’ll go even further: I think there is a single mechanism behind most of human value acquisition and generalization, and it’s not even that mysterious. I think human values arise from an inner alignment failure between the brain’s learning and steering subsystems. I think many (possibly all) of our most fragile-seeming intuitions about values arise pretty robustly from the resulting multi-agent negotiation dynamics of such an inner alignment failure.
You can read more about this perspective on human value acquisition in this comment. There, I argue that multi-agent inner alignment failure dynamics account for (1) our avoidance of wireheading, (2) the high diversity of our values, (3) the critical role childhood plays in setting our values, (4) the “moral philosophy”-like reasoning that governs our adoption of new values, and (5) our inclination towards preserving the diversity of the present into the future. I also think there are other values-related intuitions that arise from said dynamics. To wit:
Humans are not a great example of an animal-aligned intelligence, of course. Our influence on other lifeforms is so far a huge net-negative, with the diversity of life on Earth plummeting badly.
While it’s true that humans have greatly reduced biological diversity, doing so ran contrary to at least a portion of our values. Very few humans, handed unlimited power, would use that power to continue the current trajectory of species extinction[4]. Given broader capabilities, the human inclination is to satisfy a wider collection of values. This is an important aspect of our values-related intuitions, and one that aligned AI systems ought to replicate. It is also key to preventing “a relentless drive to optimize something at any price”.
Note that, in multi-party negotiations, increasing the total resources available typically[5] leads to weaker parties getting a larger absolute share of the resources, but a smaller relative share. In this regard, the multi-agent dynamics of an inner alignment failure seem like they could reproduce the human tendency to implement a wider range of values as capabilities increase.
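Here is a toy numerical sketch of that dynamic (the bargaining model and all of the numbers are illustrative assumptions, nothing more): each party keeps its outside option, and the surplus above that is split in proportion to bargaining power.

```python
def split(total, outside_options, powers):
    """Toy bargaining model: everyone keeps their outside option, and the remaining
    surplus is divided in proportion to bargaining power (purely illustrative)."""
    surplus = total - sum(outside_options)
    return [d + surplus * p / sum(powers) for d, p in zip(outside_options, powers)]

# Weak party: outside option 4, power 1.  Strong party: outside option 6, power 4.
for total in (10, 100, 1000):
    weak, _strong = split(total, [4, 6], [1, 4])
    print(f"total={total:4d}  weak party gets {weak:6.1f}  ({weak / total:.0%} of the pie)")
# total=  10  weak party gets    4.0  (40% of the pie)
# total= 100  weak party gets   22.0  (22% of the pie)
# total=1000  weak party gets  202.0  (20% of the pie)
```

The weaker party’s absolute payoff grows with the total while its relative share shrinks toward its bargaining weight, which is exactly the pattern described above.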
[1] I should clarify that I am not an absolutist about the limited specialization of the brain’s learning algorithm. I freely admit that there are regions of the brain specialized to particular categories of cognition. What I object to is the assumption that there exist enormous reservoirs of additional complexity that are crucial to the value learning process.
[2] I strongly believe this to be true. It’s not just that human ML researchers are bad at finding complex, general learning algorithms. As mentioned above, the bitter lesson derives from a simplicity prior over the space of learning algorithms. Simple learning procedures generalize well for the same reason that simple hypotheses generalize well.
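One way to make that parallel precise is the classic Occam bound from learning theory (stated here as a sketch, for the realizable case with a prefix-free description language): if a hypothesis $h$ with description length $d(h)$ bits is consistent with $m$ i.i.d. training examples, then with probability at least $1 - \delta$,

$$\operatorname{err}(h) \;\le\; \frac{d(h)\ln 2 + \ln(1/\delta)}{m}.$$

Shorter hypotheses therefore come with tighter generalization guarantees from the same data; the suggestion above is that the same logic applies one level up, over the space of learning algorithms rather than the space of hypotheses.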
[3] An important note: the brain itself can be very complex while still implementing a simple learning procedure. Analogously, the code specifying an ML model’s training procedure is usually far simpler than the firmware running the processors responsible for its training.
[4] If anything, they’d be much more likely to un-extinct some dinosaurs.
[5] This doesn’t always happen. A lot depends on the multi-agent consensus mechanism. I think that ensuring capable AIs have internal consensus mechanisms that respect the preferences of weaker components of the AI’s cognition will be challenging, but tractable.