I don’t see why we care if evolution is a good analogy for alignment risk. The arguments for misgeneralization/mis-specification stand on their own. They do not show that alignment is impossible, but they do strongly suggest that it is not trivial.
Focusing on this argument seems like missing the forest for the trees.
[Coming at this a few months late, sorry. This comment by @Steven Byrnes sparked my interest in this topic once again]
Ngl, I find everything you’ve written here a bit… baffling, Seth. Your writing in particular and your exposition of your thoughts on AI risk generally do not use evolutionary analogies, but this only means that posts and comments criticizing analogies with evolution (sample: 1, 2, 3, 4, 5, etc) are just not aimed at you and your reasoning. I greatly enjoy reading your writing and pondering the insights you bring up, but you are simply not even close to the most publicly-salient proponent of “somewhat high P(doom)” among the AI alignment community. It makes perfect sense from the perspective of those who disagree with you (or other, more hardcore “doomers”) on the bottom-line question of AI to focus their public discourse primarily on responding to the arguments brought up by the subset of “doomers” who are most salient and also most extreme in their views, namely the MIRI-cluster centered around Eliezer, Nate Soares, and Rob Bensinger.
And when you turn to MIRI and the views that its members have espoused on these topics, I am very surprised to hear that “The arguments for misgeneralization/mis-specification stand on their own” and are not ultimately based on analogies with evolution.
But anyway, to hopefully settle this once and for all, let’s go through all the examples that pop up in my head immediately when I think of this, shall we?
From the section on inner & outer alignment of “AGI Ruin: A List of Lethalities”, by Yudkowsky (I have removed the original emphasis and added my own):
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I’d expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn’t break alignment with the ‘inclusive reproductive fitness’ outer loss function, immediately after the introduction of farming—something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn’t carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned ‘lethally’ dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
[...]
21. There’s something like a single answer, or a single bucket of answers, for questions like ‘What’s the environment really like?’ and ‘How do I figure out the environment?’ and ‘Which of my possible outputs interact with reality in a way that causes reality to have certain properties?’, where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of ‘relative inclusive reproductive fitness’ - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you’d expect to be true about outer optimization loops on the order of both ‘natural selection’ and ‘gradient descent’. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
When I say “general intelligence”, I’m usually thinking about “whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems”.
[...]
Human brains aren’t perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems, that happened to generalize to millions of wildly novel tasks.
[...]
Human brains underwent no direct optimization for STEM ability in our ancestral environment, beyond traits like “I can distinguish four objects in my visual field from five objects”.[5]
[5] More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter
I think this view is wrong, and I don’t see much hope here. Here’s a variety of propositions I believe that I think sharply contradict this view:
There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly “nice”.
Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI’s mind to work differently from how you hope it works.
The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.
These are the sorts of features of human evolutionary history that resulted in us caring (at least upon reflection) about a much more diverse range of minds than “my family”, “my coalitional allies”, or even “minds I could potentially trade with” or “minds that share roughly the same values and faculties as me”.
Humans today don’t treat a family member the same as a stranger, or a sufficiently-early-development human the same as a cephalopod; but our circle of concern is certainly vastly wider than it could have been, and it has widened further as we’ve grown in power and knowledge.
Eliezer, summarized by Richard (continued): “In biological organisms, evolution is the ultimate source of consequentialism. A secondary outcome of evolution is reinforcement learning. For an animal like a cat, upon catching a mouse (or failing to do so) many parts of its brain get slightly updated, in a loop that makes it more likely to catch the mouse next time. (Note, however, that this process isn’t powerful enough to make the cat a pure consequentialist—rather, it has many individual traits that, when we view them from this lens, point in the same direction.) Another outcome of evolution, which helps make humans in particular more consequentialist, is planning—especially when we’re aware of concepts like utility functions.”
Perhaps Joe thinks that alignment is so easy that it can be solved in a short time window?
My main guess, though, is that Joe is coming at things from a different angle altogether, and one that seems foreign to me.
Attempts to generate such angles along with my corresponding responses:
Claim: perhaps it’s just not that hard to train an AI system to be “good” in the human sense? Like, maybe it wouldn’t have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
Counter: Maybe? But I expect these sorts of things to take time, and at least some mastery of the system’s internals, and if you want them to be done so well that they actually work in practice even across the great Change-Of-Distribution to operating in the real world then you’ve got to do a whole lot of clever and probably time-intensive work.
Claim: perhaps there’s just a handful of relevant insights, and new ways of thinking about things, that render the problem easy?
Counter: Seems like wishful thinking to me, though perhaps I could go point-by-point through hopeful-to-Joe-seeming candidates?
Eliezer: Something like, “Evolution constructed a jet engine by accident because it wasn’t particularly trying for high-speed flying and ran across a sophisticated organism that could be repurposed to a jet engine with a few alterations; a human industry would be gaining economic benefits from speed, so it would build unsophisticated propeller planes before sophisticated jet engines.” It probably sounds more convincing if you start out with a very high prior against rapid scaling / discontinuity, such that any explanation of how that could be true based on an unseen feature of the cognitive landscape which would have been unobserved one way or the other during human evolution, sounds more like it’s explaining something that ought to be true.
And why didn’t evolution build propeller planes? Well, there’d be economic benefit from them to human manufacturers, but no fitness benefit from them to organisms, I suppose? Or no intermediate path leading to there, only an intermediate path leading to the actual jet engines observed.
I actually buy a weak version of the propeller-plane thesis based on my inside-view cognitive guesses (without particular faith in them as sure things), eg, GPT-3 is a paper airplane right there, and it’s clear enough why biology could not have accessed GPT-3. But even conditional on this being true, I do not have the further particular faith that you can use propeller planes to double world GDP in 4 years, on a planet already containing jet engines, whose economy is mainly bottlenecked by the likes of the FDA rather than by vaccine invention times, before the propeller airplanes get scaled to jet airplanes.
The part where the whole line of reasoning gets to end with “And so we get huge, institution-reshaping amounts of economic progress before AGI is allowed to kill us!” is one that doesn’t feel particular attractored to me, and so I’m not constantly checking my reasoning at every point to make sure it ends up there, and so it doesn’t end up there.
Yudkowsky: and lest anyone start thinking that was an exhaustive list of fundamental problems, note the absence of, for example, “applying lots of optimization using an outer loss function doesn’t necessarily get you something with a faithful internal cognitive representation of that loss function” aka “natural selection applied a ton of optimization power to humans using a very strict very simple criterion of ‘inclusive genetic fitness’ and got out things with no explicit representation of or desire towards ‘inclusive genetic fitness’ because that’s what happens when you hill-climb and take wins in the order a simple search process through cognitive engines encounters those wins”
Yudkowsky: I would “destroy the world” from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.
From the perspective of my highly similar fellow humans with whom I evolved in context, they’d get nice stuff, because “my fellow humans get nice stuff” happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don’t care about at all.
Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.
From Yudkowsky’s appearance on the Bankless podcast (full transcript here):
Ice cream didn’t exist in the natural environment, the ancestral environment, the environment of evolutionary adaptedness. There was nothing with that much sugar, salt, fat combined together as ice cream. We are not built to want ice cream. We were built to want strawberries, honey, a gazelle that you killed and cooked [...] We evolved to want those things, but then ice cream comes along, and it fits those taste buds better than anything that existed in the environment that we were optimized over.
[...]
Leaving that aside for a second, the reason why this metaphor breaks down is that although the humans are smarter than the chickens, we’re not smarter than evolution, natural selection, cumulative optimization power over the last billion years and change. (You know, there’s evolution before that but it’s pretty slow, just, like, single-cell stuff.)
There are things that cows can do for us, that we cannot do for ourselves. In particular, make meat by eating grass. We’re smarter than the cows, but there’s a thing that designed the cows; and we’re faster than that thing, but we’ve been around for much less time. So we have not yet gotten to the point of redesigning the entire cow from scratch. And because of that, there’s a purpose to keeping the cow around alive.
And humans, furthermore, being the kind of funny little creatures that we are — some people care about cows, some people care about chickens. They’re trying to fight for the cows and chickens having a better life, given that they have to exist at all. And there’s a long complicated story behind that. It’s not simple, the way that humans ended up in that [??]. It has to do with the particular details of our evolutionary history, and unfortunately it’s not just going to pop up out of nowhere.
But I’m drifting off topic here. The basic answer to the question “where does that analogy break down?” is that I expect the superintelligences to be able to do better than natural selection, not just better than the humans.
At this point, I’m tired, so I’m logging off. But I would bet a lot of money that I could find at least 3x the number of these examples if I had the energy to. As Alex Turner put it, it seems clear to me that, for a very high proportion of “classic” alignment arguments about inner & outer alignment problems, at least in the form espoused by MIRI, the argumentative bedrock is ultimately based on little more than analogies with evolution.
From “A central AI alignment problem: capabilities generalization, and the sharp left turn”, by Nate Soares, which, by the way, quite literally uses the exact phrase “The central analogy”; as before, emphasis is mine:
From “The basic reasons I expect AGI ruin”, by Rob Bensinger:
From “Niceness is unnatural”, by Nate Soares:
From “Superintelligent AI is necessary for an amazing future, but far from sufficient”, by Nate Soares:
From the Eliezer-edited summary of “Ngo and Yudkowsky on alignment difficulty”, by… Ngo and Yudkowsky:
From “Comments on Carlsmith’s ‘Is power-seeking AI an existential risk?’”, by Nate Soares:
From “Soares, Tallinn, and Yudkowsky discuss AGI cognition”, by… well, you get the point:
From “Humans aren’t fitness maximizers”, by Soares:
From “Shah and Yudkowsky on alignment failures”, by the usual suspects:
From the comments on “Late 2021 MIRI Conversations: AMA / Discussion”, by Yudkowsky: