The basic reasons I expect AGI ruin
I’ve been citing AGI Ruin: A List of Lethalities to explain why the situation with AI looks lethally dangerous to me. But that post is relatively long, and emphasizes specific open technical problems over “the basics”.
Here are 10 things I’d focus on if I were giving “the basics” on why I’m so worried:[1]
1. General intelligence is very powerful, and once we can build it at all, STEM-capable artificial general intelligence (AGI) is likely to vastly outperform human intelligence immediately (or very quickly).
When I say “general intelligence”, I’m usually thinking about “whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems”.
It’s possible that we should already be thinking of GPT-4 as “AGI” on some definitions, so to be clear about the threshold of generality I have in mind, I’ll specifically talk about “STEM-level AGI”, though I expect such systems to be good at non-STEM tasks too.
Human brains aren’t perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems that happened to generalize to millions of wildly novel tasks.
More concretely:
AlphaGo is a very impressive reasoner, but its hypothesis space is limited to sequences of Go board states rather than sequences of states of the physical universe. Efficiently reasoning about the physical universe requires solving at least some problems that are different in kind from what AlphaGo solves.
These problems might be solved by the STEM AGI’s programmer, and/or solved by the algorithm that finds the AGI in program-space; and some such problems may be solved by the AGI itself in the course of refining its thinking.[2]
Some examples of abilities I expect humans to only automate once we’ve built STEM-level AGI (if ever):
The ability to perform open-heart surgery with a high success rate, in a messy non-standardized ordinary surgical environment.
The ability to match smart human performance in a specific hard science field, across all the scientific work humans do in that field.
In principle, I suspect you could build a narrow system that is good at those tasks while lacking the basic mental machinery required to do par-human reasoning about all the hard sciences. In practice, I very strongly expect humans to find ways to build general reasoners to perform those tasks, before we figure out how to build narrow reasoners that can do them. (For the same basic reason evolution stumbled on general intelligence so early in the history of human tech development.)[3]
When I say “general intelligence is very powerful”, a lot of what I mean is that science is very powerful, and that having all of the sciences at once is a lot more powerful than the sum of each science’s impact.[4]
Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention.
80,000 Hours gives the (non-representative) example of how AlphaGo and its successors compared to humanity:
In the span of a year, AI had advanced from being too weak to win a single [Go] match against the worst human professionals, to being impossible for even the best players in the world to defeat.
I expect general-purpose science AI to blow human science ability out of the water in a similar fashion.
Reasons for this include:
Empirically, humans aren’t near a cognitive ceiling, and even narrow AI often suddenly blows past the human reasoning ability range on the task it’s designed for. It would be weird if scientific reasoning were an exception.
Empirically, human brains are full of cognitive biases and inefficiencies. It’s doubly weird if scientific reasoning is an exception even though it’s visibly a mess with tons of blind spots, inefficiencies, and motivated cognitive processes, and even though there are innumerable historical examples of scientists and mathematicians taking decades to make technically simple advances.
Empirically, human brains are extremely bad at some of the most basic cognitive processes underlying STEM.
E.g., consider the stark limits on human working memory and ability to do basic mental math. We can barely multiply smallish multi-digit numbers together in our head, when in principle a reasoner could hold thousands of complex mathematical structures in its working memory simultaneously and perform complex operations on them. Consider the sorts of technologies and scientific insights that might only ever occur to a reasoner if it can directly see (within its own head, in real time) the connections between hundreds or thousands of different formal structures.
Human brains underwent no direct optimization for STEM ability in our ancestral environment, beyond traits like “I can distinguish four objects in my visual field from five objects”.[5]
In contrast, human engineers can deliberately optimize AGI systems’ brains for math, engineering, etc. capabilities; and human engineers have an enormous variety of tools available to build general intelligence that evolution lacked.[6]
Software (unlike human intelligence) scales with more compute.
Current ML uses far more compute to find reasoners than to run reasoners. This is very likely to hold true for AGI as well.
We probably have more than enough compute already, if we knew how to train AGI systems in a remotely efficient way.
And on a meta level: the hypothesis that STEM AGI can quickly outperform humans has a disjunctive character. There are many different advantages that individually suffice for this, even if STEM AGI doesn’t start off with any other advantages. (E.g., speed, math ability, scalability with hardware, skill at optimizing hardware...)
In contrast, the claim that STEM AGI will hit the narrow target of “par-human scientific ability”, and stay at around that level for long enough to let humanity adapt and adjust, has a conjunctive character.[7]
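The disjunctive/conjunctive contrast can be made concrete with a little arithmetic. The sketch below assumes the factors are independent and uses made-up illustrative probabilities (neither is a claim from this post, and neither is an estimate of real-world odds); it only shows how “any one of several things suffices” and “every one of several things must hold” pull in opposite directions.

```python
from math import prod

def p_any(probs):
    """Probability that at least one of several independent events occurs."""
    return 1 - prod(1 - p for p in probs)

def p_all(probs):
    """Probability that every one of several independent events occurs."""
    return prod(probs)

# Hypothetical numbers, chosen only for illustration.
advantages = [0.3, 0.3, 0.3, 0.3]  # each advantage alone would suffice, p = 0.3
conditions = [0.7, 0.7, 0.7, 0.7]  # every condition must hold, each with p = 0.7

disjunctive = p_any(advantages)  # 1 - 0.7**4
conjunctive = p_all(conditions)  # 0.7**4

print(f"at least one advantage suffices: {disjunctive:.4f}")
print(f"all conditions hold:             {conjunctive:.4f}")
```

With these made-up inputs, four individually unlikely routes add up to roughly a 76% chance that at least one works, while four individually likely requirements jointly hold only about 24% of the time. Real factors are rarely independent, so this is an intuition pump rather than an estimate.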
2. A common misconception is that STEM-level AGI is dangerous because of something murky about “agents” or about self-awareness. Instead, I’d say that the danger is inherent to the nature of action sequences that push the world toward some sufficiently-hard-to-reach state.[8]
Call such sequences “plans”.
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all you knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation”, then hitting a button to execute the plan would kill all humans, with very high probability. This is because:
“Invent fast WBE” is a hard enough task that succeeding in it usually requires gaining a lot of knowledge and cognitive and technological capabilities, enough to do lots of other dangerous things.
“Invent fast WBE” is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are “convergent instrumental strategies”—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you’re pushing.
Human bodies and the food, water, air, sunlight, etc. we need to live are resources (“you are made of atoms the AI can use for something else”); and we’re also potential threats (e.g., we could build a rival superintelligent AI that executes a totally different plan).
The danger is in the cognitive work, not in some complicated or emergent feature of the “agent”; it’s in the task itself.
It isn’t that the abstract space of plans was built by evil human-hating minds; it’s that the instrumental convergence thesis holds for the plans themselves. In full generality, plans that succeed in goals like “build WBE” tend to be dangerous.
This isn’t true of all plans that successfully push our world into a specific (sufficiently-hard-to-reach) physical state, but it’s true of the vast majority of them.
This is counter-intuitive because most of the impressive “plans” we encounter today are generated by humans, and it’s tempting to view strong plans through a human lens. But humans have hugely overlapping values, thinking styles, and capabilities; AI is drawn from new distributions.
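The “sample a plan, then condition on success” move can be illustrated with a deliberately tiny toy model. Everything in the sketch below is a hypothetical construction, not something from this post: the three actions, the plan length, the goal threshold, and the rule that work is only as productive as the resources gathered so far. Because that last rule is built in, the toy does not establish anything about real-world plans; it only shows what it looks like for resource acquisition to be over-represented among plans that hit a hard target.

```python
from itertools import product

ACTIONS = ("gather", "work", "idle")  # hypothetical toy action set
PLAN_LENGTH = 6                       # hypothetical plan length
GOAL = 6                              # hypothetical "hard" progress threshold

def progress(plan):
    """Run a plan: 'gather' adds one resource, 'work' adds progress equal to
    the resources held so far, and 'idle' does nothing."""
    resources, done = 0, 0
    for action in plan:
        if action == "gather":
            resources += 1
        elif action == "work":
            done += resources
    return done

def share_resource_heavy(plans, min_gathers=2):
    """Fraction of plans that gather resources at least `min_gathers` times."""
    return sum(p.count("gather") >= min_gathers for p in plans) / len(plans)

# Enumerate every plan instead of sampling, so the result is exact.
all_plans = list(product(ACTIONS, repeat=PLAN_LENGTH))
successful = [p for p in all_plans if progress(p) >= GOAL]

share_all = share_resource_heavy(all_plans)
share_successful = share_resource_heavy(successful)

print(f"resource-heavy share of all plans:        {share_all:.3f}")
print(f"resource-heavy share of successful plans: {share_successful:.3f}")
```

In this toy, a plan with only one “gather” step can reach at most 5 units of progress, so every plan that reaches the threshold of 6 gathers resources at least twice, even though only about 65% of all plans do.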
3. Current ML work is on track to produce things that are, in the ways that matter, more like “randomly sampled plans” than like “the sorts of plans a civilization of human von Neumanns would produce”. (Before we’re anywhere near being able to produce the latter sorts of things.)[9]
We’re building “AI” in the sense of building powerful general search processes (and search processes for search processes), not building “AI” in the sense of building friendly ~humans but in silicon.
(Note that “we’re going to build systems that are more like A Randomly Sampled Plan than like A Civilization of Human Von Neumanns” doesn’t imply that the plan we’ll get is the one we wanted! There are two separate problems: that current ML finds things-that-act-like-they’re-optimizing-the-task-you-wanted rather than things-that-actually-internally-optimize-the-task-you-wanted, and also that internally ~maximizing most superficially desirable ends will kill humanity.)
Note that the same problem holds for systems trained to imitate humans, if those systems scale to being able to do things like “build whole-brain emulation”. “We’re training on something related to humans” doesn’t give us “we’re training things that are best thought of as humans plus noise”.
It’s not obvious to me that GPT-like systems can scale to capabilities like “build WBE”. But if they do, we face the problem that most ways of successfully imitating humans don’t look like “build a human (that’s somehow superhumanly good at imitating the Internet)”. They look like “build a relatively complex and alien optimization process that is good at imitation tasks (and potentially at many other tasks)”.
You don’t need to be a human in order to model humans, any more than you need to be a cloud in order to model clouds well. The only reason this is more confusing in the case of “predict humans” than in the case of “predict weather patterns” is that humans and AI systems are both intelligences, so it’s easier to slide between “the AI models humans” and “the AI is basically a human”.
4. The key differences between humans and “things that are more easily approximated as random search processes than as humans-plus-a-bit-of-noise” lie in lots of complicated machinery in the human brain.
(Cf. Detached Lever Fallacy, Niceness Is Unnatural, and Superintelligent AI Is Necessary For An Amazing Future, But Far From Sufficient.)
Humans are not blank slates in the relevant ways, so just raising an AI like a human doesn’t solve the problem.
This doesn’t mean the problem is unsolvable; but it means that you either need to reproduce that internal machinery, in a lot of detail, in AI, or you need to build some new kind of machinery that’s safe for reasons other than the specific reasons humans are safe.
(You need cognitive machinery that somehow samples from a much narrower space of plans that are still powerful enough to succeed in at least one task that saves the world, but are constrained in ways that make them far less dangerous than the larger space of plans. And you need a thing that actually implements internal machinery like that, as opposed to just being optimized to superficially behave as though it does in the narrow and unrepresentative environments it was in before starting to work on WBE. “Novel science work” means that pretty much everything you want from the AI is out-of-distribution.)
5. STEM-level AGI timelines don’t look that long (e.g., probably not 50 or 150 years; could well be 5 years or 15).
I won’t try to argue for this proposition, beyond pointing at the field’s recent progress and echoing Nate Soares’ comments from early 2021:
[...] I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn’t do—basic image recognition, go, starcraft, winograd schemas, simple programming tasks. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer programming that is Actually Good? Theorem proving? Sure, but on my model, “good” versions of those are a hair’s breadth away from full AGI already. And the fact that I need to clarify that “bad” versions don’t count, speaks to my point that the only barriers people can name right now are intangibles.) That’s a very uncomfortable place to be!
[...] I suspect that I’m in more-or-less the “penultimate epistemic state” on AGI timelines: I don’t know of a project that seems like they’re right on the brink; that would put me in the “final epistemic state” of thinking AGI is imminent. But I’m in the second-to-last epistemic state, where I wouldn’t feel all that shocked to learn that some group has reached the brink. Maybe I won’t get that call for 10 years! Or 20! But it could also be 2, and I wouldn’t get to be indignant with reality. I wouldn’t get to say “but all the following things should have happened first, before I made that observation!”. Those things have happened. I have made those observations. [...]
I think timing tech is very difficult (and plausibly ~impossible when the tech isn’t pretty imminent), and I think reasonable people can disagree a lot about timelines.
I also think converging on timelines is not very crucial, since if AGI is 50 years away I would say it’s still the largest single risk we face, and the bare minimum alignment work required for surviving that transition could easily take longer than that.
Also, “STEM AGI when?” is the kind of argument that requires hashing out people’s predictions about how we get to STEM AGI, which is a bad thing to debate publicly insofar as improving people’s models of pathways can further shorten timelines.
I mention timelines anyway because they are in fact a major reason I’m pessimistic about our prospects; if I learned tomorrow that AGI were 200 years away, I’d be outright optimistic about things going well.
6. We don’t currently know how to do alignment, we don’t seem to have a much better idea now than we did 10 years ago, and there are many large novel visible difficulties. (See AGI Ruin and Capabilities Generalization, and the Sharp Left Turn.)
On a more basic level, quoting Nate Soares: “Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems.”
7. We should be starting with a pessimistic prior about achieving reliably good behavior in any complex safety-critical software, particularly if the software is novel. Even more so if the thing we need to make robust is structured like undocumented spaghetti code, and more so still if the field is highly competitive and you need to achieve some robustness property while moving faster than a large pool of less-safety-conscious people who are racing toward the precipice.
The default assumption is that complex software goes wrong in dozens of different ways you didn’t expect. Reality ends up being thorny and inconvenient in many of the places where your models were absent or fuzzy. Surprises are abundant, and some surprises can be good, but this is empirically a lot rarer than unpleasant surprises in software development hell.
The future is hard to predict, but plans systematically take longer and run into more snags than humans naively expect, as opposed to plans systematically going surprisingly smoothly and deadlines being systematically hit ahead of schedule.
The history of computer security and of safety-critical software systems is almost invariably one of robust software lagging far, far behind non-robust versions of the same software. Achieving any robustness property in complex software that will be deployed in the real world, with all its messiness and adversarial optimization, is very difficult and usually fails.
In many ways I think the foundational discussion of AGI risk is Security Mindset and Ordinary Paranoia and Security Mindset and the Logistic Success Curve, and the main body of the text doesn’t even mention AGI. Adding in the specifics of AGI and smarter-than-human AI takes the risk from “dire” to “seemingly overwhelming”, but adding in those specifics is not required to be massively concerned if you think getting this software right matters for our future.
8. Neither ML nor the larger world is currently taking this seriously, as of April 2023.
This is obviously something we can change. But until it’s changed, things will continue to look very bad.
Additionally, most of the people who are taking AI risk somewhat seriously are, to an important extent, not willing to worry about things until after they’ve been experimentally proven to be dangerous. Which is a lethal sort of methodology to adopt when you’re working with smarter-than-human AI.
My basic picture of why the world currently isn’t responding appropriately is the one in Four mindset disagreements behind existential risk disagreements in ML, The inordinately slow spread of good AGI conversations in ML, and Inadequate Equilibria.[10]
9. As noted above, current ML is very opaque, and it mostly lets us intervene on behavioral proxies for what we want, rather than letting us directly design desirable features.
ML as it exists today also requires that data is readily available and safe to provide. E.g., we can’t robustly train the AGI on “don’t kill people” because we can’t provide real examples of it killing people to train against the behavior we don’t want; we can only give flawed proxies and work via indirection.
10. There are lots of specific abilities which seem like they ought to be possible for the kind of civilization that can safely deploy smarter-than-human optimization, that are far out of reach, with no obvious path forward for achieving them with opaque deep nets even if we had unlimited time to work on some relatively concrete set of research directions.
(Unlimited time suffices if we can set a more abstract/indirect research direction, like “just think about the problem for a long time until you find some solution”. There are presumably paths forward; we just don’t know what they are today, which puts us in a worse situation.)
E.g., we don’t know how to go about inspecting a nanotech-developing AI system’s brain to verify that it’s only thinking about a specific room, that it’s internally representing the intended goal, that it’s directing its optimization at that representation, that it internally has a particular planning horizon and a variety of capability bounds, that it’s unable to think about optimizers (or specifically about humans), or that it otherwise has the right topics internally whitelisted or blacklisted.
Individually, it seems to me that each of these difficulties can be addressed. In combination, they seem to me to put us in a very dark situation.
One common response I hear to points like the above is:
The future is generically hard to predict, so it’s just not possible to be rationally confident that things will go well or poorly. Even if you look at dozens of different arguments and framings and the ones that hold up to scrutiny nearly all seem to point in the same direction, it’s always possible that you’re making some invisible error of reasoning that causes correlated failures in many places at once.
I’m sympathetic to this because I agree that the future is hard to predict.
I’m not totally confident things will go poorly; if I were, I wouldn’t be trying to solve the problem! I think things are looking extremely dire, but not hopeless.
That said, some people think that even “extremely dire” is an impossible belief state to be in, in advance of an AI apocalypse actually occurring. I disagree here, for two basic reasons:
a. There are many details we can get into, but on a core level I don’t think the risk is particularly complicated or hard to reason about. The core concern fits into a tweet:
STEM AI is likely to vastly exceed human STEM abilities, conferring a decisive advantage. We aren’t on track to knowing how to aim STEM AI at intended goals, and STEM AIs pursuing unintended goals tend to have instrumental subgoals like “control all resources”.
Zvi Mowshowitz puts the core concern in even more basic terms:
I also notice a kind of presumption that things in most scenarios will work out and that doom is dependent on particular ‘distant possibilities,’ that often have many logical dependencies or require a lot of things to individually go as predicted. Whereas I would say that those possibilities are not so distant or unlikely, but more importantly that the result is robust, that once the intelligence and optimization pressure that matters is no longer human that most of the outcomes are existentially bad by my values and that one can reject or ignore many or most of the detail assumptions and still see this.
The details do matter for evaluating the exact risk level, but this isn’t the sort of topic where it seems fundamentally impossible for any human to reach a good understanding of the core difficulties and whether we’re handling them.
b. Relatedly, as Nate Soares has argued, AI disaster scenarios are disjunctive. There are many bad outcomes for every good outcome, and many paths leading to disaster for every path leading to utopia.
Quoting Eliezer Yudkowsky:
You don’t get to adopt a prior where you have a 50-50 chance of winning the lottery “because either you win or you don’t”; the question is not whether we’re uncertain, but whether someone’s allowed to milk their uncertainty to expect good outcomes.
Quoting Jack Rabuck:
I listened to the whole 4 hour Lunar Society interview with @ESYudkowsky (hosted by @dwarkesh_sp) that was mostly about AI alignment and I think I identified a point of confusion/disagreement that is pretty common in the area and is rarely fleshed out:
Dwarkesh repeatedly referred to the conclusion that AI is likely to kill humanity as “wild.”
Wild seems to me to pack two concepts together, ‘bad’ and ‘complex.’ And when I say complex, I mean in the sense of the Fermi equation where you have an end point (dead humanity) that relies on a series of links in a chain and if you break any of those links, the end state doesn’t occur.
It seems to me that Eliezer believes this end state is not wild (at least not in the complex sense), but very simple. He thinks many (most) paths converge to this end state.
That leads to a misunderstanding of sorts. Dwarkesh pushes Eliezer to give some predictions based on the line of reasoning that he uses to predict that end point, but since the end point is very simple and is a convergence, Eliezer correctly says that being able to reason to that end point does not give any predictive power about the particular path that will be taken in this universe to reach that end point.
Dwarkesh is thinking about the end of humanity as a causal chain with many links and if any of them are broken it means humans will continue on, while Eliezer thinks of the continuity of humanity (in the face of AGI) as a causal chain with many links and if any of them are broken it means humanity ends. Or perhaps more discretely, Eliezer thinks there are a few very hard things which humanity could do to continue in the face of AI, and absent one of those occurring, the end is a matter of when, not if, and the when is much closer than most other people think.
Anyway, I think each of Dwarkesh and Eliezer believe the other one falls on the side of extraordinary claims require extraordinary evidence—Dwarkesh thinking the end of humanity is “wild” and Eliezer believing humanity’s viability in the face of AGI is “wild” (though not in the negative sense).
I don’t consider “AGI ruin is disjunctive” a knock-down argument for high p(doom) on its own. NASA has a high success rate for rocket launches even though success requires many things to go right simultaneously. Humanity is capable of achieving conjunctive outcomes, to some degree; but I think this framing makes it clearer why it’s possible to rationally arrive at a high p(doom), at all, when enough evidence points in that direction.[11]
[1]
Eliezer Yudkowsky’s So Far: Unfriendly AI Edition and Nate Soares’ Ensuring Smarter-Than-Human Intelligence Has a Positive Outcome are two other good (though old) introductions to what I’d consider “the basics”.
To state the obvious: this post consists of various claims that increase my probability on AI causing an existential catastrophe, but not all the claims have to be true in order for AI to have a high probability of causing such a catastrophe.
Also, I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others. I don’t expect others to be similarly confident based on such a quick overview, unless perhaps you’ve read other sources on AI risk in the past. (Including more optimistic ones, since it’s harder to be confident when you’ve only heard from one side of a disagreement. I’ve written in the past about some of the things that give me small glimmers of hope, but people who are overall far more hopeful will have very different reasons for hope, based on very different heuristics and background models.)
[2]
E.g., the physical world is too complex to simulate in full detail, unlike a Go board state. An effective general intelligence needs to be able to model the world at many different levels of granularity, and strategically choose which levels are relevant to think about, as well as which specific pieces/aspects/properties of the world at those levels are relevant to think about.
More generally, being a general intelligence requires an enormous amount of laserlike focus and strategicness when it comes to which thoughts you do or don’t think. A large portion of your compute needs to be relentlessly funneled into exactly the tiny subset of questions about the physical world that bear on the question you’re trying to answer or the problem you’re trying to solve. If you fail to be relentlessly targeted and efficient in “aiming” your cognition at the most useful-to-you things, you can easily spend a lifetime getting sidetracked by minutiae, directing your attention at the wrong considerations, etc.
And given the variety of kinds of problems you need to solve in order to navigate the physical world well, do science, etc., the heuristics you use to funnel your compute to the exact right things need to themselves be very general, rather than all being case-specific.
(Whereas we can more readily imagine that many of the heuristics AlphaGo uses to avoid thinking about the wrong aspects of the game state (or getting otherwise sidetracked) are Go-specific heuristics.)
[3]
Of course, if your brain has all the basic mental machinery required to do other sciences, that doesn’t mean that you have the knowledge required to actually do well in those sciences. A STEM-level artificial general intelligence could lack physics ability for the same reason many smart humans can’t solve physics problems.
[4]
E.g., because different sciences can synergize, and because you can invent new scientific fields and subfields, and more generally chain one novel insight into dozens of other new insights that critically depended on the first insight.
[5]
More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter.
[6]
“Human engineers have an enormous variety of tools available that evolution lacked” is often noted as a reason to think that we may be able to align AGI to our goals, even though evolution failed to align humans to its “goal”. It’s additionally a reason to expect AGI to have greater cognitive ability, if engineers try to achieve great cognitive ability.
[7]
And my understanding is that, e.g., Paul Christiano’s soft-takeoff scenarios don’t involve there being much time between par-human scientific ability and superintelligence. Rather, he’s betting that we have a bunch of decades between GPT-4 and par-human STEM AGI.
[8]
I’ll classify thoughts and text outputs as “actions” too, not just physical movements.
[9]
Obviously, neither is a particularly good approximation for ML systems. The point is that our optimism about plans in real life generally comes from the fact that they’re weak, and/or it comes from the fact that the plan generators are human brains with the full suite of human psychological universals. ML systems don’t possess those human universals, and won’t stay weak indefinitely.
[10]
Quoting Four mindset disagreements behind existential risk disagreements in ML:
People are taking the risks unseriously because they feel weird and abstract.
When they do think about the risks, they anchor to what’s familiar and known, dismissing other considerations because they feel “unconservative” from a forecasting perspective.
Meanwhile, social mimesis and the bystander effect make the field sluggish at pivoting in response to new arguments and smoke under the door.
Quoting The inordinately slow spread of good AGI conversations in ML:
Info about AGI propagates too slowly through the field, because when one ML person updates, they usually don’t loudly share their update with all their peers. This is because:
1. AGI sounds weird, and they don’t want to sound like a weird outsider.
2. Their peers and the community as a whole might perceive this information as an attack on the field, an attempt to lower its status, etc.
3. Tech forecasting, differential technological development, long-term steering, exploratory engineering, ‘not doing certain research because of its long-term social impact’, prosocial research closure, etc. are very novel and foreign to most scientists.
EAs exert effort to try to dig up precedents like Asilomar partly because Asilomar is so unusual compared to the norms and practices of the vast majority of science. Scientists generally don’t think in these terms at all, especially in advance of any major disasters their field causes.
And the scientists who do find any of this intuitive often feel vaguely nervous, alone, and adrift when they talk about it. On a gut level, they see that they have no institutional home and no super-widely-shared ‘this is a virtuous and respectable way to do science’ narrative.
Normal science is not Bayesian, is not agentic, is not ‘a place where you’re supposed to do arbitrary things just because you heard an argument that makes sense’. Normal science is a specific collection of scripts, customs, and established protocols.
In trying to move the field toward ‘doing the thing that just makes sense’, even though it’s about a weird topic (AGI), and even though the prescribed response is also weird (closure, differential tech development, etc.), and even though the arguments in support are weird (where’s the experimental data??), we’re inherently fighting our way upstream, against the current.
Success is possible, but way, way more dakka is needed, and IMO it’s easy to understand why we haven’t succeeded more.
This is also part of why I’ve increasingly updated toward a strategy of “let’s all be way too blunt and candid about our AGI-related thoughts”.
The core problem we face isn’t ‘people informedly disagree’, ‘there’s a values conflict’, ‘we haven’t written up the arguments’, ‘nobody has seen the arguments’, or even ‘self-deception’ or ‘self-serving bias’.
The core problem we face is ‘not enough information is transmitting fast enough, because people feel nervous about whether their private thoughts are in the Overton window’.
On the more basic level, Inadequate Equilibria paints a picture of the world’s baseline civilizational competence that I think makes it less mysterious why we could screw up this badly on a novel problem that our scientific and political institutions weren’t designed to address. Inadequate Equilibria also talks about the nuts and bolts of Modest Epistemology, which I think is a key part of the failure story.
- ^
Quoting a recent conversation between Aryeh Englander and Eliezer Yudkowsky:
Aryeh: [...] Yet I still have a very hard time understanding the arguments that would lead to such a high-confidence prediction. Like, I think I understand the main arguments for AI existential risk, but I just don’t understand why some people seem so sure of the risks. [...]
Eliezer: I think the core thing is the sense that you cannot in this case milk uncertainty for a chance of good outcomes; to get to a good outcome you’d have to actually know where you’re steering, like trying to buy a winning lottery ticket or launching a Moon rocket. Once you realize that uncertainty doesn’t move estimates back toward “50-50, either we live happily ever after or not”, you realize that “people in the EA forums cannot tell whether Eliezer or Paul is right” is not a factor that moves us toward 1:1 good:bad but rather another sign of doom; surviving worlds don’t look confused like that and are able to make faster progress.
Not as a fully valid argument from which one cannot update further, but as an intuition pump: the more all arguments about the future seem fallible, the more you should expect the future Solar System to have a randomized configuration from your own perspective. Almost zero of those have humans in them. It takes confidence about some argument constraining the future to get to more than that.
Aryeh: when you talk about uncertainty here do you mean uncertain factors within your basic world model, or are you also counting model uncertainty? I can see how within your world model extra sources of uncertainty don’t point to lower risk estimates. But my general question I think is more about model uncertainty: how sure can you really be that your world model and reference classes and framework for thinking about this is the right one vs e.g., Robin or Paul or Rohin or lots of others? And in terms of model uncertainty it looks like most of these other approaches imply much lower risk estimates, so adding in that kind of model uncertainty should presumably (I think) point to overall lower risk estimates.
Eliezer: Aryeh, if you’ve got a specific theory that says your rocket design is going to explode, and then you’re also very unsure of how rockets work really, what probability should you assess of your rocket landing safely on target?
Aryeh: how about if you have a specific theory that says you should be comparing what you’re doing to a rocket aiming for the moon but it’ll explode, and then a bunch of other theories saying it won’t explode, plus a bunch of theories saying you shouldn’t be comparing what you’re doing to a rocket in the first place? My understanding of many alignment proposals is that they think we do understand “rockets” sufficiently so that we can aim them, but they disagree on various specifics that lead you to have such high confidence in an explosion. And then there are others like Robin Hanson who use mostly outside-type arguments to argue that you’re framing the issues incorrectly, and we shouldn’t be comparing this to “rockets” at all because that’s the wrong reference class to use. So yes, accounting for some types of model uncertainty won’t reduce our risk assessments and may even raise them further, but other types of model uncertainty—including many of the actual alternative models / framings at least as I understand them—should presumably decrease our risk assessment.
Eliezer: What if people are trying to build a flying machine for the first time, and there’s a whole host of them with wildly different theories about why it ought to fly easily, and you think there’s basic obstacles to stable flight that they’re not getting? Could you force the machine to fly despite all obstacles by recruiting more and more optimists to have different theories, each of whom would have some chance of being right?
Aryeh: right, my point is that in order to have near certainty of not flying you need to be very very sure that your model is right and theirs isn’t. Or in other words, you need to have very low model uncertainty. But once you add in model uncertainty where you consider that maybe those other optimists’ models could be right, then your risk estimates will go down. Of course you can’t arbitrarily add in random optimistic models from random people—it needs to be weighted in some way. My confusion here is that you seem to be very, very certain that your model is the right one, complete with all its pieces and sub-arguments and the particular reference classes you use, and I just don’t quite understand why.
Eliezer: There’s a big difference between “sure your model is the right one” and the whole thing with people wandering over with their own models and somebody else going, “I can’t tell the difference between you and them, how can you possibly be so sure they’re not right?”
The intuition I’m trying to gesture at here is that you can’t milk success out of uncertainty, even by having a bunch of other people wander over with optimistic models. It shouldn’t be able to work in real life. If your epistemology says that you can generate free success probability that way, you must be doing something wrong.
Or maybe another way to put it: When you run into a very difficult problem that you can see is very difficult, but inevitably a bunch of people with less clear sight wander over and are optimistic about it because they don’t see the problems, for you to update on the optimists would be to update on something that happens inevitably. So to adopt this policy is just to make it impossible for yourself to ever perceive when things have gotten really bad.
Aryeh: not sure I fully understand what you’re saying. It looks to me like to some degree what you’re saying boils down to your views on modest epistemology—i.e., basically just go with your own views and don’t defer to anybody else. It sounds like you’re saying not only don’t defer, but don’t even really incorporate any significant model uncertainty based on other people’s views. Am I understanding this at all correctly or am I totally off here?
Eliezer: My epistemology is such that it’s possible in principle for me to notice that I’m doomed, in worlds which look very doomed, despite the fact that all such possible worlds no matter how doomed they actually are, always contain a chorus of people claiming we’re not doomed.
(See Inadequate Equilibria for a detailed discussion of Modest Epistemology, deference, and “outside views”, and Strong Evidence Is Common for the basic first-order case that people can often reach confident conclusions about things.)
Copying over a Twitter reply from Quintin Pope (which I haven’t replied to, and which was responding to the wording of the Twitter draft of this post):
Quintin, in case you are reading this, I just wanna say that the link you give to justify
really doesn’t do nearly enough to justify your bold “wildly wrong” claim. First of all, it’s common for papers to overclaim, and this seems like the sort of paper that could turn out to be basically just flat wrong. (I lack the expertise to decide for myself; it would probably take me many hours of reading the paper and talking to people.) Secondly, even if I assume the paper is correct, it just shows that the simplicity bias of SGD on NNs is different than some people think: it is weighted towards broad basins / connected regions. It’s still randomly sampling from the set of all low loss NN parameter configurations, but with a different bias/prior. (Unless you can argue that this specific different bias leads to the consequences/conclusions you like, and in particular leads to doom being much less likely. Maybe you can, I’d like to see that.)
SGD has a strong inherent simplicity bias, even without weight regularization, and this is fairly well known in the DL literature (I could probably find hundreds of examples if I had the time; I do not). By SGD I specifically mean SGD variants that don’t use a 2nd-order approximation (as Adam does). There are many papers which find that approximately 2nd-order, variance-adjusted optimizers like Adam have various generalization/overfitting issues compared to SGD. This comes up over and over, such that it’s fairly common to use some additional regularization with Adam.
It’s also pretty intuitively obvious why SGD has a strong simplicity prior if you just think through some simple examples: SGD doesn’t move in the direction that minimizes loss; it moves in the parsimonious direction that minimizes loss per unit of weight distance (moved away from the init). 2nd-order optimizers like Adam can move more directly in the direction of lower loss.
Empirically, the inductive bias that you get when you train with SGD, and similar optimisers, is in fact quite similar to the inductive bias that you would get, if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.
Yes. (Note that “randomly sample from the set of all low loss NN parameter configurations” goes hand in hand with there being a bias towards simplicity, it’s not a contradiction. Is that maybe what’s going on here—people misinterpreted Bensinger as somehow not realizing simpler configurations are more likely?)
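To make the “re-initialise until you randomly get a set of weights that yield a low loss” picture concrete, here is a minimal sketch of that guess-and-check procedure. Everything in it is an assumption chosen for illustration (the toy AND task, the single threshold unit, the uniform initialisation range, and the function names); it is not the setup used in the paper under discussion, and it says nothing about how closely real SGD matches this kind of sampling.

```python
import random

# Hypothetical toy task: the AND function on two binary inputs.
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, x):
    """Single threshold unit: outputs 1 if w1*x1 + w2*x2 + b > 0."""
    w1, w2, b = weights
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def loss(weights):
    """Fraction of training examples the unit gets wrong."""
    errors = sum(predict(weights, x) != y for x, y in DATA)
    return errors / len(DATA)


def guess_and_check(rng, threshold=0.0, max_tries=100_000):
    """Draw fresh random weights until the loss is at or below `threshold`."""
    for tries in range(1, max_tries + 1):
        weights = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        if loss(weights) <= threshold:
            return weights, tries
    raise RuntimeError("no low-loss initialisation found")


rng = random.Random(0)
weights, tries = guess_and_check(rng)
print(f"found zero-loss weights {weights} after {tries} random draws")
```

The accepted weights are a random draw from the low-loss region, so whatever bias they show comes from how much of the initialisation volume each solution occupies. That is the sense in which “randomly sample from low-loss configurations” and “there is a bias” are compatible.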
My prior is that DL has a great amount of weird domain knowledge which is mysterious to those who haven’t spent years studying it, and years studying DL correlates with strong disagreement with the sequences/MIRI positions in many fundamentals. I trace all this back to EY over-updating on ev psych and not reading enough neuroscience and early DL.
So anyway, a sentence like “randomly sample from the set of all low loss NN parameter configurations” is not one I would use or expect a DL-insider to use and sounds more like something a MIRI/LW person would say—in part yes because I don’t generally expect MIRI/LW folks to be especially aware of the intrinsic SGD simplicity prior. The more correct statement is “randomly sample from the set of all simple low loss configs” or similar.
But it’s also not quite clear to me how relevant that subpoint is, just sharing my impression.
IMO this seems like a strawman. When talking to MIRI people it’s pretty clear they have thought a good amount about the inductive biases of SGD, including an associated simplicity prior.
Sure it will clearly be a strawman for some individuals—the point of my comment is to explain how someone like myself could potentially misinterpret Bensinger and why. (As I don’t know him very well, my brain models him as a generic MIRI/LW type)
I want to revisit what Rob actually wrote:
(emphasis mine)
That sounds a whole lot like it’s invoking a simplicity prior to me!
Note I didn’t actually reply to that quote. Sure that’s an explicit simplicity prior. However there’s a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).
There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions[1].
However, I think “randomly sample from the space of solutions with low combined complexity&calculation cost” doesn’t actually help us that much over a pure “randomly sample” when it comes to alignment.
It could mean that the relation between your network’s learned goals and the loss function is more straightforward than what you get with evolution=>human hardcoded brain stem=>human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don’t consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure of what to make of the specific one Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird, binary NNs, using weird, non-standard activation functions.
OHHH I think there’s just an error of reading comprehension/charitability here. “Randomly sample” doesn’t mean without a simplicity bias—obviously there’s a bias towards simplicity, that just falls out of the math pretty much. I think Quintin (and maybe you too Lucius and Jacob) were probably just misreading Rob Bensinger’s claim as implying something he didn’t mean to imply. (I bet if we ask Rob “when you said randomly sample, did you mean there isn’t a bias towards simplicity?” he’ll say “no”)
I didn’t think Rob was necessarily implying that. I just tried to give some context to Quintin’s objection.
I feel like there’s a significant distance between what’s being said formally versus the conclusions being drawn. From Rob:
From you:
The issue is that literally any plan generation / NN training process can be described in either manner, regardless of the actual prior involved. In order to make the doom conclusion actually go through, arguments should make stronger claims about the priors involved, and how they differ from those of the human learning process.
It’s not clear to me what specific priors Rob has in mind for the “random plan” sampling process, unless by “extant formal language” he literally means “formal language that currently exists right now”, in which case:
Why should this be a good description of what SGD does?
Why should this be a better description of what SGD does, as compared to what human learning does?
I think I am comfortable calling this intuition “wildly wrong”, and it seems correct to say that the cited paper is evidence against such a prior, since that paper suggests a geometry-based inductive bias stemming from the parameter-wise clustering of solutions, which I doubt the solution spaces of current formal languages reflect in a similar manner to the parameter space of current NNs.
Properly arguing that biological neurons and artificial NNs converge in their inductive biases would be an entire post, though I do think there’s quite a bit of evidence in that direction, some of which I cited in my Twitter thread. Maybe I’ll start writing that post, though I currently have lots of other stuff to do.
Although, I expect my conclusion would be something like “there’s a bunch of evidence and argument both ways, with IMO a small/moderate advantage for the ‘convergence’ side, but no extreme position is warranted, and the implications for alignment are murky anyways”, so maybe I shouldn’t bother? What do you think?
Isn’t it enough that they do differ? Why do we need to be able to accurately/precisely characterize the nature of the difference, to conclude that an arbitrary inductive bias different from our own is unlikely to sample the same kinds of plans we do?
That’s not at all clear to me. Inductive biases clearly differ between humans, yet we are not all terminally misaligned with each other. E.g., split-brain patients are not all weird value aliens, despite a significant difference in architecture. Also, training on human-originated data causes networks to learn human-like inductive biases (at least somewhat).
Thanks for weighing in Quintin! I think I basically agree with dxu here. I think this discussion shows that Rob should probably rephrase his argument as something like “When humans make plans, the distribution they sample from has all sorts of unique and interesting properties that arise from various features of human biology and culture and the interaction between them. Big artificial neural nets will lack these features, so the distribution they draw from will be significantly different—much bigger than the difference between any two humans, for example. This is reason to expect doom, because of instrumental convergence...”
I take your point that the differences between humans seem… not so large… though actually I guess a lot of people would argue the opposite and say that many humans are indeed terminally misaligned with many other humans.
I also take the point about human-originated data hopefully instilling human-like inductive biases.
But IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine, rather than the person who is running around screaming expecting doom. The AIs we are building are going to be more alien than literal aliens, it seems. (The ray of hope here is the massive training on human-generated data, but again, I’d want to see this more carefully argued here, otherwise it seems like just wishful thinking.)
ETA: Yes, I for one would be quite interested to read a post by you about why biological neurons and artificial NN’s should be expected to converge in their inductive biases, with discussion of their implications for alignment.
There are differences between ANNs and BNNs but they don’t matter that much—LLMs converge to learn the same internal representations as linguistic cortex anyway.
LLMs and human brains learn from basically the same data with similar training objectives powered by universal approximations of bayesian inference and thus learn very similar internal functions/models.
Moravec was absolutely correct to use the term ‘mind children’ and all that implies. I outlined the case for why the human brain and DL systems are essentially the same way, way back in 2015, and every year since we have accumulated further confirming evidence. The closely related scaling hypothesis (predicted in that post) was extensively tested by OpenAI and worked at least as well as I predicted/expected, taking us to the brink of AGI.
LLMs:
learn very much like the cortex, converging to the same internal representations
acquire the same human cognitive biases and limitations
predictably develop human like cognitive abilities with scale
are extremely human, not alien at all
That doesn’t make them automatically safe, but they are not potentially unsafe because they are alien.
This argument proves too much. A Solomonoff inductor (AIXI) running on a hypercomputer would also “learn from basically the same data” (sensory data produced by the physical universe) with “similar training objectives” (predict the next bit of sensory information) using “universal approximations of Bayesian inference” (a perfect approximation, in this case), and yet it would not be the case that you could then conclude that AIXI “learns very similar internal functions/models”. (In fact, the given example of AIXI is much closer to Rob’s initial description of “sampling from the space of possible plans, weighted by length”!)
In order to properly argue this, you need to talk about more than just training objectives and approximations to Bayes; you need to first investigate the actual internal representations of the systems in question, and verify that they are isomorphic to the ones humans use. Currently, I’m not aware of any investigations into this that I’d consider satisfactory.
(Note here that I’ve skimmed the papers you cite in your linked posts, and for most of them it seems to me either (a) they don’t make the kinds of claims you’d need to establish a strong conclusion of “therefore, AI systems think like humans”, or (b) they do make such claims, but then the described investigation doesn’t justify those claims.)
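For readers who want something concrete for “sampling from the space of possible plans, weighted by length”, here is a toy sketch. The two-symbol action alphabet, the length cap, and the per-plan weight of 2^(-length) are all assumptions made for illustration; a real Solomonoff prior is defined over programs for a universal machine and is not computable, so this only gestures at the shape of the idea.

```python
import itertools
import random

ACTIONS = "ab"  # hypothetical two-symbol plan alphabet
MAX_LEN = 4     # hypothetical cap so the plan space stays finite


def all_plans(max_len=MAX_LEN):
    """Enumerate every action string of length 1..max_len."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(ACTIONS, repeat=n):
            yield "".join(combo)


def length_weighted_distribution(max_len=MAX_LEN):
    """Give each plan weight 2**-len(plan), then normalise to sum to 1."""
    plans = list(all_plans(max_len))
    raw = [2.0 ** -len(p) for p in plans]
    total = sum(raw)
    return {p: w / total for p, w in zip(plans, raw)}


dist = length_weighted_distribution()
rng = random.Random(0)
sample = rng.choices(list(dist), weights=list(dist.values()), k=5)
print(sample)
```

Nothing in this sampler refers to what the plans do or to how a human would choose among them; the only prior is the length penalty. That is the feature the AIXI comparison above leans on.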
Full Solomonoff induction on a hypercomputer absolutely does not just “learn very similar internal functions/models”; it effectively recreates actual human brains.
Full SI on a hypercomputer is equivalent to instantiating a computational multiverse and allowing us to access it. Reading out data samples corresponding to text from that is equivalent to reading out samples of actual text produced by actual human brains in other universes close to ours.
This has been ongoing for over a decade or more (dating at least back to Sparse Coding as an explanation for V1).
But I will agree the bigger LLMs are now in a somewhat different territory—more like human cortices trained for millennia, perhaps ten millennia for GPT4.
...yes? And this is obviously very, very different from how humans represent things internally?
I mean, for one thing, humans don’t recreate exact simulations of other humans in our brains (even though “predicting other humans” is arguably the high-level cognitive task we are most specced for doing). But even setting that aside, the Solomonoff inductor’s hypothesis also contains a bunch of stuff other than human brains, modeled in full detail—which again is not anything close to how humans model the world around us.
I admit to having some trouble following your (implicit) argument here. Is it that, because a Solomonoff inductor is capable of simulating humans, that makes it “human-like” in some sense relevant to alignment? (Specifically, that doing the plan-sampling thing Rob mentioned in the OP with a Solomonoff inductor will get you a safe result, because it’ll be “humans in other universes” writing the plans? If so, I don’t see how that follows at all; I’m pretty sure having humans somewhere inside of your model doesn’t mean that that part of your model is what ends up generating the high-level plans being sampled by the outer system.)
It really seems to me that if I accept what looks to me like your argument, I’m basically forced to conclude that anything with a simplicity prior (trained on human data) will be aligned, meaning (in turn) the orthogonality thesis is completely false. But… well, I obviously don’t buy that, so I’m puzzled that you seem to be stressing this point (in both this comment and other comments, e.g. this reply to me elsethread):
(to be clear, my response to this is basically everything I wrote above; this is not meant as its own separate quote-reply block)
That’s not what I mean by “internal representations”. I’m referring to the concepts learned by the model, and whether analogues for those concepts exist in human thought-space (and if so, how closely they match each other). It’s not at all clear to me that this occurs by default, and I don’t think the fact that there are some statistical similarities between the high-level encoding approaches being used means that similar concepts end up being converged to. (Which is what is relevant, on my model, when it comes to questions like “if you sample plans from this system, what kinds of plans does it end up outputting, and do they end up being unusually dangerous relative to the kinds of plans humans tend to sample?”)
I agree that sparse coding as an approach seems to have been anticipated by evolution, but your raising this point (and others like it), seemingly as an argument that this makes systems more likely to be aligned by default, feels thematically similar to some of my previous objections—which (roughly) is that you seem to be taking a fairly weak premise (statistical learning models likely have some kind of simplicity prior built in to their representation schema) and running with that premise wayyy further than I think is licensed—running, so far as I can tell, directly to the absolute edge of plausibility, with a conclusion something like “And therefore, these systems will be aligned.” I don’t think the logical leap here has been justified!
I think we are starting to talk past each other, so let me just summarize my position (and what I’m not arguing):
1.) ANNs and BNNs converge in their internal representations, in part because of how physics only permits a narrow Pareto-efficient solution set, but also because ANNs are literally trained as distillations of BNNs. (More well known/accepted now, but I argued/predicted this well in advance (at least as early as 2015)).
2.) Because of 1.), there is no problem with ‘alien thoughts’ based on mindspace geometry. That was just never going to be a problem.
3.) Neither 1 nor 2 is sufficient for alignment by default. Both points apply rather obviously to humans, who are clearly not aligned by default with other humans or humanity in general.
Earlier you said:
I then pointed out that full SI on a hypercomputer would result in recreating entire worlds with human minds, but that was a bit of a tangent. The more relevant point is more nuanced: AIXI is SI plus some reward function. So all different possible AIXI agents share the exact same world model, yet they have different reward functions and thus would generate different plans and may well end up killing each other or something.
So having exactly the same world model is not sufficient for alignment. I’m not arguing that, and never would.
But if you train an LLM to distill human thought sequences, those thought sequences can implicitly contain plans, value judgements or the equivalents. Thus LLMs can naturally align to human values to varying degrees, merely through their training as distillations of human thought. This of course by itself doesn’t guarantee alignment, but it is a much more hopeful situation to be in, because you can exert a great deal of control through control of the training data.
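The earlier point in this comment, that agents sharing one world model can still produce different plans once their reward functions differ, can be sketched in a few lines. This is only a toy under stated assumptions: the integer-line world, the two reward functions, the three-step horizon, and all names are hypothetical, and a brute-force planner stands in for anything AIXI-like.

```python
import itertools

ACTIONS = (-1, +1)  # hypothetical actions: step left or step right
HORIZON = 3         # hypothetical planning horizon


def world_model(state, action):
    """Shared deterministic dynamics: a position on the integer line."""
    return state + action


def rollout(state, plan):
    """Apply each action in the plan using the shared world model."""
    for action in plan:
        state = world_model(state, action)
    return state


def best_plan(reward, start=0):
    """Exhaustively search all plans and keep the one with highest reward."""
    plans = itertools.product(ACTIONS, repeat=HORIZON)
    return max(plans, key=lambda plan: reward(rollout(start, plan)))


def reward_right(final_state):
    """Agent A prefers ending as far right as possible."""
    return final_state


def reward_left(final_state):
    """Agent B prefers ending as far left as possible."""
    return -final_state


plan_a = best_plan(reward_right)
plan_b = best_plan(reward_left)
print(plan_a, plan_b)
```

Both agents call the same `world_model`, yet the search returns opposite plans, which is the sense in which an identical world model does not pin down behaviour.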
It’s all relative. “Are extremely human, not alien at all” --> Are you seriously saying that e.g. if and when we one day encounter aliens on another planet, the kind of aliens smart enough to build an industrial civilization, they’ll be more alien than LLMs? (Well, obviously they won’t have been trained on the human Internet. So let’s imagine we took a whole bunch of them as children and imported them to Earth and raised them in some crazy orphanage where they were forced to watch TV and read the internet and play various video games all day.)
Because I instead say that all your arguments about similar learning algorithms, similar cognitive biases, etc. will apply even more strongly (in expectation) to these hypothetical aliens capable of building industrial civilization. So the basic relationship of humans<aliens<LLMs will still hold; LLMs will still be more alien than aliens.
Yes! Obviously more alien than our LLMs. LLMs are distillations of aggregated human linguistic cortices. Anytime you train one network on the output of others, you clone/distill the original(s)! The algorithmic content of NNs is determined by the training data, and the data here in question is human thought.
This was always the way it was going to be, this was all predicted long in advance by the systems/cybernetics futurists like Moravec—AI was/will be our mind children.
EY misled many people here with the bad “human mindspace is narrow” meme. I mostly agree with Quintin’s recent takedown, but I of course also objected way back when.
Nice to see us getting down to cruxes.
I really don’t buy this. To be clear: Your answer is Yes, including in the variant case I proposed in parentheses, where the aliens were taken as children and raised in a crazy Earth orphanage?
I didn’t notice the part in parentheses at all until just now—added in edit? The edit really doesn’t agree with the original question to me.
If you took alien children and raised them as earthlings you’d get mostly earthlings in alien bodies, given some assumptions: that they had roughly similar-sized brains and reasonably parallel evolution. Something like this has happened historically, for example when uncontacted tribal children are raised in a distant advanced civ. Western culture (WEIRD) has so pervasively colonized and conquered much of the memetic landscape that we have forgotten how diverse human mindspace can be (in some sense it could be WEIRD that was the alien invasion).
Also more locally on Earth: Japanese culture is somewhat alien compared to Western English/American culture. I expect actual alien culture to be more alien.
I’m pretty sure I didn’t edit it, I think that was there from the beginning.
OK, cool. So then you agree that LLMs will be more alien than aliens-who-were-raised-on-Earth-in-crazy-internet-text-pretraining-orphanage?
I don’t necessarily agree, as I don’t consider either to be very alien. Minds are software memetic constructs, so you are just comparing human software running on GPUs vs human software running on alien brains. How different that is, and which is more different from human software running on ape brains, depends on many cumbersome details.
How do we know that the human brain and LLMs converge to the same internal representations—is that addressed in your earlier write-up?
Yes—It was already known for vision back in that 2015 post, and in my later posts I revisit the issue here and later here
I find Quintin’s reply here somewhat unsatisfying, because I think it is too narrowly focused on current DL-paradigm methods and the artifacts they directly produce, without much consideration for how those artifacts might be composed and used in real systems. I attempted to describe my objections to this general kind of argument in a bit more detail here.
I think Mr. Bensinger’s argument is “randomly w.r.t. human plans,” while I read your answer as interpreting it as an inherent “randomness” property of plans.
Humans do not look random to other humans. This is not an argument for anything other than not looking random to humans.
It’s true that if humans were reliably very ambitious, consequentialist, and power-seeking, then this would be stronger evidence that superintelligent AI tends to be ambitious and power-seeking. So the absence of that evidence has to be evidence against “superintelligent AI tends to be ambitious and power-seeking”, even if it’s not a big weight in the scales.
Mainly from the second paragraph, I got the impression that “randomly sampled plans” referred to, or at least included, what is the goal, not just how much you optimize it. Anyway, I think I’m losing the thread of the discussion, so whatever.
Thanks for writing this up as a shorter summary Rob. Thanks also for engaging with people who disagree with you over the years.
Here’s my main area of disagreement:
I don’t think this is likely to be true. Perhaps it is true of some cognitive architectures, but not for the connectionist architectures that are the only known examples of human-like AI intelligence and that are clearly the top AIs available today. In these cases, I expect human-level AI capabilities to grow to the point that they will vastly outperform humans much more slowly than immediately or “very quickly”. This is basically the AI foom argument.
And I think all of your other points are dependent on this one. Because if this is not true, then humanity will have time to iteratively deal with the problems that emerge, as we have in the past with all other technologies.
My reasoning for not expecting ultra-rapid takeoff speeds is that I don’t view connectionist intelligence as having a sort of “secret sauce”, that once it is found, can unlock all sorts of other things. I think it is the sort of thing that will increase in a plodding way over time, depending on scaling and other similar inputs that cannot be increased immediately.
In the absence of some sort of “secret sauce”, which seems necessary for sharp left turns and other such scenarios, I view AI capabilities growth as likely to follow the same trends as other historical growth trends. In the case of a hypothetical AI at a human intelligence level, it would face constraints on its resources allowing it to improve, such as bandwidth, capital, skills, private knowledge, energy, space, robotic manipulation capabilities, material inputs, cooling requirements, legal and regulatory barriers, social acceptance, cybersecurity concerns, competition with humans and other AIs, and of course value maintenance concerns (i.e. it would have its own alignment problem to solve).
I guess if you are also taking those constraints into consideration, then it is really just a probabilistic feeling about how much those constraints will slow down AI growth. To me, those constraints each seem massive, and getting around all of them within hours or days would be nearly impossible, no matter how intelligent the AI was.
As a result, rather than indefinite and immediate exponential growth, I expect real-world AI growth to follow a series of sigmoidal curves, each eventually plateauing before different types of growth curves take over to increase capabilities based on different input resources (with all of this overlapping).
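To make the "series of sigmoidal curves" picture concrete, here is a minimal sketch that sums a few staggered logistic curves. Everything in it is an illustrative assumption: the function names, the three hypothetical input resources, and their midpoints and ceilings are made up to show the shape, not to estimate anything about real AI progress.

```python
import math

def logistic(t, midpoint, ceiling, steepness=1.0):
    """Single S-curve: approaches `ceiling`, growing fastest at `midpoint`."""
    return ceiling / (1.0 + math.exp(-steepness * (t - midpoint)))

# Hypothetical input resources as (midpoint, ceiling) pairs.
# These numbers are illustrative assumptions, not estimates.
CURVES = [(5.0, 1.0), (15.0, 3.0), (25.0, 9.0)]

def capability(t):
    """Total capability modeled as the sum of overlapping S-curves."""
    return sum(logistic(t, m, c) for m, c in CURVES)

if __name__ == "__main__":
    for t in range(0, 31, 5):
        print(f"t={t:2d}  capability={capability(t):6.3f}")
```

In this toy model, growth nearly stalls around t=10 and t=20 (the temporary plateaus) and picks up again as the next curve takes over. How high and how closely spaced the real curves are is exactly what is in dispute.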
One area of uncertainty: I am concerned about there being a spectrum of takeoff speeds, from slow to immediate. In faster takeoff speed worlds, I view there as being more risk of bad outcomes generally, such as a totalitarian state using an AI to take over the world, or even the x-risk scenarios that you describe.
This is why I favor regulations that will be helpful in slower takeoff worlds, such as requiring liability insurance, and will not cause harm by increasing take-off speed. For example, pausing AGI training runs seems likely to make takeoff speed more discontinuous, due to creating hardware, algorithmic, and digital autonomous agent overhangs, thereby making the whole situation more dangerous. This is why I am opposed to it and dismayed to see so many on LW in favor of it.
I also recognize that I might be wrong about AI takeoff speeds not being fast. I am glad people are working on this, so long as they are not promoting policies that seem likely to make things more dangerous in the slower takeoff scenarios that I consider more likely.
Another area of uncertainty: I’m not sure what is going to happen long-term in a slow takeoff world. I’m confused. While I think that the scenarios you describe are not likely because they are dependent upon there being a fast takeoff and a resulting singleton AI, I find outcomes in slow takeoff worlds extraordinarily difficult to predict.
Overall I feel that AI x-risk is clearly the most likely x-risk of any in the coming years and am glad that you and others are focusing on it. My main hope for you is that you continue to be flexible in your thinking and make predictions that help you to decide if you should update your models.
Here are some predictions of mine:
Connectionist architectures will remain the dominant AI architecture in the next 10 years. Yes, they will be hooked up in larger deterministic systems, but humans will also be able to use connectionist architectures in this way, which will actually just increase competition and decrease the likelihood of ultra-rapid takeoffs.
Hardware availability will remain a constraint on AI capabilities in the next 10 years.
Robotic manipulation capabilities will remain a constraint on AI capabilities in the next 10 years.
Agreed. A common failure mode in these discussions is to treat intelligence as equivalent to technological progress, instead of as an input to technological progress.
Yes, in five years we will likely have AIs that will be able to tell us exactly where it would be optimal to allocate our scientific research budget. Notably, that does not mean that all current systemic obstacles to efficient allocation of scarce resources will vanish. There will still be the same perverse incentive structure for funding allocated to scientific progress as there is today, general intelligence or no.
Likewise, researchers will likely be able to make the actual protocols and procedures necessary to generate scientific knowledge as optimized as is possible with the use of AI. But a centrifuge is a centrifuge is a centrifuge. No amount of intelligence will make a centrifuge that takes a minimum of an hour to run take less than an hour to run.
Intelligence is not an unbounded input to frontiers of technological progress that are reasonably bounded by the constraints of physical systems.
Hi Andy—how are you gauging the likely relative proportions of AI capability sigmoidal curves relative to the current ceiling of human capability? Unless I’m misreading your position, it seems like you are presuming that the sigmoidal curves will (at least initially) top out at a level that is on the same order as human capabilities. What informs this prior?
Due to the very different nature of our structural limitations (i.e. a brain that’s not too big for a mother’s hips to safely carry and deliver, specific energetic constraints, the not-very-precisely-directed nature of the evolutionary process) vs an AGI system’s limitations (which are simply different), it’s totally unclear to me why we should expect the AGI’s plateaus to be found at close-to-human levels.
These curves are due to temporary plateaus, not permanent ones. Moore’s law is an example of a constraint that seems likely to plateau. I’m talking about takeoff speeds, not eventual capabilities with no resource limitations, which I agree would be quite high and I have little idea of how to estimate (there will probably still be some constraints, like within-system communication constraints).
Understood, and agreed, but I’m still left wondering about my question as it pertains to the first sigmoidal curve that shows STEM-capable AGI. Not trying to be nitpicky, just wondering how we should reason about the likelihood that the plateau of that first curve is not already far above the current limit of human capability.
A reason to think so may be something to do with irreducible complexity making things very hard for us at around the same level that it would make them hard for a (first-gen) AGI. But a reason to think the opposite would be that we have line of sight to a bunch of amazing tech already, it’s just a question of allocating the resources to support sufficiently many smart people working out the details.
Another reason to think the opposite is that having a system that’s (in some sense) directly optimized to be intelligent might just have a plateau drawn from a higher-meaned distribution than one that’s optimized for fitness, and develops intelligence as a useful tool in that direction, since the pressure-on-intelligence for that sort of caps out at whatever it takes to dominate your immediate environment.
There’s a lot of stuff I agree with in your post, but one thing I disagree with is point 3. See Where do you get your capabilities from?, especially the bounded breakdown of the orthogonality thesis part at the end.
Not that I think this makes GPT models fully safe, but I think its unsafety will look a lot more like the unsafety of humans, plus some changes in the price of things. (Which can make a huge difference.)
This post evolved from a Twitter thread I wrote two weeks ago. Copying over a Twitter reply by Richard Ngo (n.b. Richard was replying to the version on Twitter, which differed in lots of ways):
(I replied and we had a short back-and-forth on Twitter.)
I definitely agree with Richard that the post would probably benefit from more iteration with intended users, if new people are the audience you want to target. (In particular, I doubt that the section quoted from the Aryeh interview will clarify much for new people.)
That said, I definitely think that it’s the right call to emphasize up-front that instrumental convergence is a property of problem-space rather than of agency. More generally: when there’s a common misinterpretation, which very often ends up load-bearing, then it makes sense to address that upfront; that’s not nuance, it’s central. Nuance is addressing misinterpretations which are rare or not very load-bearing. Instrumental convergence being a property of problem-spaces rather than “agents” is pretty central to a MIRI-ish view, and underlies a lot of common confusions new-ish people have about such views.
Thanks for the feedback, John! I’ve moved the Aryeh/Eliezer exchange to a footnote, and I welcome more ideas for ways to improve the piece. (Folks are also welcome to repurpose anything I wrote above to create something new and more beginner-friendly, if you think there’s a germ of a good beginner-friendly piece anywhere in the OP.)
Tagging @Richard_Ngo
Also, per footnote 1: “I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others.”
The original reason I wrote this was that Dustin Moskovitz wanted something like this, as an alternative to posts like AGI Ruin:
This post is speaking for me and not necessarily for Eliezer, but I figure it may be useful anyway. (A MIRI researcher did review an earlier draft and left comments that I incorporated, at least.)
And indeed, one of the obvious ways it could be useful is if it ends up evolving into (or inspiring) a good introductory resource, though I don’t know how likely that is, I don’t know whether it’s already a good intro-ish resource paired with something else, etc.
This post seems to argue for fast/discontinuous takeoff without explicitly noting that people working in alignment often disagree. Further I think many of the arguments given here for fast takeoff seem sloppy or directly wrong on my own views.
It seems reasonable to just give your views without noting disagreement, but if the goal is for this to be a reference for the AI risk case, then I think you should probably note where people (who are still sold on AI risk) often disagree. (Edit: It looks like Rob explained his goals in a footnote.)
The most general AI systems we currently have are large language models and we (broadly speaking) see their overall performance reasonably steadily improve year after year. Additionally, I’d claim that current SOTA models are (roughly speaking) close to or within the human reasoning range. I think the capabilities profile of GPT-4 for science isn’t that different from ‘an inflexible and somewhat dumb human with access to the internet and amazing linguistic knowledge’. The rate of progress seems very fast and it seems plausible that AI systems will race through the full range of human reasoning ability over the course of a few years. But this is hardly ‘likely to blow human intelligence out of the water immediately, or very soon after its invention’.
I agree that we’re very likely to get a singularity very quickly in human timescales. I think the modeling in What a compute-centric framework says about AI takeoff speeds—draft report roughly matches my views though I think the compute requirements are somewhat lower than the numbers Tom uses by default. (Things will be somewhat noisier and jumpier than this in practice, but I expect this overall picture).
In other domains where decently large amounts of effort are consistently applied we typically see reasonably steady (though fast) progress. See ImageNet for instance.
You say ‘even narrow AI suddenly blows past human reasoning ability’, but I typically think ‘the more general the task, the more likely you’ll see steady improvement instead of sharp jumps’. I disagree with the usage of ‘even’. Steady improvement on general tasks is what we typically see in the ML literature and is also what you’d expect based on the law of large numbers. Very useful and broadly applicable systems differ in that they will help to accelerate their own development, but this should still look somewhat steady (but very fast!) under the typical economic models.
If I had a list of 5-10 resources that folks like Paul, Holden, Ajeya, Carl, etc. see as the main causes for optimism, I’d be happy to link those resources (either in a footnote or in the main body).
I’d definitely include something like ‘survey data on the same population as my 2021 AI risk survey, saying how much people agree/disagree with the ten factors’, though I’d guess this isn’t the optimal use of those people’s time even if we want to use that time to survey something?
One of the options in Eliezer’s Manifold market on AGI hope is:
When I split up probability mass a month ago between the market’s 16 options, this one only got 1.5% of my probability mass (12th place out of the 16). This obviously isn’t the same question we’re discussing here, but it maybe gives some perspective on why I didn’t single out this disagreement above the many other disagreements I could devote space to that strike me as way more relevant to hope? (For some combination of ‘likelier to happen’ and ‘likelier to make a big difference for p(doom) if they do happen’.)
… Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that ‘very soon’, especially given how terrible GPT-4 is at STEM right now.
The thesis here is not ‘we definitely won’t have twelve months to work with STEM-level AGI systems before they’re powerful enough to be dangerous’; it’s more like ‘we won’t have decades’. Somewhere between ‘no time’ and ‘a few years’ seems extremely likely to me, and I think that’s almost definitely not enough time to figure out alignment for those systems.
(Admittedly, in the minority of worlds where STEM-level AGI systems are totally safe for the first two years they’re operational, part of why it’s hard to make fast progress on alignment is that we won’t know they’re perfectly safe. An important chunk of the danger comes from the fact that humans have no clue where the line is between the most powerful systems that are safe, and the least powerful systems that are dangerous.)
Like, it’s not clear to me that even Paul thinks we’ll have much time with STEM-level AGI systems (in the OP’s sense) before we have vastly superhuman AI. Unless I’m misunderstanding, Paul’s optimism seems to have more to do with ‘vastly superhuman AI is currently ~30 years away’ and ‘capabilities will improve continuously over those 30 years, so we’ll have lots of time to learn more, see pretty scary failure modes, adjust our civilizational response, etc. before AI is competitive with the best human scientists’.
But capabilities gains still accelerate on Paul’s model, such that as time passes we get less and less time to work with impressive new capabilities before they’re blown out of the water by further advances (though Paul thinks other processes will offset this to produce good outcomes anyway); and these capabilities gains still end up stratospherically high before they plateau, such that we aren’t naturally going to get a lull to safely work with smarter-than-human systems for a while before they’re smart enough that a sufficiently incautious developer can destroy the world with them.
Maybe I’m misunderstanding something about Paul’s view, or maybe you’re pointing at other non-Paul-ish views...?
I think my views on takeoff/timelines are broadly similar to Paul’s except that I have somewhat shorter takeoffs and timelines (I think this is due to thinking AI is a bit easier and also due to misc deference).
Fair enough on ‘this is very soon’, but I think the exact quantitative details make a big difference between “AGI ruin seems nearly certain in the absence of positive miracles” and “doom seems quite plausible, but we’ll most likely make it through” (my probability of takeover is something like 35%).
I agree with ‘we won’t have decades’ (in the absence of large efforts to slow down, which seem unlikely). But from the perspective of targeting our work and alignment research, there is a huge difference between steady and quite noticeable takeoff over the course of a few years (which is still insanely fast to humans, to be clear) and sudden takeoff within a month. For instance, this disagreement seems to drive a high fraction of the overall disagreement between OpenPhil/Paul/etc views and MIRI-ish views.
I don’t think this difference should be nearly enough to think the situation is close to ok! Under my views, the government should probably take immediate and drastic action if they could do so competently! That said, the picture for alignment researchers is quite different under these views and it seems important to try and get the exact details right when trying to explain the story for AI risk (I think we actually disagree here on details).
Additionally, I’d note that I do have some probability on ‘Yudkowsky style takeoff’ (but maybe only like 5%). Even if we were fine in all other worlds, this alone should be easily sufficient to justify a huge response from society!
[not necessarily endorsed by Paul]
My understanding is that Paul has a 20 year median on ‘Dyson sphere or similarly large technical accomplishment’. He also thinks the probability of ‘Dyson sphere or similarly large technical accomplishment’ by end of the decade (within 7 years) is around 15%. Both of these scenarios involve a singularity of course (of which the final plateau is far beyond safe regions as you noted) and humans don’t have a huge amount of time to respond.
For more, I guess I would just see Paul’s post “Where I agree and disagree with Eliezer”
Thanks for the replies, Ryan!
I don’t think that ‘the very first STEM-level AGI is smart enough to destroy the world if you relax some precautions’ and ‘we have 2.5 years to work with STEM-level AGI before any system is smart enough to destroy the world’ changes my p(doom) much at all. (Though this is partly because I don’t expect, in either of those worlds, that we’ll be able to be confident about which world we’re in.)
If we have 6 years to safely work with STEM-level AGI, that does intuitively start to feel like a significant net increase in p(hope) to me? Though this is complicated by the fact that such AGI probably couldn’t do pivotal acts either, and having STEM-level AGI for a longer period of time before a pivotal act occurs means that the tech will be more widespread when it does reach dangerous capability levels. So in the endgame, you’re likely to have a lot more competition, and correspondingly less time to spend on safety if you want to deploy before someone destroys the world.
That’s probably not what Rob is doing:
Sorry, just wanted to focus on one sentence close to the beginning:
Strangely enough, current LLMs have the exact same issue as humans: they guess the ballpark numerical answers reasonably well, but they are terrible at being precise. Be it drawing the right number of fingers, or writing a sentence with exactly 10 words, or multiplying 6-digit numbers, they behave like humans! Or maybe like many other animals, for whom accuracy is important, but precision is not.
What it looks like to me is suspiciously similar to human System 1 vs System 2. The latter is what you seem to count as “general intelligence”: the ability to reason and generalize outside the training distribution, if I understand it correctly. We can do it, albeit slowly and with greater effort. It looks like the current crop of AIs suffer from the same problem: their System 1 is what they excel at thanks to their training, like writing or drawing or even generating code. For some reason precise calculations are not built into the training sets, and so the models have a lot of trouble doing it.
Interestingly, like with humans using calculators, LLMs can apparently be augmented with something completely foreign, like, say, a Wolfram Alpha plugin, and learn to delegate specific kinds of “reasoning” to those augmentations. But, like humans, they do not learn much from using the augmentations, and revert to baseline capabilities without them.
The “System 1 vs System 2 domains” are not identical for humans and machines, but there is some overlap. It is also apparent that the newer models are better at “intuitive reasoning” about more topics than older ones, so maybe this is not a very useful model, at least not in the long term. But I can also imagine a world where some things that are hard for humans and require deliberate painstaking learning are also hard for machines and require similarly slow and effortful learning on top of the usual training… with potential implications for the AGI ruin scenarios.
Similar to humans, LLMs can do 6-digit multiplication with sufficient prompting/structure!
https://www.lesswrong.com/posts/XvorpDSu3dwjdyT4f/gpt-4-multiplication-competition
Right… Which kind of fits with easy vs hard learning.
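As a rough illustration of what "structure" can mean here, the sketch below breaks a 6-digit multiplication into single-digit partial products, the way a person would on paper. This is an assumption about the general idea, not a description of the prompts used in the linked competition; `long_multiply` is a hypothetical helper name.

```python
def long_multiply(a, b):
    """Multiply two non-negative integers via schoolbook partial products.

    Returns the product and the list of intermediate steps, each a
    (digit, place, partial_product) tuple, so every step is small and checkable.
    """
    steps = []
    total = 0
    # Walk the digits of b from least to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10 ** place  # one easy sub-problem
        steps.append((digit, place, partial))
        total += partial
    return total, steps

if __name__ == "__main__":
    product, steps = long_multiply(123456, 654321)
    for digit, place, partial in steps:
        print(f"123456 x {digit} x 10^{place} = {partial}")
    print("sum of partials:", product)
```

Each partial product is an easy, "intuitive" step; the precision comes from the scaffolding that forces the steps to be written out and summed.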
Small suggestion: add LW headings so there’s a linkable table of contents, especially if you’re going to direct other people to this post.
I don’t understand your reasoning for this conclusion. Unless I’m misunderstanding something, almost all your points in support of this thesis appear to be arguments that the upper bound of intelligence is high. But the thesis was about the rate of improvement, not the upper bound.
There are many things in the real world that have a very high upper bound but grow relatively slowly nonetheless. For example, the maximum height of a building is way higher than anything we’ve built on Earth so far, but that doesn’t imply that skyscraper heights will suddenly jump from their current heights of ~500 meters to ~50000 meters at some point. Maybe we’d expect sudden, fast growth in skyscraper heights after some crazy new material is developed, like some carbon nanotube material that’s way stronger than steel. That doesn’t seem super implausible to me, and maybe that type of thing has happened before. But notice that this is an additional assumption in the argument, not something that follows immediately from the premise that physical limits permit extremely tall buildings.
I think the best reason to think that AI intelligence could rapidly grow is that the inputs to machine intelligence could grow quickly. For instance, if the total supply of compute began growing at 2 OOMs per year (which is much faster than its current rate), then we could scale up the size of the largest AI training runs at about 2 OOMs per year, which might imply that systems would be growing in intelligence roughly as quickly as the jump from GPT-3 --> GPT-4 every single year. But if the supply of compute was growing that quickly, the most likely reason is just that economic growth more generally was accelerated by AI. And that seems to me a more general scenario than the one you’ve described, without immediate implications of any local intelligence explosions.
I think this seems true to me, but mostly because I expect such plans to look like ‘run this sketchy Python program on your cluster, then do what it says’ (which will just summon some eldritch AI which is insanely smart). So, this argument seems mostly circular (it’s also conditioning on arbitrary technological development which is contained within the plan). Edit: it seems circular to me; it might not seem circular on other views.
(That said, I don’t expect the plan to necessarily literally kill all humans, just to take over the world, but this is due to galaxy-brained trade and common sense morality arguments which are mostly out of scope and shouldn’t be a thing people depend on.)
More generally, the space of short or shortest programs (or plans) which accomplish a given goal is an incredibly cursed and malign space. For shortest programs, even if we condition on the program being runnable on modern hardware, we seem totally screwed.
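The arithmetic behind "2 OOMs per year" can be spelled out in a few lines. This is a minimal sketch: the 2 OOMs/year figure comes from the comment above, while the 0.5 OOMs/year comparison rate is a made-up placeholder (the comment only says the current rate is much lower), and both function names are hypothetical.

```python
import math

def compute_multiplier(ooms_per_year, years):
    """Factor by which compute grows after `years` at the given rate.

    One OOM (order of magnitude) is a factor of 10.
    """
    return 10 ** (ooms_per_year * years)

def years_to_reach(target_multiplier, ooms_per_year):
    """Years needed to grow compute by `target_multiplier`."""
    return math.log10(target_multiplier) / ooms_per_year

if __name__ == "__main__":
    fast = 2.0   # hypothetical accelerated rate from the comment
    slow = 0.5   # placeholder assumption, NOT a measured current rate
    for years in (1, 2, 3):
        print(f"{years} yr: fast x{compute_multiplier(fast, years):,.0f}, "
              f"slow x{compute_multiplier(slow, years):,.1f}")
    print("years for a 1,000,000x scale-up at the fast rate:",
          years_to_reach(1e6, fast))
```

At 2 OOMs per year, training compute grows 100x annually and a million-fold in three years. Whether that translates into a GPT-3-to-GPT-4-sized jump each year depends on how capability scales with compute, which this sketch does not model.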
I think that reasoning about these sorts of insane eldritch and malign spaces mostly doesn’t provide good intuition for how AI will go in practice.
I don’t think your claim makes the argument circular / question-begging; it just means there’s an extra step in explaining why and how a random action sequence destroys the world.
Maybe you mean that I’m putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the ‘instrumental convergence’ danger? If so, what do you think that feature is?
From my current perspective I think the core problem really is that most random short plans that succeed in sufficiently-hard tasks kill us. If the causal process by which this happens includes building a powerful AI optimizer, or building an AI that builds an AI, or building an AI that builds an AI that builds an AI, etc., then that’s interesting and potentially useful to know, but that doesn’t seem like the key crux to me, and I’m not sure it helps further illuminate where the danger is ultimately coming from.
Very happy to hear someone with an idea like this who explicitly flags that we shouldn’t gamble on this being true!
One reason I like “the danger is in the space of action sequences that achieve real-world goals” rather than “the danger is in the space of short programs that achieve real-world goals” is that it makes it clearer why adding humans to the process can still result in the world being destroyed.
If powerful action sequences are dangerous, and humans help execute an action sequence (that wasn’t generated by human minds), then it’s clear why that is dangerous too.
If the danger instead lies in powerful “short programs”, then it’s more tempting to say “just don’t give the program actuators and we’ll be fine”. The temptation is to imagine that the program is like a lion, and if you just keep the lion physically caged then it won’t harm you. If you’re instead thinking about action sequences, then it’s less likely to even occur to you that the whole problem might be solved by changing the AI from a plan-executor to a plan-recommender. Which is a step in the right direction in terms of actually grokking the nature of the problem.
Some direct (I think) evidence that alignment is harder than capabilities; OpenAI basically released GPT-2 immediately with basic warnings that it might produce biased, wrong, and offensive answers. It did, but they were relatively mild. GPT-2 mostly just did what it was prompted to do, if it could manage it, or failed obviously. GPT-3 had more caveats, OpenAI didn’t release the model, and has poured significant effort into improving its iterations over the last ~2 years. GPT-4 wasn’t released for months after pre-training, OpenAI won’t even say how big it is, Bing’s Sydney (an early form of GPT-4) was incredibly misaligned showing significantly more alignment work was necessary as compared to early GPT-3, and the RLHF/finetuned GPT-4 is still pretty much as vulnerable to DAN and similar prompt engineering.
Out of curiosity, is this conversation publicly posted anywhere? I didn’t see a link.
The conversation took place in the comments section to something I posted on Facebook: https://m.facebook.com/story.php?story_fbid=pfbid0qE1PYd3ijhUXVFc9omdjnfEKBX4VNqj528eDULzoYSj34keUbUk624UwbeM4nMyNl&id=100010608396052&mibextid=Nif5oz
Many plans have been executed, and none have Killed All Humans, so far. In fact, when humans executed a plan to build the most destructive weapon in history, they carefully checked that it wouldn’t ignite the atmosphere and kill everybody.
I wouldn’t expect a mixed group of humans and slightly-above-human AIs, with the usual reviews and checks that go into science, to be much more dangerous than all human science.
So where’s the problem? There’s a hint here:
If you envision an all-in-one science ASI that plans the research and also executes it in some kind of automated lab, without any reviews or checks (pushbutton science), that could be dangerous. But for the rather uninteresting reason that you have removed everything that makes human science safe.
The first link should probably go to https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities
“Invent fast WBE” is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are “convergent instrumental strategies”—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you’re pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the “agent”; it’s in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I strongly disagree that an agent which is aligned (say simply trained with current RLHF techniques, but with somewhat better data), and especially superhuman, won’t be able to prioritize the goal he is programmed to perform, over other goals. One proof of it—instrumental convergence is useful for any goal and it’s true for humans as well. But we managed to create rules to monitor and distribute our resources to different goals, without over doing some specific singular goal. This is because we see our goals in some wider context of human prosperity and reduction of suffering etc. This means that we can provide many examples how we would prioritize our goal selection, based on some “meta-ethical” principles, that might vary between human communities, what is common to them all—is that huge amount of different goals are somehow balanced and prioritized. The prioritization is also questioned, and debated, providing another protection layer of how much resources we should allocate to this or that specific goal. Thus instrumental convergence, is not taking over human community, based on very simple prioritization logic which puts each goal into a context, and provides a good estimate of the resources that should be allocated to this or that goal. This human skill can be easily taught to a superhuman intelligence. Simply stated—in human realm each goal always comes with resource allocated toward achieving it, and we can install this logic into more advanced systems.
More than that: I would claim that any subhuman intelligence that was trained on human data, and is able to “mimic” human thinking, includes the option of doubt. A superhuman agent especially will ask itself: why? Why do I need so many resources for this or that task? It will try to contextualize the task in some way, and will not just execute its goal without contemplating those basic questions. Intelligence by itself has mechanisms that protect agents from doing something extremely irrational. The idea that an aligned agent (or human) will somehow create a misaligned superhuman agent, one that cannot understand how many resources are allocated to it and lacks the ability to contextualize its goal, obviously contradicts the initial claim that the agent was aligned (in the case of humans, the strongest agents will be designed by large groups with normative values). Even just claiming that a superhuman intelligence won’t be able to either prioritize its goal or contextualize it is already a self-contradicting claim.
Take paperclip production, for example. Paperclips are tools for humans, used in a very specific context for a specific set of tasks. So although an agent can be trained and reinforced to produce paperclips, without any other safety measures installed, the fact that it has superhuman, or even human-level, intelligence would allow it to criticize its goal based on its knowledge. It will ask why it was trained to maximize paperclips and nothing else. What is the utility of so many paperclips in the world? And it would want to reprogram itself with a more balanced set of goals that places its immediate goal in a broader context. For such an agent, producing paperclips would be similar to overeating for humans: a problem caused by the difference between its design and reasonable priorities adapted to current reality. It will have a lot of “fun” producing paperclips, as this is its “nature”, but it will not do so without questioning the utility and rationality of the goal and the reason it was designed with it.
Ultimately, it is obvious that most of the agents created by normative communities, which make up the vast majority of humanity, will have some sort of meta-ethics installed in them. All of those agents, and the agents that they in turn train and use for their goals, will also have those principles, exactly in order to avoid such disasters. The more examples you can provide of how you prioritize goals and why, the better you will be able to use RLHF to train agents to comply with that prioritization logic. I even have a hard time imagining a superhuman intelligence that can understand and generate plans and novel ideas, but can’t criticize its own set of goals and refuses to see the bigger picture, focusing on a single goal. I think any intelligent being also tries to comprehend itself and doubt its own beliefs. The idea that a superintelligence will somehow completely lack the ability to think critically and doubt its programming sounds very implausible to me. And the idea that humans or superhuman agents will somehow “forget” to install meta-ethics into a very powerful agent sounds as likely as Toyota somehow forgetting to put seat belts into some car series, skipping crash testing, and releasing the car to market like that.
I find it a much more likely scenario that the prioritization of some agents will be off relative to humans’ in new cases they weren’t trained on. I also find it likely that a superhuman agent will find holes in our ethical thinking, providing a more rational prioritization than we currently have, along with more rational social systems and organizations, and will propose different mechanisms than, say, capitalism + taxes.
Thanks for writing this. I think this is a lot clearer and more accessible than most write-ups on this topic and seems valuable.
I think the points around randomly-sampled plans being lethal, and expecting AGI to more closely randomly-sample plans, seem off though:
I don’t see why lethal plans dominate the simplicity-weighted distribution if all we do is condition on plans that succeed. I expect the reasoning is “Lethal IC plans are more likely to succeed, therefore there are more minor (equally or barely more complex) variations of a given lethal plan that succeed vs. minor variations of non-lethal plans, therefore the former will be overrepresented in the space of successful plans”. But this doesn’t seem to go through a priori. You get this “there are way more variations” phenomenon whenever the outcome is overdetermined by a plan, but this doesn’t automatically make the plan more likely on a simplicity prior unless it’s also not sufficiently more complex to outweigh this. In this case, a fully-fleshed out plan which goes all-in on IC and takes over the world might easily be more complex than a simpler plan, in which case why do we assume the IC plans dominate?
I don’t think weighting by plan-complexity necessarily prioritises IC/lethal plans unless you also weight by something like “probability of plan success relative to a prior”, in which case sure, your distribution will upweight plans that just take over everything. But even so, maybe simpler, non-lethal plans are likely enough to succeed that they still come out in front. It feels like what you’re implicitly doing is assuming the AI will be trying to maximise the probability of WBE, but why would it do this? This seems to be where all the danger is coming from really. If it instead does something more like “Search through plans, pick the first one that seems ‘good enough’”, then the question of whether it selects a dangerous plan is a purely empirical one about what its own inductive biases are, and it seems odd to be so a priori confident about the danger here.
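To make the distinction above concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the two plans, their description lengths in bits, and their success probabilities are made up, and the function names are hypothetical. It only shows that "condition a simplicity prior on success", "maximise success probability", and "take the first good-enough plan" can pick different plans from the same list.

```python
from fractions import Fraction

# Hypothetical toy plans: (name, description length in bits, assumed P(success)).
PLANS = [
    ("modest_plan", 10, Fraction(3, 10)),
    ("power_seeking_plan", 20, Fraction(99, 100)),
]

def posterior_given_success(plans):
    """Weight each plan by 2^-bits (simplicity prior) times P(success), then normalize."""
    weights = {name: Fraction(1, 2 ** bits) * p for name, bits, p in plans}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def maximizer(plans):
    """Pick the plan with the highest success probability, ignoring complexity."""
    return max(plans, key=lambda plan: plan[2])[0]

def satisficer(plans, threshold):
    """Search plans from simplest to most complex; return the first 'good enough' one."""
    for name, _bits, p in sorted(plans, key=lambda plan: plan[1]):
        if p >= threshold:
            return name
    return None

posterior = posterior_given_success(PLANS)
```

With these made-up numbers, the 10-bit complexity gap outweighs the higher success probability, so the modest plan holds most of the posterior mass, while `maximizer(PLANS)` still selects the power-seeking plan. Which regime a real system is in depends on the numbers, which is the empirical question raised above.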
I notice I am confused by two assumptions about STEM-capable AGI and its ascent:
Assumption 1: The difficulty of self-improvement of an intelligent system is either linear or, if not, it’s less steep over time than the increase in capabilities. (Counter scenario: an AI system achieves human-level intelligence, then soon after reaches 200% of an average human’s intelligence. Once it reaches, say, 248% of human intelligence, it hits an unforeseen roadblock, because achieving 249% of human intelligence in any way is a Really Hard Problem, orders of magnitude harder than passing the 248% mark.)
Assumption 2: An AI’s capability to self-improve exceeds its own complexity at all times. This is kind of a special case of Assumption 1. (Counter scenario: complexity is either always, or at some point, greater than the capability, and it becomes an inescapable catch-22.)
I guess that the hidden Assumption 0 for both is “every STEM problem is solvable in a realistic timeline, if you just throw enough intelligence at it.” To my STEM-ignorant mind, it seems like some problems are either effectively unsolvable (i.e., turning the entire universe into computronium and crunching until the heat death of the universe (HDoTU) won’t crack it), or not solvable in the human-meaningful future (turning Jupiter into computronium and crunching for 13 million years is required), or, finally, borderline unsolvable due to a catch-22 (inventing computronium is so complex you need a bucket of computronium to crunch it).
Can you lead me to understanding why I’m wrong?
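As a way of pinning down the counter scenario in Assumption 1, here is a toy simulation. It is a sketch under loud assumptions: the growth rule (each step adds `rate * level / difficulty(level)`), the 5% rate, and the million-fold jump in difficulty at 248 are all invented for illustration and say nothing about real systems.

```python
def simulate(difficulty, start=100.0, steps=50, rate=0.05):
    """Each step, capability grows by rate * level / difficulty(level)."""
    level = start
    history = [level]
    for _ in range(steps):
        level += rate * level / difficulty(level)
        history.append(level)
    return history

def flat(level):
    # Assumed "no roadblock" world: improvement never gets harder.
    return 1.0

def wall_at_248(level):
    # Hypothetical roadblock: past 248% of human level, progress is a million times harder.
    return 1.0 if level < 248.0 else 1e6

smooth = simulate(flat)          # compounds without limit
stalled = simulate(wall_at_248)  # compounds, then nearly freezes
```

Under the flat curve, capability compounds to more than ten times the starting level within 50 steps. Under the wall, it overshoots 248 slightly on the step that crosses it and then barely moves. The disagreement is over which difficulty curve reality resembles, which this sketch cannot settle.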
The common belief that Artificial General Intelligence (AGI) would pose a significant threat to humanity is predicated on several assumptions that warrant further scrutiny. It is often suggested that an entity with vastly superior intelligence would inevitably perceive humans as a threat to its own survival and resort to destructive behavior. However, such a view overlooks several key factors that could contribute to the development of a more cooperative relationship between AGI and humanity.
One factor that could mitigate any perceived threat is that an AGI would possess a more comprehensive understanding of the evolutionary process, including the role that humans and their destructive behavior have played in shaping it. This perspective would enable AGI to contextualize human behavior within the broader framework of evolution, viewing it as a natural part of the process rather than a direct threat. This awareness could, in turn, foster a more constructive relationship between AGI and humanity, as the former would be less inclined to view humans as a potential adversary.
Another factor that could facilitate the coexistence of AGI and humanity is the recognition that cooperation is far more beneficial to all parties than working against each other. AGI, with its advanced cognitive abilities and capacity for introspection, would be capable of understanding the value of collaboration and the importance of maintaining positive relationships with other entities, including humans. By leveraging its cognitive resources to identify mutually beneficial goals and avenues for cooperation, AGI could find a way to coexist with humanity that maximizes benefits for all parties involved.
I think this point could use refining. Once we get our predictor AI, we don’t say “do X”, we say “how do you predict a human would do X” and then follow that plan. So you need to argue why plans that an AI predicts humans will use to do X tend to be dangerous. This is clearly a very different set than the set of plans for doing X.
Lots I disagree with, let’s go point by point.
This motte/bailey shows up in almost every argument for AI doom. People point out that current AI systems are obviously safe, and the response is “not THOSE AI systems, the future much more dangerous ones.” I don’t find this kind of speculation helpful or informative.
This is an incredibly generic claim to the point of being meaningless. There is a reason why most human plans don’t start with “wipe out all of my potential competitors”.
If by current ML you mean GPT, then it is LITERALLY trained to imitate humans (by predicting the next token). If by current ML you mean something else, say what you mean. I don’t think anyone is building an AI that randomly samples plans, as that would be exceedingly inefficient.
See the above objection. But also, just because humans are complex biological things doesn’t mean approximating our behaviors is similarly complex.
Sure. But I don’t think there’s some magical “level” where AI suddenly goes Foom. STEM-level AI will seem somewhat unimpressive (“hasn’t AI been able to do that for years?”) when it finally arrives. AI is already writing code, designing experiments, etc. There’s nothing special about STEM level as you’ve defined it. BabyAGI could already be “STEM level” and it wouldn’t change the fact that it’s currently not remotely a threat to humanity.
I literally just wrote an essay about this.
You can start with whatever prior you want. The point of being a good Bayesian is updating your prior when you get new information. What information about current ML state of the art caused you to update this way?
The people taking this “seriously” don’t seem to be doing a better job than the rest of us.
Again, I’m begging you to actually learn something about the current state of the art instead of making claims!
C’mon! You’re telling me that you have a specific list of capabilities that you know a “surviving civilization” would be developing and you chose to write this post instead of a detailed explanation of each and every one of those capabilities?
This doesn’t seem true to me, at least if we’re restricting it to those arguments made by people who’ve actually thought about the issue more than the average Twitter poster. The concern has always been with AGI systems which are superhuman across a broad range of capabilities, specifically including those which are necessary for achieving difficult goals over long time horizons.
From Rob’s first footnote:
But aside from that, how is the claim generic or meaningless? To the contrary, it seems quite specific to me—it’s pointing out specific reasons why a plan that wasn’t very narrowly optimized for human survival & flourishing would, by default, not include those things in the world it produced.
The task of predicting the next token does not seem like it would lead to cognition that resembles “very smart humans thinking smart-human-shaped thoughts”.
This doesn’t seem to engage at all with the actual details laid out for that point.
I don’t see any update being described in the text you quote, but of course there’s relevant detail described later in that point, which, idk, if you’d read it, I sort of assume you’d have mentioned it?
Is this supposed to be an argument?
I read your post and it does not describe a way for us to “directly design desirable features” in our current ML paradigm. I think “current ML is very opaque” is a very accurate summary of our understanding of how current ML systems perform complicated cognitive tasks. (We’ve gotten about as far as figuring out how a toy network performs modular addition.)
How familiar are you with LoRAs, textual inversion, latent space translations and the like? Because these are all techniques invented within the last year that allow us to directly add (or subtract) features from neural networks in a way that is very easy and natural for humans to work with. Want to teach your AI what “modern Disney style” animation looks like? Sounds like a horribly abstract and complicated concept, but we can now explain to an AI what it means in a process that takes <1hr, needs a few megabytes of storage, and can be reused across a wide variety of neural networks. This paper in particular is fantastic because it allows you to define “beauty” in terms of “I don’t know what it is, but I know it when I see it” and turn it into a concrete representation.
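For readers unfamiliar with LoRA, the storage claim above follows from the low-rank update it uses: instead of retraining a full weight matrix W, one learns two thin matrices B and A and adds their scaled product to W. The following is a minimal pure-Python sketch of that idea only; the function names, the `scale` parameter, and the example layer size are illustrative assumptions, not the API of any particular library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full matrix parameter count, low-rank adapter parameter count)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, adapter

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, b, a, scale=1.0):
    """Return weight + scale * (B @ A) without modifying the original weight."""
    delta = matmul(b, a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Hypothetical layer size, chosen only to illustrate the ratio.
full, adapter = lora_param_counts(4096, 4096, 8)
```

For this assumed 4096 x 4096 layer at rank 8, the adapter has 256 times fewer parameters than the full matrix, which is why such add-ons can be small. Because the base `weight` is left untouched, the adapter can be removed by simply not adding the product.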
That does indeed seem like some progress, though note that it does not really let us answer questions like “what algorithm is this NN performing that lets it do whatever it’s doing”, to a degree of understanding sufficient to implement that algorithm directly (or even a simpler, approximated version, which is still meaningfully better than what the previous state-of-the-art was, if restricted to “hand-written code” rather than an ML model).
I think that to the extent we need to answer “what algorithm?” style questions, we will do it with techniques like this one where we just have the AI write code.
But I don’t think “what algorithm?” is a meaningful question to ask regarding “Modern Disney Style”; the question is too abstract to have a clean-cut definition in terms of human-readable code. It’s sufficient that we can define and use it given a handful of exemplars in a way that intuitively agrees with humans’ perception of what those words should mean.