Minor terminology note, in case discussion about “genomic/genetic bottleneck” continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard’s meaning), so genomic bottleneck seems like the better term to use.
Sam Clarke
Strong upvote, I would also love to see more discussion on the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
… seem like world models that make sense to me, given the surrounding justifications
FWIW, I don’t really understand those world models/intuitions yet:
Re: “earlier patches not generalising as well as the deep algorithms”—I don’t understand, and am sceptical about, the abstraction of “earlier patches” vs. “deep algorithms learned as intelligence is scaled up”. What gets dubbed “patches that won’t generalise well” seems to me more like “plausibly successful shaping of the model’s goals”. I don’t see why, once the model gets sufficiently smart, gradient descent would get it to throw out the goals it used to have. What am I missing?
Re: corrigibility being “anti-natural” in a certain sense—I think I just don’t understand this at all. Has it been discussed clearly anywhere else?
(jtbc, I think inner misalignment might be a big problem, I just haven’t seen any good argument for it plausibly being the main problem)
My own guess is that this is not that far-fetched.
Thanks for writing this out, I found it helpful and it’s updated me a bit towards human extinction not being that far-fetched in the ‘Part 1’ world. Though I do still think that, in this world, humans would almost certainly have very little chance of ever gaining control over our future/trajectory.
Without the argument this feels alarmist
Let me try to spell out the argument a little more—I think my original post was a little unclear. I don’t think the argument actually appeals to the “convergent instrumental value of resource acquisition”. We’re not talking about randomly sampling an objective function for AGI and asking whether it implies resource acquisition for instrumental reasons.
Rather, we’re talking about selecting an objective function for AGI using something like gradient descent on some training objective, and—instead of an aligned objective arising from this process—a resource-acquiring/influence-seeking objective emerges. This is because doing well on the training objective is a good strategy for gaining resources/influence.
Random objectives that aren’t resource/influence-seeking will be selected against by the training process, because they don’t perform well on the training objective.
On this model, the AGI will have a resource-acquiring objective function, and we don’t need to appeal to the convergent instrumental value of resource acquisition.
I’m curious if this distinction makes sense and seems right to you?
Good catch, I edited the last points in each part to make the scale of the disaster clearer, and removed the reference to gorillas.
I do think the scale of disaster is smaller (in expectation) in Part 1 than in Part 2, for the reason mentioned here—basically, the systems in Part 1 are somewhat more aligned with human intentions (albeit via poorly specified proxies for them), so there’s some chance that they leave humans alone. Whereas Part 2 is a treacherous turn inner alignment failure, where the systems learned arbitrary objectives and so have no incentive at all to keep humans alive.
I sometimes want to point people towards a very short, clear summary of What failure looks like, which doesn’t seem to exist, so here’s my attempt.
Many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
Initially, this world looks great from a human perspective, and most people are much richer than they are today.
But things then go badly in one of two ways (or more likely, a combination of both).
[Part 1] Going out with a whimper
In the training process, we used easily-measurable proxy goals as objective functions that don’t push the AI systems to do what we actually want, e.g.
‘maximise positive feedback from your operator’ instead of ‘try to help your operator get what they actually want’
‘reduce reported crimes’ instead of ‘actually prevent crime’
‘increase reported life satisfaction’ instead of ‘actually help humans live good lives’
‘increase human wealth on paper’ instead of ‘increase effective human control over resources’
(We did this because ML needs lots of data/feedback to train systems, and you can collect much more data/feedback on easily-measurable objectives.)
Due to competitive pressures, systems continue being deployed despite some people pointing out this is a bad idea.
The goals of AI systems gradually gain more influence over the future relative to human goals.
Eventually, the proxies for which the AI systems are optimising come apart from the goals we truly care about, but by then humanity won’t be able to take back influence, and we’ll have permanently lost some of our ability to steer our trajectory. In the end, we will either go extinct or be mostly disempowered.
(In some sense, this isn’t really a big departure from what is already happening today—just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
[Part 2] Going out with a bang
These AI systems end up learning objectives that are unrelated to the objective functions used in the training process, because those objectives (e.g. “don’t get shut down”) were more naturally discovered during training.
The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down).
Early in training, the best way to do that is by being obedient (since the systems understand that disobedient behaviour would get them shut down).
Then, once the systems become sufficiently capable, they attempt to acquire resources and influence to more effectively achieve their goals, including by eliminating the influence of humans. In the end, humans will most likely go extinct, because the systems have no incentive to preserve our survival.
Sam Clarke’s Shortform
If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think it’s more likely than not that the answer is yes.
What level of deployment of unaligned benchmark systems do you expect would make doom plausible? “Someone” suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?
A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is in practice
Sure, I agree this is a stronger point.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
Not really, unfortunately. In those posts, the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place—which is what I’m interested in (with the exception of Steven’s scenario; he already answered here).
The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.
I fully agree that thinking through e.g. incentives that different actors will have in the lead up to TAI, the interaction between AI technology and society, etc. is super important. But we can think through those things as well—e.g. we can look at historical examples of humanity being faced with scenarios where the global economy is (mis)aligned with human needs, and reason about the extent to which AI will be different. I’d count all of that as part of the argument to expect alignment failure. Yes, as soon as you bring societal interactions into the mix, things become a whole lot more complicated. But that isn’t a reason not to try.
As it stands, I don’t think there are super clear arguments for alignment failure that take into account interactions between AI tech and society that are ready to be distilled down, though I tried doing some of it here.
Equally, much of the discussion (and predictions of many leading thinkers in this space) is premised on technical alignment failure being the central concern (i.e. if we had better technical alignment solutions, we would manage to avoid existential catastrophe). I don’t want to argue about whether that’s correct here, but just want to point out that at least some people think that at least some of the plausible failure modes are mostly technology-driven.
So will you be distilling for an audience of pessimists or optimists?
Neither—just trying to think clearly through the arguments on both sides.
In the particular case you describe, I find the “pessimist” side more compelling, because I don’t see much evidence that humanity has really learned any lessons from oil and climate change. In particular, we still don’t know how to solve collective action problems.
This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I’d rather spend my energy in finding ways to improve the odds.
Yeah, I’m sympathetic to this line of thought, and I think I personally tend to err on the side of trying to spend too much energy on quantifying odds and not enough on acting.
However, to the extent that you’re impartial between different ways of trying to improve the odds (e.g. working on technical AI alignment vs other technical AI safety vs AI policy vs meta interventions vs other cause areas entirely), then it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
I’m broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.
to the extent that Evan has felt a need to write an entire clarification post.
Yeah, and recently there has been even more disagreement/clarification attempts.
I should have specified this on the top level question, but (as mentioned in my own answer) I’m talking about abergal’s suggestion of what inner alignment failure should refer to (basically: a model pursuing a different objective to the one it was trained on, when deployed out-of-distribution, while retaining most or all of the capabilities it had on the training distribution). I agree this isn’t crisp and is far from a mathematical formalism, but note that there are several examples of this kind of failure in current ML systems that help to clarify what the concept is, and people seem to agree on these examples.
If you can think of toy examples that make real trouble for this definition of inner alignment failure, then I’d be curious to hear what they are.
Thanks for your reply!
depends on what you mean with strongest arguments.
By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
Agree, though I expect it’s more that the emphasis needs to be different whilst the underlying argument is similar (conditional on talking about your second definition of “strongest”).
many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem.
Probably I should have clarified some more here. By “distilled”, I mean:
a really short summary (e.g. <1 page for each argument, with links to literature which discuss the argument’s premises)
that makes it clear what the epistemic status of the argument is.
Those books aren’t short, and neither do they focus on working out exactly how strong the case for alignment failure is, but rather on drawing attention to the problem and claiming that more work needs to be done on the current margin (which I absolutely agree with).
I also don’t think they focus on surveying the range of arguments for alignment failure, but rather on presenting the author’s particular view.
If there are distilled collections of arguments with these properties, please let me know!
(As some more context for my original question: I’m most interested in arguments for inner alignment failure. I’m pretty confused by the fact that some researchers seem to think inner alignment is the main problem and/or probably extremely difficult, and yet I haven’t really heard a rigorous case made for its plausibility.)
Immersion reading, i.e. reading a book and listening to the audio version at the same time. It makes it easier to read when tired, improves retention, and increases the speed at which I can comfortably read.
Most of all, with a good narrator, it makes reading fiction feel closer to watching a movie in terms of the ‘immersiveness’ of the experience (while retaining all the ways in which fiction is better than film).
It’s also very cheap and easy at the margin, if you’re willing to pay for a Kindle and an Audible subscription.
Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn’t incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)
It’s difficult to explicitly write out objective functions which express all our desires about AGI behaviour.
There’s no simple metric which we’d like our agents to maximise—rather, desirable AGI behaviour is best formulated in terms of concepts like obedience, consent, helpfulness, morality, and cooperation, which we can’t define precisely in realistic environments.
Although we might be able to specify proxies for those goals, Goodhart’s law suggests that some undesirable behaviour will score very well according to these proxies, and therefore be reinforced in AIs trained on them. (A toy sketch of this dynamic follows after this list.)
Comparatively primitive AI systems have already demonstrated many examples of outer alignment failures, even on much simpler objectives than what we would like AGIs to be able to do.
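To make the Goodhart’s law point a bit more concrete, here is a minimal toy sketch. The specific ‘task effort’ vs. ‘metric gaming’ reward functions are invented purely for illustration, not drawn from any real training setup: the point is just that when a proxy rewards gaming the metric more cheaply than doing the task, the proxy-optimal policy can score zero on the thing we actually care about.

```python
# Toy illustration of Goodhart's law. The reward functions below are
# invented for the example, not taken from any real training setup.
import numpy as np

def true_utility(task_effort, gaming_effort):
    # What we actually care about: only genuine task effort helps.
    return task_effort

def proxy_reward(task_effort, gaming_effort):
    # What we can measure: genuine effort helps, but gaming the metric
    # (e.g. inflating reported numbers) helps even more per unit effort.
    return task_effort + 3.0 * gaming_effort

def best_allocation(reward_fn, budget=1.0, steps=101):
    # Brute-force search over ways to split a fixed effort budget.
    best_split, best_r = None, -np.inf
    for task in np.linspace(0.0, budget, steps):
        gaming = budget - task
        r = reward_fn(task, gaming)
        if r > best_r:
            best_split, best_r = (task, gaming), r
    return best_split

for name, fn in [("proxy", proxy_reward), ("true objective", true_utility)]:
    task, gaming = best_allocation(fn)
    print(f"Optimising the {name}: task effort={task:.2f}, "
          f"gaming effort={gaming:.2f}, "
          f"true utility achieved={true_utility(task, gaming):.2f}")
```

Running this, the proxy-optimal allocation puts all effort into gaming the metric and achieves zero true utility, whereas optimising the true objective directly puts all effort into the task.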
Arguments for inner alignment failure, i.e. that advanced AI systems will plausibly pursue an objective other than the training objective while retaining most or all of the capabilities they had on the training distribution.[1]
There exist certain subgoals, such as “acquiring influence”, that are useful for achieving a broad range of final goals. Therefore, these may reliably lead to higher reward during training. Agents might come to value these subgoals for their own sake, and highly capable agents that e.g. want influence are likely to take adversarial action against humans.
The models we train might, instead of the complex training objective, learn heuristics that are good enough to score very well on the training distribution but break down under distributional shift. (See the toy sketch at the end of this list.)
This could happen if the model class isn’t expressive enough to learn the training objective; or because heuristics are more easily discovered (than the training objective) during the learning process.
Argument by analogy to human evolution: humans are misaligned with the goal of increasing genetic fitness.
The naive version of this argument seems quite weak to me, and could do with more investigation about just how analogous modern ML training and human evolution are.
The training objective is a narrow target among a large space of possible objectives that do well on the training distribution.
The naive version of this argument also seems quite weak to me. Lots of human achievements have involved hitting very improbable, narrow targets. I think there’s a steelman version, but I’m not going to try to give it here.
The arguments in Sections 3.2, 3.3 and 4.4 of Risks from Learned Optimization, which make the case for mesa-optimisation failure, are also relevant.
(Remember, mesa-optimisation failure is a specific kind of inner alignment failure: it’s an inner alignment failure where the learned model is an optimiser, in the sense that it is internally searching through a search space for elements that score highly according to some objective function that is explicitly represented within the system.)
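As a concrete (and entirely synthetic) sketch of the “heuristics that break down under distributional shift” point above: a classifier trained on data where a spurious shortcut feature tracks the label almost perfectly learns to rely on that shortcut, and its accuracy drops sharply once the correlation is removed at test time. The ‘real’ and ‘spurious’ features below are invented for illustration.

```python
# Minimal sketch of a learned heuristic (shortcut feature) that works on
# the training distribution but breaks under distributional shift.
# All data here is synthetic and invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Training distribution: the "spurious" feature tracks the label almost
# perfectly, while the "real" feature is only weakly predictive.
y_train = rng.integers(0, 2, n)
real_train = y_train + rng.normal(0, 2.0, n)   # noisy genuine signal
spur_train = y_train + rng.normal(0, 0.1, n)   # nearly-clean shortcut
X_train = np.column_stack([real_train, spur_train])

model = LogisticRegression().fit(X_train, y_train)

# Shifted test distribution: the shortcut no longer tracks the label.
y_test = rng.integers(0, 2, n)
real_test = y_test + rng.normal(0, 2.0, n)
spur_test = rng.normal(0.5, 0.1, n)            # decorrelated from the label
X_test = np.column_stack([real_test, spur_test])

print("Train accuracy:", model.score(X_train, y_train))  # high, via the shortcut
print("Test accuracy:", model.score(X_test, y_test))     # far lower: the shortcut fails
```

The model retains its “capabilities” in the trivial sense that it still confidently outputs predictions off-distribution; it just pursues the shortcut rather than the intended target.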
[1] This follows abergal’s suggestion of what inner alignment should refer to.
[Question] Collection of arguments to expect (outer and inner) alignment failure?
(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)
FYI, broken link in this sentence.
Distinguishing AI takeover scenarios
I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don’t shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more ‘sticky’. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordination harder. (Analogy: vaccine and mask hesitancy during Covid was partly due to insufficient trust in public health advice.) Or, more speculatively, I could also imagine an extreme version of sticky, splintered epistemic bubbles leading to moral stagnation/value lock-in.
Minor question on framing: I’m wondering why you chose to call this post “AI takeover without AGI or agency?” given that the effects of powerful persuasion tools you talk about aren’t what (I normally think of as) “AI takeover”? (Rather, if I’ve understood correctly, they are “persuasion tools as existential risk factor”, or “persuasion tools as mechanism for power concentration among humans”.)
Somewhat related: I think there could be a case made for takeover by goal-directed but narrow AI, though I haven’t really seen it made. But I can’t see a case for takeover by non-goal-directed AI, since why would AI systems without goals want to take over? I’d be interested if you have any thoughts on those two things.
only sleep when I’m tired
Sounds cool, I’m tempted to try this out, but I’m wondering how this jibes with the common wisdom that going to bed at the same time every night is important? And re: “No screens an hour before bed”—how do you know when “an hour before bed” is, if you just go to bed when tired?
I feel similarly, and still struggle with turning off my brain. Has anything worked particularly well for you?
I’m curious how you actually use the information from your Oura ring? To help measure the effectiveness of sleep interventions? As one input for deciding how to spend your day? As a motivator to sleep better? Something else?
If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I’d love to hear them!