Critical review of Christiano’s disagreements with Yudkowsky

This is a review of Paul Christiano’s article “where I agree and disagree with Eliezer”. Written for the LessWrong 2022 Review.

In the existential AI safety community, there is an ongoing debate between positions situated differently on some axis which doesn’t have a common agreed-upon name, but where Christiano and Yudkowsky can be regarded as representatives of the two directions[1]. For the sake of this review, I will dub the camps gravitating to the different ends of this axis “Prosers” (after prosaic alignment) and “Poets”[2]. Christiano is a Proser, and so are most people in AI safety groups in the industry. Yudkowsky is a sort-of Poet (there’s an important departure from my characterization below), and people in MIRI and the agent foundations community tend to be Poets as well.

Prosers tend to be more optimistic, lend more credence to slow takeoff, and place more value on empirical research and solving problems by reproducing them in the lab and iterating on the design. Poets tend to be more pessimistic, lend more credence to fast takeoff, and place more value on theoretical research and solving problems on paper before they become observable in existing AI systems. Few people are absolute purists in those respects: almost nobody in the community believes that e.g. empirical research or solving problems on paper in advance is completely worthless.

In this article, Christiano lists his agreements and disagreements with Yudkowsky. The resulting list can serve as a reasonable starting point for understanding the differences between the Proser and Poet positions. In this regard it is not perfect: the tone and many of the details are influenced by Christiano’s reactions to Yudkowsky’s personal idiosyncrasies and also by the specific content of Yudkowsky’s article “AGI Ruin” to which Christiano is responding. Moreover, it is in places hard to follow because Christiano responds to Yudkowsky without restating Yudkowsky’s position first. Nevertheless, it does touch on most of the key points of contention.

In this review, I will try to identify the main generators of Christiano’s disagreements with Yudkowsky and add my personal commentary. Since I can be classified as a Poet myself, my commentary is mostly critical. This doesn’t mean I agree with Yudkowsky everywhere. On many points I have significant uncertainty. On some, I disagree with both Christiano and Yudkowsky[3].

Takeoff Speeds

See also “Yudkowsky and Christiano discuss Takeoff Speeds”.

Christiano believes that AI progress will (probably) be gradual, smooth, and relatively predictable, with each advance increasing capabilities by a little, receiving widespread economic use, and being adopted by multiple actors before it is compounded by the next advance, all the way to transformative AI (TAI). This scenario is known as “slow takeoff”. Yudkowsky believes that AI progress will (probably) be erratic and involve sudden capability jumps, important advances that have only minor economic impact, and winner-takes-all[4] dynamics. That scenario is known as “fast takeoff”[5].

This disagreement is upstream of multiple other disagreements. For example:

  • In slow takeoff scenarios there’s more you can gain from experimentation and iteration (disagreement #1 in Christiano’s list), because you have AI systems similar enough to TAI for long enough before TAI arrives. In fast takeoff, the opposite is true. [EDIT: See more discussion in my followup article.]

  • The notion of “pivotal act” (disagreements #5 and #6) makes more sense in a fast takeoff world. If the takeoff is sufficiently fast, there will be one actor that creates TAI in a world where no other AI is close to transformative. The kind of AI that’s created then determines the entire future, and hence whatever this AI does constitutes a “pivotal act”.

It also figures in disagreements #2, #4, #7, #9 and #10.

Specifically regarding pivotal acts, I share some of Christiano’s dislike for the framing, both because it is assumption-laden and because it is liable to derail the debate into unproductive political arguments. Instead of talking about a “pivotal act”, I prefer to talk about an “AI defense system” [against unaligned AI]. The latter leaves more room for what shape this defense system takes (deferring to the IMO genuine uncertainty about what is realistic) while highlighting that solving technical AI alignment requires designing aligned AI which is capable enough to be used in such a defense system.

Moreover, Yudkowsky’s idea of what a pivotal act can look like (e.g. the (in)famous example “melt all GPUs”) is premised on the corrigibility pathway to alignment, whereas I consider ambitious value learning more promising (see also my Physicalist Superimitation proposal). This is a point where I disagree with Christiano and Yudkowsky both.

Coming back to takeoff speeds, here are some reasons that I think fast-ish takeoff is fairly plausible:

Missing Ingredients

Prosers tend to place a lot of importance on scaling existing techniques, and believe most of the remaining path to TAI consists of many minor improvements. Such a view naturally supports the slow takeoff. On the other hand, my theorizing leads me to identify multiple key algorithmic properties[6] that humans seem to have but current SOTA AI doesn’t. Achieving these properties is likely to require qualitative advances that will likely correspond to “spurts” of progress.

At the same time, this view also leads me to longer timelines than either Christiano’s or Yudkowsky’s. This is one of the reasons why I am overall more optimistic than Yudkowsky (the other reason is optimism about my own research agenda).

Recursive Self-Improvement

Christiano thinks that “AI smart enough to improve itself is not a crucial threshold” (disagreement #4). On his model, AI starts to substantially contribute to AI R&D at a point of time in which human-driven AI progress is ordinary (not exceedingly fast), and since the AI-R&D ability of AI at this time is at most at parity with human ability, AI-driven AI progress continues at the same slow pace, only very gradually speeding up due to the positive feedback.

Christiano’s key assumption is that there is nothing special about making AI good at self-improvement compared to any other AI ability. On the other hand, I have theoretical models that support the opposite conclusion. They suggest a scenario where powerful self-improvement abilities are unlocked early relative to the overall capability growth, and immediately become the dominant force in further advancement.

In particular, Christiano earlier advocated for a “hyperbola” model of AI progress. The hyperbola is a solution to the differential equation $\dot{x} = x^2$, which exemplifies growth that explodes over a finite timescale while staying smooth throughout. On the subject of technological paradigm shifts, he argued for modeling them as taking the maximum over multiple smooth curves: the new paradigm starts out worse than the old, and hence progress at the point where it overtakes is still continuous, if not differentiable.

My suggestion is combining the two by modeling recursive self-improvement as the maximum of two hyperbolae: at the point where the new hyperbola (representing AI qualitatively-optimally designed for self-improvement) overtakes the old (representing AI helping AI progress in “mundane” ways, such as tooling), we might get a very large increase in the derivative, bringing the time of the singularity much closer than it seemed earlier.
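
As a rough illustration of the two-hyperbola picture, here is a minimal Python sketch. It assumes each regime follows a scaled hyperbola x(t) = a / (T - t), which solves dx/dt = x^2 / a and blows up at time T. The scales and singularity times are made-up toy values chosen only so that the new curve starts below the old one and overtakes it; they are not taken from Christiano’s writing or from any forecast.

```python
# Toy model: progress is the maximum of two scaled hyperbolae.
# All parameter values below are hypothetical and purely illustrative.

def hyperbola(t, scale, t_sing):
    """x(t) = scale / (t_sing - t); solves dx/dt = x**2 / scale, diverges at t_sing."""
    return scale / (t_sing - t)


def hyperbola_rate(t, scale, t_sing):
    """Time derivative of the hyperbola above."""
    return scale / (t_sing - t) ** 2


def crossover_time(old, new):
    """Time at which the two hyperbolae take the same value."""
    (a1, t1), (a2, t2) = old, new
    return (a1 * t2 - a2 * t1) / (a1 - a2)


def progress(t, old, new):
    """Overall progress: whichever paradigm is currently ahead."""
    return max(hyperbola(t, *old), hyperbola(t, *new))


# (scale, singularity time) for each regime.
OLD = (1.0, 100.0)  # AI helping AI progress in "mundane" ways
NEW = (0.1, 20.0)   # AI designed for self-improvement; starts out behind

t_star = crossover_time(OLD, NEW)
rate_jump = hyperbola_rate(t_star, *NEW) / hyperbola_rate(t_star, *OLD)

print(f"crossover at t = {t_star:.2f}")
print(f"growth rate jumps by a factor of {rate_jump:.1f}")
print(f"projected singularity moves from t = {OLD[1]} to t = {NEW[1]}")
```

With these toy numbers the curves cross at t ≈ 11.1. Progress itself is continuous there, but the growth rate jumps by a factor of 10 and the projected singularity moves from t = 100 to t = 20.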

Attitude Towards Prosaic Alignment

Yudkowsky thinks that “there is no plan”. Christiano thinks that there is a plan, and the plan is (the sort of approaches usually bundled under) prosaic alignment. At least to a first approximation: Christiano also makes some abstract points in favor of optimism, such as the ability to experiment and iterate and the disjunctive paths to alignment (disagreement #21). He also claims that Yudkowsky has not truly engaged with this plan: see disagreements #13-15, #17-20 and #24.

Here are some reasons why I’m pessimistic about prosaic alignment:

Mesaoptimization

Mesaoptimization refers to agentic behavior inside an AI system directed towards a goal which was never specified by the designer. It is a concern because agentic behavior (i.e. competently building world models, using them to create goal-directed plans, and then executing those plans while continuing to update) seems instrumentally convergent: it is an effective means of optimizing any objective that the agent is locally incentivized to pursue. In particular, a training process selecting for a particular objective might generate agents pursuing other goals. Such agents would optimize for the training objective during training (because it’s required of them to survive) but deviate from it during deployment.

There is considerable uncertainty about how probable mesaoptimization is under various conditions. My view is that mesaoptimization becomes more concerning the more the AI’s capabilities are achieved through “brute force” methods rather than principled algorithmic design. If we understood how agents work, we would be able to design one with a goal that we know how to specify[7]. Conversely, as long as we don’t understand how agents work, we might end up with biased algorithms that only become transformative when they produce agents “by accident”.

As a concrete example, Christiano’s observation that the universal prior is malign can be interpreted as mesaoptimization that arises in inference systems whose prior privileges their own point of view. On the other hand, infra-Bayesian physicalist agents arguably don’t suffer from this problem because they don’t have that self-privilege bias.

Factored Cognition

One of the ways Christiano hopes to address mesaoptimization and other risks is through the cluster of approaches surrounding “Factored Cognition”. Christiano doesn’t discuss it in detail here, but he gestures in that direction:

Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated).

The idea of Factored Cognition is to decompose the desired goal into small steps that can be validated separately by human overseers looking at small pieces of information. This curbs the AI’s ability to attack human overseers by drastically reducing the attack surface: the less information the overseer has to review, the fewer degrees of freedom the AI has to pull tricks on that overseer.

I am skeptical about this solution. For one thing, the remaining attack surface might still be sufficient. Even if the AI can’t attack through the overseer’s exposure, it might be able to attack through some vulnerability in the computer system in which it is embedded. For another, I’m skeptical that such factorization is possible at all (in the context of creating an AI defense system).

The motivating analogy for factored cognition is formal proof validation, where validating a proof amounts to separately validating each individual step. However, I don’t think that formal proofs are a good model of general reasoning. Ironically, I see Christiano as repeating here the same mistake that plagues much of MIRI’s research. For the same reason, I’m also skeptical of the project to formalize heuristic arguments within ELK[8].

General reasoning doesn’t work by searching for formal proofs. It works by performing experiments (both real and thought), building models to explain the results, and exploiting those models to decide what to do (or think) next. When you’re reading an actual mathematical proof, you’re using a model that you learned about that formal or semiformal language. However, evaluating an informal natural language argument is much less mechanical. A sufficiently complex informal argument might require you to learn a model of a new semiformal language. And it is impossible to learn such a model from seeing only a small piece.

Competitiveness

Many prosaic alignment proposals rely on imitation learning (e.g. IDA and “learning the prior”). To some extent, this is vindicated by the recent success of foundation models. A crucial question is whether this scales well enough to be competitive with other AI designs. It is possible that any realistic solution to alignment will carry a capability penalty, but we want to avoid that as much as possible. (The larger this penalty the more we have to rely on government policy and overall civilization sanity to prevent unaligned TAI from coming first.)

The weakness of pure imitation learning is that you’re training the model to predict the source data, without providing any information about which features of the training data are actually important. For example, when a language model answers a question, it might use different ways to phrase the answer, different styles and even different spellings (e.g. British vs. American). If the user’s goal is getting a factually correct answer, all those other aspects are unimportant. But, for the cross-entropy loss function, they matter as much as the factual content. (Worse, producing a wrong answer can be incentivized if this is the sort of error the training data is likely to contain.)

As a result, imitating humans well enough to be able to solve particular tasks is much harder than just being able to solve particular tasks: the former requires correctly capturing all irrelevant human behaviors that are equally important in the cross-entropy sense. This means that a well-designed agent aimed at the same task would succeed at it with far fewer resources.
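
To make the point about cross-entropy concrete, here is a toy Python calculation. Everything in it is hypothetical: the answer is reduced to one “spelling” token and one “fact” token, the two are assumed independent, and the probabilities are chosen purely for illustration rather than measured from any real model.

```python
import math

# Toy illustration: cross-entropy does not distinguish stylistic errors
# from factual ones. All distributions below are made-up examples.


def cross_entropy(data_dist, model_dist):
    """Expected negative log-likelihood (in nats) of model_dist under data_dist."""
    return -sum(
        p * math.log(model_dist[token])
        for token, p in data_dist.items()
        if p > 0
    )


# Hypothetical training data: spelling is split evenly, the fact is always "blue".
DATA_SPELLING = {"color": 0.5, "colour": 0.5}
DATA_FACT = {"blue": 1.0}

# Model A: always factually correct, but misjudges the spelling statistics.
MODEL_A = {
    "spelling": {"color": 0.9, "colour": 0.1},
    "fact": {"blue": 1.0, "green": 0.0},
}

# Model B: matches the spelling statistics, but is right about the fact only 60% of the time.
MODEL_B = {
    "spelling": {"color": 0.5, "colour": 0.5},
    "fact": {"blue": 0.6, "green": 0.4},
}


def total_loss(model):
    """Sum of per-token cross-entropies (tokens assumed independent)."""
    return (
        cross_entropy(DATA_SPELLING, model["spelling"])
        + cross_entropy(DATA_FACT, model["fact"])
    )


loss_a = total_loss(MODEL_A)
loss_b = total_loss(MODEL_B)
print(f"model A loss: {loss_a:.4f} nats")
print(f"model B loss: {loss_b:.4f} nats")
```

Model A always states the correct fact but misjudges how often humans use each spelling, while model B matches the spelling statistics perfectly but gives the correct fact only 60% of the time. Both incur the same cross-entropy (about 1.204 nats), so the loss alone gives no reason to prefer the factually reliable model.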

Of course, it is possible to start with weak imitation learning and then use RL fine-tuning or other techniques to mutate it into a powerful agent. Indeed, RL fine-tuning of language models is already standard in practice. However, this comes at the cost of imitation’s alleged safety properties: the AI is no longer a simulation of humans but something else entirely, with the usual outer and inner alignment problems.

I think that doubts about competitiveness are part of the motivation for Christiano’s Eliciting Latent Knowledge (ELK) programme. However, ELK still hasn’t produced an end-to-end solution even by Christiano’s own standards (AFAIK).

Unknown Unknowns

The last-but-not-least problem I want to highlight is the “unknown unknowns” of deep learning. We don’t understand deep learning’s generalization properties, we don’t know which conditions the data and the loss function need to have for good generalization, and we don’t know which properties influence the sample complexity and how. While some progress has occurred, both theoretical (e.g. neural tangent kernel, singular learning theory) and empirical (e.g. scaling laws), we are still far from seeing the full picture.

Semiformal reasoning about generalization by alignment researchers tends to use something akin to bounded Solomonoff induction for inspiration (when it appeals to any mathematical model at all). While this is a great starting point for thinking about relevant questions, we also know that it’s not what deep learning is really doing. (Because the simple versions of bounded Solomonoff induction are computationally intractable.) It certainly doesn’t explain phenomena like grokking or adversarial examples.

How do the true generalization properties of deep learning affect the safety of prosaic alignment protocols, compared to the existing informal analysis? AFAICT, we have no idea. It is easily possible that practical implementations of those protocols will misgeneralize in surprising ways, with catastrophic consequences.

Christiano is well aware of this issue:

Right now I think that relevant questions about ML generalization are in fact pretty subtle; we can learn a lot about them in advance but right now just mostly don’t know.

I think that Prosers hope to “learn a lot about them in advance” in the process of designing more and more powerful AI systems. I agree that we will learn some. I’m worried that, if we don’t stop, we won’t learn enough in time.

The Metadebate

Christiano repeatedly accuses Yudkowsky of being grossly overconfident in his views. Christiano sees many of Yudkowsky’s claims as possible but uncertain at best and quite unlikely at worst. According to Christiano, Yudkowsky presents insufficient arguments to support his points, often gesturing towards his own hard-to-convey intuition. At the same time, Yudkowsky’s empirical record of predictions and R&D projects is not impressive enough to convince us that Yudkowsky’s intuition is trustworthy. (See disagreements #8-14, #20, #22, #23 and #26, and also the closing section.)

I am quite sympathetic to this criticism. In particular, Christiano mentions Yudkowsky’s wrong prediction that the Higgs boson won’t be discovered, which IMO was incredibly biased. (If you had asked me in 2009 whether the Higgs boson would be discovered, I would have been almost certain that it would.) And, Yudkowsky does deliver a lot of claims with high confidence and only vague argumentation.

On the other hand, Prosers also seem to me often overconfident, usually in the opposite direction. That said, I find Christiano less guilty of overconfidence than most.

Be that as it may, I think that Yudkowsky has an important relevant point that Christiano fails to address here. I can’t locate the exact quote, but the idea is: if your rocket design fails to account for something important, the chance it will cause the rocket to explode is much higher than the chance it will make the rocket much more fuel-efficient. When aiming for a narrow target (which human flourishing is), unknowns reduce your chances of success, not increase them. So, given the uncertainty we have about many relevant issues, we should be very worried indeed, even if we don’t have a strong argument that one of those issues will definitely kill us.

Both in this article and in writing by Prosers more generally, I find a missing mood. Namely, that our current trajectory with AI is not something remotely sane for civilization to do. Christiano writes that

I don’t think surviving worlds have a plan in the sense Eliezer is looking for. Based on what Eliezer says I don’t feel like he has a clear or accurate picture of what successful “plans” look like in the real world. I don’t see any particular reason to defer to Eliezer at all on this point.

I don’t know what sense of plan Christiano ascribes to Yudkowsky, and I don’t claim that we should defer to Yudkowsky as an expert on plans. However, at present we are rushing forward with a technology that we poorly understand, whose consequences are (as admitted by its own leading developers) going to be of historically unprecedented proportions, with barely any tools to predict or control those consequences that are not speculative and debatable. While it is reasonable to discuss which plan is the most promising even if no plan leads to a reasonably cautious trajectory, we should also point out that we are nowhere near a reasonably cautious trajectory.

The long-term consequences of AI on the universe are likely to be greater than those of a supernova explosion (which happens far from any budding civilization). Imagine that some company announced that they’re going to induce a controlled supernova explosion somewhere near the solar system, but don’t worry, they can somehow direct it so it won’t harm the Earth. Btw, the theory behind the technology is not understood. And, there are some arguments that the redirection will fail and every living thing on Earth will die, but those are just speculation. And, it worked okay when they tried it with some nukes. Well, actually it did incinerate an area it wasn’t supposed to, but they patched it in the next version. Well, actually some issues remain but they promise to patch them soon. Would you be reassured?

There are indeed many uncertainties in the discussion. The sane conclusion is: let’s stop advancing AI capabilities and think long and hard first. It’s true that sometimes experiments are needed to learn more. But, right now, we are very far from exhausting the progress that can be made with theory and experimenting-on-already-known-algorithms alone. There is no good reason to rush forward.

Miscellaneous Comments

Readers only interested in a high-level review of the Proser-Poet debate as reflected in Christiano’s article can stop here. The rest is some comments I have on specific points in the article which don’t fit into any specific overarching theme and are not very review-y. All quotes are Christiano.

Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error;

I agree that experiments can be valuable. However, experiments are especially valuable given a solid theoretical foundation that can be used to interpret and extrapolate the results of those experiments. And such foundation is sorely missing at present.

One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.

This depends on how the system is designed. For example, if you’re training the AI on some data-rich domain A and then applying it to data-scarce domain B, it is possible for the AI to fail in domain B on purpose, even though it secretly has the capability to succeed there.

By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research.

This is by no means obvious, because it might be much easier to formally specify objectives in AI research (and thereby produce relevant training data) than to formally specify objectives in alignment research.

I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.

I agree that in principle we can deliberately shape ML training, but the lack of theoretical basis means we don’t have great tools for it.

As to animal breeding: For one thing, it’s possible that people would learn to game the metrics and the entire thing would go completely off-rails. For another, breeding friendly humans might be much easier than aligning AI, because friendly humans already occur in the wild, whereas aligned superhuman AI does not.

AI systems reasoning about the code of other AI systems is not likely to be an important dynamic for early cooperation between AIs. Those AI systems look very likely to be messy, such that the only way AI systems will reason about their own or others’ code is by looking at behavior and using the same kinds of tools and reasoning strategies as humans.

AI systems today are not that messy, and many ingredients are well-documented. I expect it to remain so. It appears that Christiano is conflating the map and the territory here. The fact that we don’t understand how our own AIs work doesn’t mean that an AI won’t understand how it and other AIs work.

Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems, they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals.

You might get an AI that plays competitively most of the time, but then in a critical moment it behaves differently, such that irreversible consequences result (e.g. convincing a human to let a dangerous AI out of the box). Even if after the critical moment, SGD immediately changes the model to fix the behavior, it is already too late.

Highlighted Agreements

The tone of this review is mostly critical because there are genuine points of contention, but also because disagreeing comments tend to have more substance than agreeing comments and are therefore more alluring. To balance it a little, I wish to highlight a few more points where I find myself mostly agreeing with Christiano’s criticism of Yudkowsky.

Eliezer seems confident about the difficulty of alignment based largely on his own experiences working on the problem. But in fact society has spent very little total effort working on the problem, and MIRI itself would probably be unable to solve or even make significant progress on the large majority of problems that existing research fields routinely solve. So I think right now we mostly don’t know how hard the problem is (but it may well be very hard, and even if it’s easy we may well fail to solve it). For example, the fact that MIRI tried and failed to find a “coherent formula for corrigibility” is not much evidence that corrigibility is “unworkable.”

As I mentioned before, one of the reasons I’m more optimistic than Yudkowsky is that I believe my own research agenda will solve the problems MIRI so far failed to solve. (That said, I’m pessimistic specifically about formalizing corrigibility.)

Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.

While I believe that there are major challenges with making the verification-based alignment protocols actually foolproof, it is true and relevant that verification is easier than generation and Yudkowsky fails to acknowledge that.

Eliezer says that his list of lethalities is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong.

I also think that’s wrong. While the AI alignment community leaves a lot to be desired, Yudkowsky’s claim that “humanity still has only one gamepiece” [which is Yudkowsky] is quite overstated.

  1. ^

    Which doesn’t necessarily mean that each of them is on the very end of the spectrum.

  2. ^

    Introducing such monikers carries the danger of pushing the community towards toxic “Us vs. Them” mentality. On the other hand, it is hard to paint an accurate picture of the discourse without acknowledging the clusters. I hope the names I proposed here are sufficiently ridiculous to avoid the pitfall. In any case, I wish to assert that both sides (and other sides) have valuable contributions to the discussion, and personally I learned a lot from both Christiano and Yudkowsky. Also, anything I attribute here to Prosers or Poets might not apply to specific individuals who are “Prosaic” or “Poetic” respectively in other respects.

  3. ^

    But, since this is a review of Christiano’s article, I won’t try too hard to highlight those points.

  4. ^

    The AI is taking the “all”, not its creator.

  5. ^

    The terminology “slow” and “fast” is about the shape of the AI advancement curve, not about the duration from present until TAI (the latter is usually referred to as the “timeline”).

  6. ^

    I intentionally leave out the details.

  7. ^

    That would still leave the “outer alignment” problem of formally specifying an aligned goal sufficient for enacting an AI defense system.

  8. ^

    That said, relative to most technical alignment research, that project is actually exceptionally good. I predict that it will probably fail, but it’s still definitely worth trying.