Critical review of Christiano’s disagreements with Yudkowsky
This is a review of Paul Christiano’s article “where I agree and disagree with Eliezer”. Written for the LessWrong 2022 Review.
In the existential AI safety community, there is an ongoing debate between positions situated differently on an axis which doesn’t have a common agreed-upon name, but for which Christiano and Yudkowsky can be regarded as representatives of the two directions[1]. For the sake of this review, I will dub the camps gravitating to the two ends of this axis “Prosers” (after prosaic alignment) and “Poets”[2]. Christiano is a Proser, and so are most people in AI safety groups in industry. Yudkowsky is a typical Poet [or rather a sort-of Poet; there’s an important departure from my characterization below], and people at MIRI and in the agent foundations community tend to be Poets as well.
Prosers tend to be more optimistic, lend more credence to slow takeoff, and place more value on empirical research and solving problems by reproducing them in the lab and iterating on the design. Poets tend to be more pessimistic, lend more credence to fast takeoff, and place more value on theoretical research and solving problems on paper before they become observable in existing AI systems. Few people are absolute purists in those respects: almost nobody in the community believes that e.g. empirical research or solving problems on paper in advance is completely worthless.
In this article, Christiano lists his agreements and disagreements with Yudkowsky. The resulting list can serve as a reasonable starting point for understanding the differences between the Proser and Poet positions. In this regard it is not perfect: the tone and many of the details are influenced by Christiano’s reactions to Yudkowsky’s personal idiosyncrasies, and by the specific content of Yudkowsky’s article “AGI Ruin” to which Christiano is responding. Moreover, it is in places hard to follow because Christiano responds to Yudkowsky without restating Yudkowsky’s position first. Nevertheless, it does touch on most of the key points of contention.
In this review, I will try to identify the main generators of Christiano’s disagreements with Yudkowsky and add my personal commentary. Since I can be classified as a Poet myself, my commentary is mostly critical. This doesn’t mean I agree with Yudkowsky everywhere. On many points I have significant uncertainty. On some, I disagree with both Christiano and Yudkowsky[3].
Takeoff Speeds
See also “Yudkowsky and Christiano discuss Takeoff Speeds”.
Christiano believes that AI progress will (probably) be gradual, smooth, and relatively predictable, with each advance increasing capabilities by a little, receiving widespread economic use, and adopted by multiple actors before it is compounded by the next advance, all the way to transformative AI (TAI). This scenario is known as “slow takeoff”. Yudkowsky believes that AI progress will (probably) be erratic, involve sudden capability jumps, important advances that have only minor economic impact and winner-takes-all[4] dynamics. That scenario is known as “fast takeoff”[5].
This disagreement is upstream of multiple other disagreements. For example:
In slow takeoff scenarios there’s more you can gain from experimentation and iteration (disagreement #1 in Christiano’s list), because you have AI systems similar enough to TAI for long enough before TAI arrives. In fast takeoff, the opposite is true. [EDIT: See more discussion in my followup article.]
The notion of “pivotal act” (disagreements #5 and #6) makes more sense in a fast takeoff world. If the takeoff is sufficiently fast, there will be one actor that creates TAI in a world where no other AI is close to transformative. The kind of AI that’s created then determines the entire future, and hence whatever this AI does constitutes a “pivotal act”.
It also figures in disagreements #2, #4, #7, #9 and #10.
Specifically regarding pivotal acts, I share some of Christiano’s dislike for the framing, both because it is assumption-laden and because it is liable to derail the debate into unproductive political arguments. Instead of talking about a “pivotal act”, I prefer to talk about an “AI defense system” [against unaligned AI]. The latter leaves more room regarding what shape this defense system takes (deferring to the IMO genuine uncertainty about what is realistic), while highlighting that solving technical AI alignment requires designing an aligned AI capable enough to be used in such a defense system.
Moreover, Yudkowsky’s idea of what a pivotal act can look like (e.g. the (in)famous example “melt all GPUs”) is premised on the corrigibility pathway to alignment, whereas I consider ambitious value learning more promising (see also my Physicalist Superimitation proposal). This is a point where I disagree with Christiano and Yudkowsky both.
Coming back to takeoff speeds, here are some reasons that I think fast-ish takeoff is fairly plausible:
Missing Ingredients
Prosers tend to place a lot of importance on scaling existing techniques, and believe most of the remaining path to TAI consists of many minor improvements. Such a view naturally supports slow takeoff. On the other hand, my theorizing leads me to identify multiple key algorithmic properties[6] that humans seem to have but current SOTA AI doesn’t. Achieving these properties is likely to require qualitative advances, which will likely correspond to “spurts” of progress.
On the other hand, this view also leads me to longer timelines than either Christiano’s or Yudkowsky’s. This is one of the reasons why I am overall more optimistic than Yudkowsky (the other reason is, optimism about my own research agenda).
Recursive Self-Improvement
Christiano thinks that “AI smart enough to improve itself is not a crucial threshold” (disagreement #4). On his model, AI starts to substantially contribute to AI R&D at a point of time in which human-driven AI progress is ordinary (not exceedingly fast), and since the AI-R&D ability of AI at this time is at most at parity with human ability, AI-driven AI progress continues at the same slow pace, only very gradually speeding up due to the positive feedback.
Christiano’s key assumption is that there is nothing special about making AI good at self-improvement compared to any other AI ability. On the other hand, I have theoretical models that support the opposite conclusion. These models suggest a scenario where powerful self-improvement abilities are unlocked early relative to overall capability growth, and immediately become the dominant force in further advancement.
In particular, Christiano earlier advocated for a “hyperbola” model of AI progress. The hyperbola is a solution to the differential equation dx/dt = x², namely x(t) = 1/(T − t), which exemplifies growth that explodes over a finite timescale while staying smooth throughout. On the subject of technological paradigm shifts, he argued for modeling them as taking the maximum over multiple smooth curves: the new paradigm starts out worse than the old, and hence progress at the point where it overtakes is still continuous, even if not differentiable.
My suggestion is combining the two by modeling recursive self-improvement as the maximum of two hyperbolae: at the point where the new hyperbola (representing AI qualitatively-optimally designed for self-improvement) overtakes the old (representing AI helping AI progress in “mundane” ways, such as tooling), we might get a very large increase in the derivative, bringing the time of the singularity much closer than it seemed earlier.
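This combined model can be sketched numerically (a toy illustration with made-up parameters — `T_OLD`, `T_NEW`, and `C_NEW` are arbitrary choices, not forecasts):

```python
import numpy as np

# Hypothetical parameters (illustration only): the "mundane" curve blows up
# at t = 100; the self-improvement curve blows up at t = 50 but starts weaker.
T_OLD, T_NEW, C_NEW = 100.0, 50.0, 0.1

def old_curve(t):
    # AI helping AI progress in "mundane" ways: x(t) = 1 / (T_OLD - t)
    return 1.0 / (T_OLD - t)

def new_curve(t):
    # AI qualitatively designed for self-improvement: starts worse, explodes sooner
    return C_NEW / (T_NEW - t)

ts = np.linspace(0.0, 49.0, 4901)
progress = np.maximum(old_curve(ts), new_curve(ts))

# Takeover point: the first time the new hyperbola overtakes the old one.
takeover = ts[np.argmax(new_curve(ts) > old_curve(ts))]

# At takeover the two curves have equal *value*, but the new curve's
# *derivative* is (T_OLD - t) / (T_NEW - t) times larger, and the
# singularity jumps from t = 100 to t = 50.
d_old = 1.0 / (T_OLD - takeover) ** 2
d_new = C_NEW / (T_NEW - takeover) ** 2
```

With these made-up numbers the takeover happens around t ≈ 44.4; at that moment the derivative of the max curve jumps by roughly a factor of ten, and the projected singularity date moves from t = 100 to t = 50 — a smooth-curves model that nevertheless produces a sharp, fast-takeoff-like kink.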
Attitude Towards Prosaic Alignment
Yudkowsky thinks that “there is no plan”. Christiano thinks that there is a plan, and the plan is (the sort of approaches usually bundled under) prosaic alignment. At least to a first approximation: Christiano also makes some abstract points in favor of optimism, such as the ability to experiment and iterate and the disjunctive paths to alignment (disagreement #21). He also claims that Yudkowsky has not truly engaged with this plan: see disagreements #13-15, #17-20 and #24.
Here are some reasons why I’m pessimistic about prosaic alignment:
Mesaoptimization
Mesaoptimization refers to agentic behavior inside an AI system directed towards a goal which was never specified by the designer. It is a concern because agentic behavior (i.e. competently building world models, using them to create goal-directed plans, and then executing those plans while continuing to update) seems instrumentally convergent: it is an effective means of optimizing any objective that the agent is locally incentivized to pursue. In particular, a training process selecting for a particular objective might generate agents pursuing other goals. Such agents would optimize for the training objective during training (because it’s required of them to survive) but deviate from it during deployment.
There is considerable uncertainty about how probable mesaoptimization is under various conditions. My view is that mesaoptimization becomes more concerning the more the AI’s capabilities are achieved through “brute force” methods rather than principled algorithmic design. If we understood how agents work, we would be able to design one with a goal that we know how to specify[7]. Conversely, as long as we don’t understand how agents work, we might end up with biased algorithms that only become transformative when they produce agents “by accident”.
As a concrete example, Christiano’s observation that the universal prior is malign can be interpreted as mesaoptimization that arises in inference systems whose prior privileges their own point of view. On the other hand, infra-Bayesian physicalist agents arguably don’t suffer from this problem because they don’t have that self-privilege bias.
Factored Cognition
One of the ways Christiano hopes to address mesaoptimization and other risks is via the cluster of approaches surrounding “Factored Cognition”. Christiano doesn’t discuss it in detail here, but he gestures in that direction:
Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated).
The idea of Factored Cognition is to decompose the desired goal into small steps that can be validated separately by human overseers looking at small pieces of information. This curbs the AI’s ability to attack human overseers by drastically reducing the attack surface: the less information an overseer has to review, the fewer degrees of freedom the AI has to pull tricks on them.
I am skeptical about this solution. For one thing, the remaining attack surface might still be sufficient. Even if the AI can’t attack through the overseer’s exposure, it might be able to attack through some vulnerability in the computer system in which it is embedded. For another, I’m skeptical that such factorization is possible at all (in the context of creating an AI defense system).
The motivating analogy for Factored Cognition is formal proof validation, where validating a proof amounts to separately validating each individual step. However, I don’t think that formal proofs are a good model of general reasoning. Ironically, I see Christiano as repeating here the same mistake that plagues much of MIRI’s research. For the same reason, I’m also skeptical of the project to formalize heuristic arguments within ELK[8].
General reasoning doesn’t work by searching for formal proofs. It works by performing experiments (both real and thought experiments), building models to explain the results, and exploiting those models to decide what to do (or think) next. When you’re reading an actual mathematical proof, you’re using a model you have learned of that formal or semiformal language. Evaluating an informal natural-language argument, however, is much less mechanical. A sufficiently complex informal argument might require you to learn a model of a new semiformal language, and it is impossible to learn such a model from seeing only a small piece.
Competitiveness
Many prosaic alignment proposals rely on imitation learning (e.g. IDA and “learning the prior”). To some extent, this is vindicated by the recent success of foundation models. A crucial question is whether this scales well enough to be competitive with other AI designs. It is possible that any realistic solution to alignment will carry a capability penalty, but we want to avoid that as much as possible. (The larger this penalty the more we have to rely on government policy and overall civilization sanity to prevent unaligned TAI from coming first.)
The weakness of pure imitation learning is that you’re training the model to predict the source data, without providing any information about which features of the training data are actually important. For example, when a language model answers a question, it might use different ways to phrase the answer, different styles and even different spellings (e.g. British vs. American). If the user’s goal is getting a factually correct answer, all those other aspects are unimportant. But, for the cross-entropy loss function, they matter as much as the factual content. (Worse, producing a wrong answer can be incentivized if this is the sort of error the training data is likely to contain.)
As a result, imitating humans well enough to be able to solve particular tasks is much harder than just being able to solve those tasks: the former requires correctly capturing all the irrelevant human behaviors that are equally important in the cross-entropy sense. This means that a well-designed agent aimed at the same task would succeed at it with far fewer resources.
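The point about the loss function can be made concrete with a toy example (the distribution below is entirely hypothetical, not taken from any real model): under per-token cross-entropy, a stylistic mismatch with the training data is penalized in the same currency as a factual error.

```python
import math

# Hypothetical next-token distribution of a model answering
# "What colour is the sky?" - three factually fine phrasings, one wrong answer.
model_probs = {"blue": 0.50, "Blue": 0.30, "azure": 0.15, "green": 0.05}

def token_loss(target_token):
    # Per-token cross-entropy: -log p(target), where `target_token` is
    # whatever the training data happens to contain at this position.
    return -math.log(model_probs[target_token])

loss_lower = token_loss("blue")   # correct fact, matching style
loss_upper = token_loss("Blue")   # correct fact, different capitalization
loss_wrong = token_loss("green")  # factually wrong
```

Here −ln 0.5 ≈ 0.69 versus −ln 0.3 ≈ 1.20: the model pays extra loss merely for a capitalization mismatch, and the gradient treats that gap exactly like the gap to the factually wrong answer (−ln 0.05 ≈ 3.0). Worse, if the training data itself contained “green”, the same loss would push probability mass toward the wrong answer.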
Of course, it is possible to start with weak imitation learning and then use RL fine-tuning or other techniques to mutate it into a powerful agent. Indeed, RL fine-tuning of language models is already standard in practice. However, this comes at the cost of imitation’s alleged safety properties: the AI is no longer a simulation of humans but something else entirely, with the usual outer and inner alignment problems.
I think that doubts about competitiveness are part of the motivation for Christiano’s Eliciting Latent Knowledge (ELK) programme. However, ELK still hasn’t produced an end-to-end solution, even by Christiano’s own standards (AFAIK).
Unknown Unknowns
The last-but-not-least problem I want to highlight is the “unknown unknowns” of deep learning. We don’t understand deep learning’s generalization properties, we don’t know which conditions the data and the loss function need to have for good generalization, and we don’t know which properties influence the sample complexity and how. While some progress has occurred, both theoretical (e.g. neural tangent kernel, singular learning theory) and empirical (e.g. scaling laws), we are still far from seeing the full picture.
Semiformal reasoning about generalization by alignment researchers tends to use something akin to bounded Solomonoff induction for inspiration (when it appeals to any mathematical model at all). While this is a great starting point for thinking about relevant questions, we also know that it’s not what deep learning is really doing. (Because the simple versions of bounded Solomonoff induction are computationally intractable.) It certainly doesn’t explain phenomena like grokking or adversarial examples.
How do the true generalization properties of deep learning affect the safety of prosaic alignment protocols, compared to the existing informal analysis? AFAICT, we have no idea. It is easily possible that practical implementations of those protocols will misgeneralize in surprising ways, with catastrophic consequences.
Christiano is well aware of this issue:
Right now I think that relevant questions about ML generalization are in fact pretty subtle; we can learn a lot about them in advance but right now just mostly don’t know.
I think that Prosers hope to “learn a lot about them in advance” in the process of designing more and more powerful AI systems. I agree that we will learn some. I’m worried that, if we don’t stop, we won’t learn enough in time.
The Metadebate
Christiano repeatedly accuses Yudkowsky of being grossly overconfident in his views. Christiano sees many of Yudkowsky’s claims as possible but uncertain at best and quite unlikely at worst. According to Christiano, Yudkowsky presents insufficient arguments to support his points, often gesturing towards his own hard-to-convey intuition. At the same time, Yudkowsky’s empirical record of predictions and R&D projects is not impressive enough to convince us that Yudkowsky’s intuition is trustworthy. (See disagreements #8-14, #20, #22, #23 and #26, and also the closing section.)
I am quite sympathetic to this criticism. In particular, Christiano mentions Yudkowsky’s wrong prediction that the Higgs boson won’t be discovered, which IMO was incredibly biased. (If you asked me in 2009 whether the Higgs boson will be discovered, I would be almost certain that it will.) And, Yudkowsky does deliver a lot of claims with high confidence and only vague argumentation.
On the other hand, Prosers also seem to me often overconfident, usually in the opposite direction. That said, I find Christiano less guilty of overconfidence than most.
Be that as it may, I think that Yudkowsky has an important relevant point that Christiano fails to address here. I can’t locate the exact quote, but the idea is: if your rocket design fails to account for something important, the chance it will cause the rocket to explode is much higher than the chance it will make the rocket much more fuel-efficient. When aiming for a narrow target (which human flourishing is), unknowns reduce your chances of success rather than increase them. So, given the uncertainty we have about many relevant issues, we should be very worried indeed, even if we don’t have a strong argument that any one of those issues will definitely kill us.
Both in this article and in writing by Prosers more generally, I find a missing mood. Namely, that our current trajectory with AI is not something remotely sane for civilization to do. Christiano writes that
I don’t think surviving worlds have a plan in the sense Eliezer is looking for. Based on what Eliezer says I don’t feel like he has a clear or accurate picture of what successful “plans” look like in the real world. I don’t see any particular reason to defer to Eliezer at all on this point.
I don’t know what sense of plan Christiano ascribes to Yudkowsky, and I don’t claim that we should defer to Yudkowsky as an expert on plans. However, at present we are rushing forward with a technology that we poorly understand, whose consequences are (as admitted by its own leading developers) going to be of historically unprecedented proportions, with barely any tools to predict or control those consequences that are not speculative and debatable. While it is reasonable to discuss which plan is the most promising even if no plan leads to a reasonably cautious trajectory, we should also point out that we are nowhere near to a reasonably cautious trajectory.
The long-term consequences of AI on the universe are likely to be greater than those of a supernova explosion (which happens far from any budding civilization). Imagine that some company announced that they’re going to induce a controlled supernova explosion somewhere near the solar system, but don’t worry, they can somehow direct it so it won’t harm the Earth. Btw, the theory behind the technology is not understood. And, there are some arguments that the redirection will fail and every living thing on Earth will die, but those are just speculation. And, it worked okay when they tried it with some nukes. Well, actually it did incinerate an area it wasn’t supposed to, but they patched it in the next version. Well, actually some issues remain but they promise to patch them soon. Would you be reassured?
There are indeed many uncertainties in the discussion. The sane conclusion is: let’s stop advancing AI capabilities and think long and hard first. It’s true that sometimes experiments are needed to learn more. But, right now, we are very far from exhausting the progress that can be made with theory and experimenting-on-already-known-algorithms alone. There is no good reason to rush forward.
Miscellaneous Comments
Readers only interested in a high-level review of the Proser-Poet debate as reflected in Christiano’s article can stop here. The rest is some comments I have on specific points in the article which don’t fit into any specific overarching theme and are not very review-y. All quotes are Christiano.
Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error;
I agree that experiments can be valuable. However, experiments are especially valuable given a solid theoretical foundation that can be used to interpret and extrapolate the results of those experiments. And such foundation is sorely missing at present.
One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff.
This depends on how the system is designed. For example, if you’re training the AI on some data-rich domain A and then applying it to data-scarce domain B, it is possible for the AI to fail in domain B on purpose, even though it secretly has the capability to succeed there.
By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research.
This is by no means obvious, because it might be much easier to formally specify objectives in AI research (and thereby produce relevant training data) than to formally specify objectives in alignment research.
I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down at a level very far beyond current human abilities.
I agree that in principle we can deliberately shape ML training, but the lack of theoretical basis means we don’t have great tools for it.
As to animal breeding: For one thing, it’s possible that people would learn to game the metrics and the entire thing would go completely off-rails. For another, breeding friendly humans might be much easier than aligning AI, because friendly humans already occur in the wild, whereas aligned superhuman AI does not.
AI systems reasoning about the code of other AI systems is not likely to be an important dynamic for early cooperation between AIs. Those AI systems look very likely to be messy, such that the only way AI systems will reason about their own or others’ code is by looking at behavior and using the same kinds of tools and reasoning strategies as humans.
AI systems today are not that messy, and many ingredients are well-documented. I expect it to remain so. It appears that Christiano is conflating the map and the territory here. The fact we don’t understand how our own AIs work doesn’t mean the AI won’t understand how itself and other AIs work.
Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems, they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals.
You might get an AI that plays competitively most of the time, but then in a critical moment it behaves differently, such that irreversible consequences result (e.g. convincing a human to let a dangerous AI out of the box). Even if after the critical moment, SGD immediately changes the model to fix the behavior, it is already too late.
Highlighted Agreements
The tone of this review is mostly critical because there are genuine points of contention, but also because disagreeing comments tend to have more substance than agreeing comments and are therefore more alluring. To balance it a little, I wish to highlight a few more points where I find myself mostly agreeing with Christiano’s criticism of Yudkowsky.
Eliezer seems confident about the difficulty of alignment based largely on his own experiences working on the problem. But in fact society has spent very little total effort working on the problem, and MIRI itself would probably be unable to solve or even make significant progress on the large majority of problems that existing research fields routinely solve. So I think right now we mostly don’t know how hard the problem is (but it may well be very hard, and even if it’s easy we may well fail to solve it). For example, the fact that MIRI tried and failed to find a “coherent formula for corrigibility” is not much evidence that corrigibility is “unworkable.”
As I mentioned before, one of the reasons I’m more optimistic than Yudkowsky is that I believe my own research agenda will solve the problems MIRI so far failed to solve. (That said, I’m pessimistic specifically about formalizing corrigibility.)
Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.
While I believe that there are major challenges with making the verification-based alignment protocols actually foolproof, it is true and relevant that verification is easier than generation and Yudkowsky fails to acknowledge that.
Eliezer says that his list of lethalities is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong.
I also think that’s wrong. While the AI alignment community leaves a lot to be desired, Yudkowsky’s claim that “humanity still has only one gamepiece” [which is Yudkowsky] is quite overstated.
- ^
Which doesn’t necessarily mean that each of them is on the very end of the spectrum.
- ^
Introducing such monikers carries the danger of pushing the community towards a toxic “Us vs. Them” mentality. On the other hand, it is hard to paint an accurate description of the discourse without acknowledging the clusters. I hope the names I proposed here are sufficiently ridiculous to avoid the pitfall. In any case, I wish to assert that both sides (and other sides) have valuable contributions to the discussion, and personally I learned a lot from both Christiano and Yudkowsky. Also, anything I attribute here to Prosers or Poets might not apply to specific individuals who are “Prosaic” or “Poetic” respectively in other respects.
- ^
But, since this is a review of Christiano’s article, I won’t try too hard to highlight those points.
- ^
The AI, not its creator, is the one taking the “all”.
- ^
The terminology “slow” and “fast” is about the shape of the AI advancement curve, not about the duration from present until TAI (the latter is usually referred to as the “timeline”).
- ^
I intentionally leave out the details.
- ^
That would still leave the “outer alignment” problem of formally specifying an aligned goal sufficient for enacting an AI defense system.
- ^
That said, relative to most technical alignment research, that project is actually exceptionally good. I predict that it will probably fail, but it’s still definitely worth trying.
I disagree with my characterization as thinking problems can be solved on paper, and with the name “Poet”. I think the problems can’t be solved by twiddling systems weak enough to be passively safe, and hoping their behavior generalizes up to dangerous levels. I don’t think paper solutions will work either, and humanity needs to back off and augment intelligence before proceeding. I do not take the position that we need a global shutdown of this research field because I think that guessing stuff without trying it is easy, but because guessing it even with some safe weak lesser tries is still impossibly hard. My message to humanity is “back off and augment” not “back off and solve it with a clever theory”.
Thank you for the clarification.
How do you expect augmented humanity will solve the problem? Will it be something other than “guessing it with some safe weak lesser tries / clever theory”?
They can solve it however they like, once they’re past the point of expecting things to work that sometimes don’t work. I have guesses but any group that still needs my hints should wait and augment harder.
I think this is somewhat harmful to there being a field of (MIRI-style) Agent Foundations. It seems pretty bad to require that people attempting to start in the field have to work out the foundations themselves, I don’t think any scientific fields have worked this way in the past.
Maybe the view is that if people can’t work out the basics then they won’t be able to make progress, but this doesn’t seem at all clear to me. Many physicists in the 20th century were unable to derive the basics of quantum mechanics or general relativity, but once they were given the foundations they were able to do useful work. I think the skills of working out foundations of a new field can be different than building on those foundations.
Also, maybe these “hints” aren’t that useful and so it’s not worth sharing. Or (more likely in my view) the hints are tied up with dangerous information such that sharing increases risk, and you want to have more signal on someone’s ability to do good work before taking that risk.
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular. If folks still feel that they need low-confidence hints after that then I think they will probably not benefit much from hearing Eliezer’s current guesses, and I suspect may be trying to guess the teacher’s passwords rather than solve the problem.
Is there actually anyone who has read all those words and thus understands how Eliezer would tackle the problem? (Do you, for example?)
This is not a rhetorical question, nor a trivial one; for example, I notice that @Vanessa Kosoy apparently misunderstood some major part of it (as evidenced by this comment thread), and she’s been a MIRI researcher for 8 years (right? or so LinkedIn says, anyhow). So… who does get it?
I think this is a reasonable response, but also, independently of whether Eliezer successfully got the generators of his thought-process across, the volume of words still seems like substantial evidence that it's reasonable for Eliezer to not think that marginally writing more will drastically change things from his perspective.
Sure. My comment was not intended to bear on the question of whether it’s useful for Eliezer to write more words or not—I was only responding directly to Ben.
EDIT: Although, of course, if the already-written words haven’t been effective, then that is also evidence that writing more words won’t help. So, either way, I agree with your view.
Scientific breakthroughs live on the margins, so if he has guesses about how to achieve alignment, sharing them could make a huge difference.
I am a bit unsure of the standard here for "understands how Eliezer would tackle the problem". Is the standard "the underlying philosophy was successfully communicated to lots of people"? If so, I'll note as background that it is pretty standard for such underlying philosophies of problem-solving to be hard to communicate: someone who reads Shannon's or Pearl's writing does not become their equal; someone who read Faraday's notes could not pick up his powerful insights until Maxwell formalized them neatly[1]; and (for picking up philosophies broadly) someone who reads Aurelius or Hume or Mill cannot produce the sorts of works those people would have produced had they been alive for longer or in the present day.
Is the standard "people who read Eliezer's writing go on to do research that Eliezer judges to be remotely on the right track"? Then I think a couple of folks have done stuff here that Eliezer would view as at all on the right track, including lots of MIRI researchers, Wei Dai, John Wentworth, the mesa-optimizers paper, etc. It would take a bit of effort to give an overview of all the folks who took parts of Eliezer's writing and did useful things as a direct result of the specific ideas, and then to assess how successful that counts as.
My current guess is that Eliezer’s research dreams have not been that successfully picked up by others relative to what I would have predicted 10 years ago. I am not confident why this is — perhaps other institutional support has been lacking for people to work on it, perhaps there have been active forces in the direction of “not particularly trying to understand AI” due to the massive success of engineering approaches over scientific approaches, perhaps Eliezer’s particular approach has flaws that make it not very productive.
I think my primary guess for why there has been less progress on Eliezer’s research dreams is that the subject domain is itself very annoying to get in contact with, due to us not having a variety of superintelligences to play with, and my anticipation that when we do get to that stage our lives will soon be over, so it’s much harder to make any progress on these problems than it is in the vast majority of other domains humans have been successful in.
Nate Soares has a blogpost that discusses this point that I found insightful, here’s a quote.
Huh? Isn’t this a question for you, not for me? You wrote this:
So, the standard is… whatever you had in mind there?
Oh. Here’s a compressed gloss of this conversation as I understand it.
Eliezer: People should give up on producing alignment theory directly and instead should build augmented humans to produce aligned agents.
Vanessa: And how do you think the augmented humans will go about doing so?
Eliezer: I mean, the point here is that’s a problem for them to solve. Insofar as they need my help they have not augmented enough. (Ben reads into Eliezer’s comment: And insofar as you wish to elicit my approach, I have already written over a million words on this, a few marginal guesses ain’t gonna do much.)
Peter: It seems wrong to not share your guesses here, people should not have to build the foundations themselves without help from Eliezer.
Ben: Eliezer has spent many years writing up and helping build the foundations, this comment seems very confusing given that context.
Said: But have Eliezer’s writings worked?
Ben: (I’m not quite sure what the implied relationship is of this question to the question of whether Eliezer has tried very hard to help build a foundation for alignment research, but I will answer the question directly.) To the incredibly high standard that ~nobody ever succeeds at, it has not succeeded. To the lower standard that scientists sometimes succeed at, it has worked a little but has not been sufficient.
My first guess was that you’re implicitly asking because, if it has not been successful, then you think Eliezer still ought to answer questions like Vanessa’s. I am not confident in this guess.
If you’re simply asking for clarification on whether Eliezer’s writing works to convey his philosophy and approach to the alignment problem, I have now told you two standards to which I would evaluate this question and how I think it does on those standards. Do you understand why I wrote what I wrote in my reply to Peter?
Uh… I think you’re somewhat overcomplicating things. Again, you wrote (emphasis mine):
In other words, you wrote that if people want X, they can do Y. (Implying that doing Y will, with non-trivial probability, cause people to gain X.[1]) My question was simply whether there exists anyone who has done Y and now, as a consequence, has X.
There are basically two possible answers to this:
“Yes, persons A, B, C have all done Y, and now, as a consequence, have X.”
Or
“No, there exists no such person who has done Y and now, as a consequence, has X.”
(Which may be because nobody has done Y, or it may be because some people have done Y, but have not gained X.)
There’s not any need to bring in questions of standards or trying hard or… anything like that.
So, to substitute the values back in for the variables, I could ask:
Would you say that you “understand how Eliezer would tackle the problem”?
and:
Have you “read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular”?
Likewise, the same question about Vanessa: does she “understand … etc.”, and has she “read the … etc.”?
Again, what I mean by all of these words is just whatever you meant when you wrote them.
"Implying" in the Gricean sense, of course, not in the logical sense. Strictly speaking, if we drop all Gricean assumptions, we could read your original claim as being analogous to, e.g., "if folks want to grow ten feet tall, they can paint their toenails blue"—which is entirely true, and yet does not, strictly speaking, entail any causal claims. I assumed that this kind of thing is not what you meant, because… that would make your comment basically pointless.
Ah, but I think you have assumed slightly too much in what I meant. I simply meant to say that if someone wants Eliezer's help in their goal to "work out the foundations" of "a field of (MIRI-style) Agent Foundations", the better way to understand Eliezer's perspective on these difficult questions is to read the high-effort and thoughtful blog posts and papers he has produced that are intended to communicate his perspective on these questions, rather than ask him for some quick guesses on how one could in principle solve the problems. I did not mean to imply that either way would necessarily work, as the goal itself is hard to achieve; I simply meant that one approach is clearly much more likely than the other to achieve that goal.
(That said, again to answer your question, my current guess is that Nate Soares is an example of a person who read those works and then came to share a lot of Eliezer’s approach to solving the problems. Though I’m honestly not sure how much he would put down to the factors of (a) reading the writing (b) working directly with Eliezer (c) trying to solve the problem himself and coming up with similar approaches. I also think that Wei Dai at least understood enough to make substantial progress on an open research question in decision theory, and similar things can be said of Scott Garrabrant re: logical uncertainty and others at MIRI.)
Fwiw this doesn’t feel like a super helpful comment to me. I think there might be a nearby one that’s more useful, but this felt kinda coy for the sake of being coy.
Yeah, I feel this is quite similar to OpenAI's plan to defer alignment to future AI researchers, except worse. If we grant that the proposed plan actually produces augmented humans stably aligned with our values, then scalable oversight would be far easier, because we have a number of advantages in controlling AIs: for instance, it is socially acceptable to control AI in ways that would not be socially acceptable with humans, and the incentives to control AI are much stronger than the incentives to control humans.
I truly feel like Eliezer has reinvented a plan that OpenAI/Anthropic are already pursuing, except worse: deferring alignment work to future intelligences. Eliezer doesn't seem to realize this, so the comments treat it as though it's something new rather than an existing plan with AI swapped out for humans.
It's not just coy; it's reinventing an idea that's already out there, except worse, without telling you that if you swap the humans for AI, it's already being done.
Link for why AI is easier to control than humans below:
https://optimists.ai/2023/11/28/ai-is-easy-to-control/
fwiw, this seems false to me and not particularly related to what I was saying.
Even a small probability of solving alignment should have big expected utility modulo exfohazard. So why not share your guesses?
I don’t know whether augmentation is the right step after backing off or not, but I do know that the simpler “back off” is a much better message to send to humanity than that. More digestible, more likely to be heard, more likely to be understood, doesn’t cause people to peg you as a rational tech bro, doesn’t at all sound like the beginning of a sci-fi apocalypse plot line. I could go on.
I feel like this "back off and augment" is downstream of an implicit theory of intelligence that is specifically unsuited to dealing with how existing examples of intelligence seem to work. Epistemic status: the idea used to make sense to me and apparently no longer does, in a way that seems related to how I've updated my theories of cognition over the past years.
Very roughly: networking cognitive agents stacks up to cognitive agency at the next level up more easily than expected, and life has evolved to exploit this dynamic from very early on and across scales. It's a gestalt observation and apparently very difficult to articulate into a rational argument. I could point to memory in gene regulatory networks, Michael Levin's work on non-neural cognition, the trainability of computational ecological models (they can apparently be trained to solve sudoku), long-term trends in cultural-cognitive evolution, and theoretical difficulties with traditional models of biological evolution—but I don't know how to make the constellation of data points easily distinguishable from pareidolia.
Is there a reason why post-‘augmented’ individuals would even pay attention to the existing writings/opinions/desires/etc… of anyone, or anything, up to now?
Or is this literally suggesting to leave everything in their future hands?
Yep, this is basically OpenAI’s alignment plan, but worse. IMO I’m pretty bullish on that plan, but yes this is pretty clearly already done, and I’m rather surprised by Eliezer’s comment here.
Augmenting humans to do better alignment research seems like a pretty different proposal to building artificial alignment researchers.
The former is about making (presumed-aligned) humans more intelligent, which is a biology problem, while the latter is about making (presumed-intelligent) AIs aligned, which is a computer science problem.
I think my crux is that if we assume that humans are scalable in intelligence without the assumption that they become misaligned, then it becomes much easier to argue that we’d be able to align AI without having to go through the process, for the reason sketched out by jdp:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC
I think you have a wrong model of the process, which comes from conflating outcome-alignment and intent-alignment. Current LLMs are outcome-aligned, i.e., they produce "good" outputs. But, in the pessimist model, the internal mechanisms by which an LLM produces "good" outputs have nothing in common with "being nice" or "caring about humans" and are more like "producing weird text patterns"; if we make LLMs sufficiently smarter, they turn the world into text patterns or do something else unpredictable. I.e., it's not that the control structures of LLMs are nice right now and stop being nice when we make the LLM smarter; they simply aren't about "being nice" in the first place. On the other hand, humans are at least somewhat intent-aligned, and if we don't apply really radical rearrangements of brain matter, we can expect them to stay intent-aligned.
The ‘message’ surprised me since it seems to run counter to the whole point of LW.
That non-super-geniuses, mostly just moderately-above-average folks, can participate and have some chance of producing genuinely novel insights that future people will actually care to remember, based on the principle of the supposed wisdom of the userbase "masses" rubbing their ideas together enough times.
Plus a few just-merely-geniuses shepherding them.
But if this method can’t produce any meaningful results in the long term...
OpenAI never advocated for the aforementioned, so it isn't as surprising if they adopt the everything-hinges-on-the-future-ubermensch plan.
Maybe. But it wouldn’t make sense to judge an approach to a technical problem, alignment, based on what philosophy it was produced with. If we tried that philosophy and it didn’t work, that’s a reasonable thing to say and advocate for.
I don’t think Eliezer’s reasoning for that conclusion is nearly adequate, and we still have almost no idea how hard alignment is, because the conversation has broken down.
Would you say the point of MIRI was/is to create theory that would later lead to safe experiments (but that it hasn’t happened yet)? Sort of like how the Manhattan project discovered enough physics to not nuke themselves, and then started experimenting? 🤔
I think two issues here should be discussed separately:
technical feasibility
whether this or that route to intelligence augmentation should be actually undertaken
I suspect that intelligence augmentation is much more feasible in the short term than people usually assume. Namely, I think that enabling people to tightly couple themselves with specialized electronic devices via high-end non-invasive BCI is likely to do a lot in this sense.
This route should be much safer, much quicker, and much cheaper than a Neuralink-like approach, and I think it can still do a lot.
Even though the approach with non-invasive BCI is much safer than Neuralink, the risks on the personal level are nevertheless formidable.
On the social level, we don’t really know if the resulting augmented humans/hybrid human-electronic entities will be “Friendly”.
So, should we try it? My personal AI timelines are rather short, and existing risks are formidable...
So, I would advocate organizing an exploratory project of this kind to see if this is indeed technically feasible on a short-term time scale (my expectation is that a small group can obtain measurable progress here within months, not years), and ponder various safety issues deeper before scaling it or before sharing the obtained technological advances in a more public fashion...
Have you written more about why you think this is (to quote you) much more feasible in short-term than people usually assume / can you point me to writeups by others in this regard?
Yes, I think there are four components here:
how good is non-invasive reading from the brain
how good is non-invasive writing into the brain or brain modulation, especially when assisted by feedback from the reading
what are the risks, and how manageable are they
what are the possible set-ups to use this (ranging from relatively softcore set-ups like electronic versions of nootropics, stimulants, and psychedelics, to more hardcore setups like tightly integrated information processing by a biological brain and an electronic device together)
Starting from the question of possible set-ups: I have been thinking about this on and off since the peak of my "rave and psytrance days," which was long ago. I wrote a possible design spec about this 10-12 years ago. It is, of course, one of many possible designs, and I am sure other designs of this kind exist, but it is one example of how this could be done: https://github.com/anhinga/2021-notes/tree/main/mind-games
The most crucial question is how good non-invasive reading from the brain is. We have seen rapid progress here in the last few years. I noticed the first promising report in 2019 and made a note of it at the bottom of mind-games/post-2-measuring-conscious-state.md, but these days we are inundated with reports of progress and successes in non-invasive reading from the brain via various channels. The question is now less "can we do a lot with something as simple as a high-end EEG" (yes, we can) and more "can we do enough with a super-convenient low-end consumer-grade headband EEG or in-the-ear EEG, so that it's not just non-invasive but actually does not interfere with convenience".
So, in the sense of non-invasive reading, there are reasons for optimism.
With writing and modulation, audio-visual channels are not just information-carrying but very psychoactive. For example, following MIT reports on the curative properties of 40 Hz audio-visual stimulation, I self-experimented with 40 Hz sound (mostly in the form of a 40 Hz sine-wave test tone) and found it strongly stimulating and also acoustically priming.
In this sense, the information is still going into the brain in the ordinary fashion via the audio-visual channel, but there is also strong neuromodulation. And if one simultaneously reads from the brain, one has real-time feedback and can tune the impact a lot (though this is associated with potentially increased risks). I wrote more about this in mind-games/post-4-closing-the-loop.md
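The closed-loop idea here reduces to a simple cycle: read a signal, estimate the power in the band of interest, and nudge the stimulation accordingly. Below is a toy sketch of that loop; the function names, signal parameters, and target values are all invented for illustration and are not from the mind-games notes. A real system would need proper filtering, artifact rejection, and hard safety limits.

```python
import numpy as np

def band_power(signal, fs, f_lo, f_hi):
    """Power of `signal` within [f_lo, f_hi] Hz via a simple periodogram."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[mask].sum()

def adjust_gain(gain, measured, target, step=0.05):
    """One step of the closed loop: nudge stimulation gain toward the
    target band power, clamped to a hard-capped range for safety."""
    if measured < target:
        gain += step
    else:
        gain -= step
    return float(np.clip(gain, 0.0, 1.0))

# Simulated 1-second EEG trace: a 40 Hz entrained component plus noise.
fs = 256
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
eeg = 0.5 * np.sin(2 * np.pi * 40 * t) + 0.1 * rng.standard_normal(fs)

p40 = band_power(eeg, fs, 38, 42)        # estimate 40 Hz entrainment
gain = adjust_gain(0.5, p40, target=10.0)  # nudge the stimulus level
```

The clamp in `adjust_gain` stands in for the safety protocols discussed above: the loop must never be free to escalate the stimulus without bound.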
People have also explored transcranial magnetic stimulation, transcranial direct current stimulation, and especially transcranial ultrasound in recent years. On transcranial ultrasound, there is a long series of posts between Oct 6 and Dec 11 on this substack, covering both its potential and its risks:
https://sarahconstantin.substack.com/p/ultrasound-neuromodulation (Oct 6)
a lot of posts on this topic in between
https://sarahconstantin.substack.com/p/risks-of-ultrasound-neuromodulation (Dec 11)
Now, it’s a good point to move to risks.
It's really easy to cause full-blown seizures just by being a bit too aggressive with strobe lights (I did it to myself once, many years ago, with a light-and-sound machine (inexpensive eyeglasses with flashing lights) by disobeying the instructions to keep my eyes closed, because it was so boring to keep them closed, and the visuals were so pretty when the eyes were open, and getting prettier every few seconds, and even prettier, and then … you know).
When dealing with a closed feedback loop the risks are very formidable (even if there is no “AI” on the electronic side of things, and there might be one). I start to discuss the appropriate safety protocols in mind-games/post-4-closing-the-loop.md
The post on risks of transcranial ultrasound by Sarah Constantin does make me quite apprehensive (e.g. there is a company named Prophetic AI, which hopes to have an EEG/transcranial ultrasound headband stabilizing lucid dreams, and I am sure it’s doable, but am I comfortable with the risks here? It’s a strange situation where the risks are “officially low”, but it’s less clear whether they are all that low in reality).
So, yes, if we can navigate the risks, I think that our capabilities to read from the brain are now very powerful (and we can read from the body, polygraph-style too, but particularly minding the risk of the feedback situation here), and we can achieve at minimum pretty effective cognitive modulation via audio-visual channels, and probably much more...
Of course, people will try to have narrowly crafted AIs on the electronic side (these days the thought is rather obvious), so if one pushes harder one can really optimize a joint cognitive process by a human and a narrow AI interacting with each other, but can this be done in a safe and beneficial manner?
Basically, the non-invasiveness of these interfaces should not lull us into a false sense of safety in these kinds of experiments, and I think one needs to keep a laser-sharp focus on risk management; but other than that, from a purely technical viewpoint, the pieces seem to be ready.
The question of whether there is a jump specifically at the autonomous research threshold (let’s call that “AGI”) is muddled by the discussion of what happens prior to that threshold. The reasons for the jump there in particular are very different from reasons for jumps elsewhere, and it doesn’t seem relevant to discuss presence or absence of such jumps elsewhere in connection to the jump at this particular threshold.
I expect gradual improvement all the way to AGI, then technical feasibility of a jump from that particular level to superintelligence in a matter of months, if the AGI is allowed to do its thing. But the reasons for expecting gradual improvement prior to AGI and expecting a jump after AGI seem unrelated. There are convergent scaling laws that different architectures seem to share in quantitative detail, always constrained in practical application by slowly changing available hardware and investment, thus sudden jumps are unlikely for long stretches of time, possibly up to AGI. And then there is serial speed advantage of AIs that accelerates technological history across the board, which doesn’t influence progress prior to AIs becoming autonomously competent at research, but then suddenly gets to influence it, making use of existing hardware/investment more efficiently to extract much more competence out of it.
Figuring out how to generate much higher quality general data (as in RL and self-play) is a wildcard that might disrupt gradual improvement before AGI, but then at this point it’s probably also sufficient to reach AGI, given how capable existing systems are that only use natural data. So the distinction is mostly in difficulty of stopping at a system capable of autonomous research but not yet significantly more competent than humans, which is important for plans that want to bootstrap defense against misaligned AI. It’s still gradual predictable improvement followed by a jump to superintelligence (if this particular jump is not interrupted).
Scaling laws are an important phenomenon and probably deeply tied to the nature of intelligence.
I do take issue with the assertion that scaling laws imply slow takeoff. One key takeaway of the modern ML revolution is that the specific details of architectures-in-the-narrow-sense* are mostly not that important, and compute and data dominate.
The natural implication is that scaling laws are a function of the data distribution, and mostly not of the architecture. Just because we see a "smooth, slow" scaling law on text data doesn't mean this will generalize to other domains/situations/horizons. In fact, I think we should mostly expect it not to.
*I think the jump from "architectures-in-the-narrow-sense don't matter" to "architectures-in-the-broad-sense don't matter" is often made. I think this is obviously not supported by the evidence we have so far (despite many claims to the contrary) and likely wrong.
Even architectures-in-the-narrow-sense don’t show overarching scaling laws at current scales, right? IIRC the separate curves for MLPs, LSTMs and transformers do not currently match up into one larger curve. See e.g. figure 7 here.
So a sudden capability jump due to a new architecture outperforming transformers the way transformers outperform MLPs at equal compute cost seems to be very much in the cards?
I intuitively agree that current scaling laws seem like they might be related in some way to a deep bound on how much you can do with a given amount of data and compute, since different architectures do show qualitatively similar behavior even if the y-axes don’t match up. But I see nothing to suggest that any current architectures are actually operating anywhere close to that bound.
Is it true that scaling laws are independent of architecture? I don’t know much about scaling laws but that seems surely wrong to me.
e.g. how does RNN scaling compare to transformer scaling
The relevant laws describe how perplexity determines compute and data needed to get it by a training run that tries to use as little compute as possible and is otherwise unconstrained on data. The claim is this differs surprisingly little across different architectures. This is different from what historical trends in algorithmic progress measure, since those results are mostly not unconstrained on data (which also needs to be from sufficiently similar distributions to compare architectures), and fail to get through the initial stretch of questionable scaling at low compute.
It's still probably mostly selection effect, but see Mamba's scaling laws (Figure 4 in the paper), where the dependence of FLOPs on perplexity only ranges about 6x across GPT-3, LLaMA, Mamba, Hyena, and RWKV. Also, the graphs for different architectures tend not to intersect, suggesting a "compute multiplier" property: how efficient one architecture is, across a wide range of compute, compared to another. The question is whether any of these compute multipliers significantly change at greater scale, once you clear the first 1e20 FLOPs or so.
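The "compute multiplier" notion can be made concrete with a small sketch: fit power laws of the form loss ≈ a · C^(−b) for two architectures in log-log space, then ask how much compute the reference architecture needs to match the other's loss. The coefficients below are invented for illustration and are not taken from the Mamba paper or any other source.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # returns (a, b)

def compute_multiplier(a_ref, b_ref, a_new, b_new, compute):
    """Ratio of compute the reference architecture needs to match the new
    architecture's loss at `compute` FLOPs. With equal exponents this is a
    constant; with unequal exponents it drifts with scale."""
    loss_new = a_new * compute ** (-b_new)
    return (loss_new / a_ref) ** (-1.0 / b_ref) / compute

# Illustrative (made-up) coefficients: two architectures with the same
# shallow exponent but slightly different constants.
compute = np.logspace(20, 24, 50)
a_ref, b_ref = 5.0, 0.05             # "reference" architecture
loss_new = 4.5 * compute ** -0.05    # "new" architecture's observed losses

a_new, b_new = fit_power_law(compute, loss_new)
m = compute_multiplier(a_ref, b_ref, a_new, b_new, 1e22)
```

Note how a mere 10% difference in the loss coefficient turns into a large compute multiplier because the exponent is shallow; this is why small, persistent gaps between architecture curves can matter a lot in practice.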
Hence generation of higher quality data is a plausible way of disrupting the way scaling laws govern slow takeoff. What this data needs to provide is general cognitive competence that therefore applies to the physical world, but that competence doesn’t need to involve initial familiarity with the human world.
So it could be formal proofs on a reasonable distribution of topics, or a superscaled RL system in an environment that sufficiently elicits general reasoning. If the backbone of a dataset shapes representations towards competence, it might transfer to other areas. Thus we get an alien mind that mostly uses natural data as a tool to speak good English and anticipate popular opinion, not as the essential fabric of its own nature.
In the current not-knowing-what-we-are-doing regime, I’m guessing the safer AGIs are scaffolded natural data LLMs, or failing that model-based RL systems that develop in contact with the human world or data. Model-free RL that relies on a synthetic environment to generate enough data risks growing up more alien. Less clear with reasoning that originates in synthetic data for math, grounded in the physical world through natural data being a fraction of datasets for all models in the system (as a kind of multimodality). Such admixing of natural data might even be sufficient to make a model-free RL system less alien.
On this subject, here is my 2 hours long presentation (in 3 parts), going over just about every paragraph in Paul Christiano’s “Where I agree and disagree with Eliezer”:
https://youtu.be/V8R0s8tesM0?si=qrSJP3V_WnoBptkL
https://youtu.be/a2qTNuD1Sn8?si=YHyCr8AC0HkEnN4J
https://youtu.be/8XWbPDvKgM0?si=SvLfL4bhHDO6zDBu
I’m really glad you wrote this!
I think you address an important distinction there, but I think there might be a further one to be made- in that how we measure/tell if a model is aligned in the first place.
There seems to be a growing voice which says that if a model’s output seems to be the output we might expect from an aligned AI, then it’s aligned.
I think it's important to distinguish that from the idea that the model is aligned only if you actually have a strong idea of what its values are, how it got them, etc.