Ngo and Yudkowsky on alignment difficulty

This post is the first in a series of transcribed Discord conversations between Richard Ngo and Eliezer Yudkowsky, moderated by Nate Soares. We’ve also added Richard and Nate’s running summaries of the conversation (and others’ replies) from Google Docs.

Later conversation participants include Ajeya Cotra, Beth Barnes, Carl Shulman, Holden Karnofsky, Jaan Tallinn, Paul Christiano, Rob Bensinger, and Rohin Shah.

The transcripts are a complete record of several Discord channels MIRI made for discussion. We tried to edit the transcripts as little as possible, other than to fix typos and a handful of confusingly-worded sentences, to add some paragraph breaks, and to add referenced figures and links. We didn’t end up redacting any substantive content, other than the names of people who would prefer not to be cited. We swapped the order of some chat messages for clarity and conversational flow (indicated with extra timestamps), and in some cases combined logs where the conversation switched channels.

Color key:

Chat by Richard and Eliezer | Other chat | Google Doc content | Inline comments

0. Prefatory comments

[Yudkowsky][8:32] (Nov. 6 follow-up comment)

(At Rob’s request I’ll try to keep this brief, but this was an experimental format and some issues cropped up that seem large enough to deserve notes.)

Especially when coming in to the early parts of this dialogue, I had some backed-up hypotheses about “What might be the main sticking point? and how can I address that?” which from the standpoint of a pure dialogue might seem to be causing me to go on digressions, relative to if I was just trying to answer Richard’s own questions. On reading the dialogue, I notice that this looks evasive or like point-missing, like I’m weirdly not just directly answering Richard’s questions.

Often the questions are answered later, or at least I think they are, though it may not be in the first segment of the dialogue. But the larger phenomenon is that I came in with some things I wanted to say, and Richard came in asking questions, and there was a minor accidental mismatch there. It would have looked better if we’d both stated positions first without question marks, say, or if I’d just confined myself to answering questions from Richard. (This is not a huge catastrophe, but it’s something for the reader to keep in mind as a minor hiccup that showed up in the early parts of experimenting with this new format.)

[Yudkowsky][8:32] (Nov. 6 follow-up comment)

(Prompted by some later stumbles in attempts to summarize this dialogue. Summaries seem plausibly a major mode of propagation for a sprawling dialogue like this, and the following request seems like it needs to be very prominent to work—embedded requests later on didn’t work.)

Please don’t summarize this dialogue by saying, “and so Eliezer’s MAIN idea is that” or “and then Eliezer thinks THE KEY POINT is that” or “the PRIMARY argument is that” etcetera. From my perspective, everybody comes in with a different set of sticking points versus things they see as obvious, and the conversation I have changes drastically depending on that. In the old days this used to be the Orthogonality Thesis, Instrumental Convergence, and superintelligence being a possible thing at all; today most OpenPhil-adjacent folks have other sticking points instead.

Please transform:

  • “Eliezer’s main reply is...” → “Eliezer replied that...”

  • “Eliezer thinks the key point is...” → “Eliezer’s point in response was...”

  • “Eliezer thinks a major issue is...” → “Eliezer replied that one issue is...”

  • “Eliezer’s primary argument against this is...” → “Eliezer tried the counterargument that...”

  • “Eliezer’s main scenario for this is...” → “In a conversation in September of 2021, Eliezer sketched a hypothetical where...”

Note also that the transformed statements say what you observed, whereas the untransformed statements are (often incorrect) inferences about my latent state of mind.

(Though “distinguishing relatively unreliable inference from more reliable observation” is not necessarily the key idea here or the one big reason I’m asking for this. That’s just one point I tried making—one argument that I hope might help drive home the larger thesis.)

1. September 5 conversation

1.1. Deep vs. shallow problem-solving patterns

[Ngo][11:00]

Hi all! Looking forward to the discussion.

[Yudkowsky][11:01]

Hi and welcome all. My name is Eliezer and I think alignment is really actually quite extremely difficult. Some people seem to not think this! It’s an important issue so ought to be resolved somehow, which we can hopefully fully do today. (I will however want to take a break after the first 90 minutes, if it goes that far and if Ngo is in sleep-cycle shape to continue past that.)

[Ngo][11:02]

A break in 90 minutes or so sounds good.

Here’s one way to kick things off: I agree that humans trying to align arbitrarily capable AIs seems very difficult. One reason that I’m more optimistic (or at least, not confident that we’ll have to face the full very difficult version of the problem) is that at a certain point AIs will be doing most of the work.

When you talk about alignment being difficult, what types of AIs are you thinking about aligning?

[Yudkowsky][11:04]

On my model of the Other Person, a lot of times when somebody thinks alignment shouldn’t be that hard, they think there’s some particular thing you can do to align an AGI, which isn’t that hard, and their model is missing one of the foundational difficulties for why you can’t do (easily or at all) one step of their procedure. So one of my own conversational processes might be to poke around looking for a step that the other person doesn’t realize is hard. That said, I’ll try to directly answer your own question first.

[Ngo][11:07]

I don’t think I’m confident that there’s any particular thing you can do to align an AGI. Instead I feel fairly uncertain over a broad range of possibilities for how hard the problem turns out to be.

And on some of the most important variables, it seems like evidence from the last decade pushes towards updating that the problem will be easier.

[Yudkowsky][11:09]

I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up. This requires that the first actor or actors to build AGI, be able to do something with that AGI which prevents the world from being destroyed; if it didn’t require superintelligence, we could go do that thing right now, but no such human-doable act apparently exists so far as I can tell.

So we want the least dangerous, most easily aligned thing-to-do-with-an-AGI, but it does have to be a pretty powerful act to prevent the automatic destruction of Earth after 3 months or 2 years. It has to “flip the gameboard” rather than letting the suicidal game play out. We need to align the AGI that performs this pivotal act, to perform that pivotal act without killing everybody.

Parenthetically, no act powerful enough and gameboard-flipping enough to qualify is inside the Overton Window of politics, or possibly even of effective altruism, which presents a separate social problem. I usually dodge around this problem by picking an exemplar act which is powerful enough to actually flip the gameboard, but not the most alignable act because it would require way too many aligned details: Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.

Since any such nanosystems would have to operate in the full open world containing lots of complicated details, this would require tons and tons of alignment work, is not the pivotal act easiest to align, and we should do some other thing instead. But the other thing I have in mind is also outside the Overton Window, just like this is. So I use “melt all GPUs” to talk about the requisite power level and the Overton Window problem level, both of which seem around the right levels to me, but the actual thing I have in mind is more alignable; and this way, I can reply to anyone who says “How dare you?!” by saying “Don’t worry, I don’t actually plan on doing that.”

[Ngo][11:14]

One way that we could take this discussion is by discussing the pivotal act “make progress on the alignment problem faster than humans can”.

[Yudkowsky][11:15]

This sounds to me like it requires extreme levels of alignment and operating in extremely dangerous regimes, such that, if you could do that, it would seem much more sensible to do some other pivotal act first, using a lower level of alignment tech.

[Ngo][11:16]

Okay, this seems like a crux on my end.

[Yudkowsky][11:16]

In particular, I would hope that—in unlikely cases where we survive at all—we were able to survive by operating a superintelligence only in the lethally dangerous, but still less dangerous, regime of “engineering nanosystems”.

Whereas “solve alignment for us” seems to require operating in the even more dangerous regimes of “write AI code for us” and “model human psychology in tremendous detail”.

[Ngo][11:17]

What makes these regimes so dangerous? Is it that it’s very hard for humans to exercise oversight?

One thing that makes these regimes seem less dangerous to me is that they’re broadly in the domain of “solving intellectual problems” rather than “achieving outcomes in the world”.

[Yudkowsky][11:19][11:21]

Every AI output effectuates outcomes in the world. If you have a powerful unaligned mind hooked up to outputs that can start causal chains that effectuate dangerous things, it doesn’t matter whether the comments on the code say “intellectual problems” or not.

The danger of “solving an intellectual problem” is when it requires a powerful mind to think about domains that, when solved, render very cognitively accessible strategies that can do dangerous things.

I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% “don’t think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs” and 2% “actually think about this dangerous topic but please don’t come up with a strategy inside it that kills us”.

[Ngo][11:21][11:22]

Let me try and be more precise about the distinction. It seems to me that systems which have been primarily trained to make predictions about the world would by default lack a lot of the cognitive machinery which humans use to take actions which pursue our goals.

Perhaps another way of phrasing my point is something like: it doesn’t seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic.

Is this a crux for you?

(obviously “agentic” is quite underspecified here, so maybe it’d be useful to dig into that first)

[Yudkowsky][11:27][11:33]

I would certainly have learned very new and very exciting facts about intelligence, facts which indeed contradict my present model of how intelligences liable to be discovered by present research paradigms work, if you showed me… how can I put this in a properly general way… that problems I thought were about searching for states that get fed into a result function and then a result-scoring function, such that the input gets an output with a high score, were in fact not about search problems like that. I have sometimes given more specific names to this problem setup, but I think people have become confused by the terms I usually use, which is why I’m dancing around them.

In particular, just as I have a model of the Other Person’s Beliefs in which they think alignment is easy because they don’t know about difficulties I see as very deep and fundamental and hard to avoid, I also have a model in which people think “why not just build an AI which does X but not Y?” because they don’t realize what X and Y have in common, which is something that draws deeply on having deep models of intelligence. And it is hard to convey this deep theoretical grasp.

But you can also see powerful practical hints that these things are much more correlated than, eg, Robin Hanson was imagining during the FOOM debate, because Robin did not think something like GPT-3 should exist; Robin thought you should need to train lots of specific domains that didn’t generalize. I argued then with Robin that it was something of a hint that humans had visual cortex and cerebellar cortex but not Car Design Cortex, in order to design cars. Then in real life, it proved that reality was far to the Eliezer side of Eliezer on the Eliezer-Robin axis, and things like GPT-3 were built with less architectural complexity and generalized more than I was arguing to Robin that complex architectures should generalize over domains.

The metaphor I sometimes use is that it is very hard to build a system that drives cars painted red, but is not at all adjacent to a system that could, with a few alterations, prove to be very good at driving a car painted blue. The “drive a red car” problem and the “drive a blue car” problem have too much in common. You can maybe ask, “Align a system so that it has the capability to drive red cars, but refuses to drive blue cars.” You can’t make a system that is very good at driving red-painted cars, but lacks the basic capability to drive blue-painted cars because you never trained it on that. The patterns found by gradient descent, by genetic algorithms, or by other plausible methods of optimization, for driving red cars, would be patterns very close to the ones needed to drive blue cars. When you optimize for red cars you get the blue car capability whether you like it or not.
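An illustrative sketch, not part of the original conversation: the problem setup described earlier in this message (searching over states that are fed into a result function and then a result-scoring function) can be written in a few lines of Python. Every name here (`search`, `speed_from_throttle`, `score_near`, the target speeds) is hypothetical, and the "dynamics" are made up. The toy only shows that the search routine is indifferent to which scoring function it is handed; it says nothing about what gradient descent actually learns.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Result = TypeVar("Result")


def search(
    candidates: Iterable[State],
    result_fn: Callable[[State], Result],
    score_fn: Callable[[Result], float],
) -> State:
    """Return the candidate state whose result gets the highest score."""
    return max(candidates, key=lambda s: score_fn(result_fn(s)))


def speed_from_throttle(throttle: int) -> int:
    """Made-up result function: throttle setting -> resulting speed."""
    return 3 * throttle


def score_near(target_speed: int) -> Callable[[int], int]:
    """Build a scoring function that prefers speeds close to target_speed."""
    return lambda speed: -abs(speed - target_speed)


throttles = range(21)  # candidate states 0..20

# Same search machinery, two different scoring functions.
best_for_30 = search(throttles, speed_from_throttle, score_near(30))
best_for_45 = search(throttles, speed_from_throttle, score_near(45))
```

Swapping `score_near(30)` for `score_near(45)` changes which state is selected without touching `search` at all, which is a loose analogy to the red-car and blue-car problems sharing most of their machinery.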

[Ngo][11:32]

Does your model of intelligence rule out building AIs which make dramatic progress in mathematics without killing us all?

[Yudkowsky][11:34][11:39]

If it were possible to perform some pivotal act that saved the world with an AI that just made progress on proving mathematical theorems, without, eg, needing to explain those theorems to humans, I’d be extremely interested in that as a potential pivotal act. We wouldn’t be out of the woods, and I wouldn’t actually know how to build an AI like that without killing everybody, but it would immediately trump everything else as the obvious line of research to pursue.

Parenthetically, there is very very little which my model of intelligence rules out. I think we all die because we cannot do certain dangerous things correctly, on the very first try in the dangerous regimes where one mistake kills you, and do them before proliferation of much easier technologies kills us. If you have the Textbook From 100 Years In The Future that gives the simple robust solutions for everything, that actually work, you can write a superintelligence that thinks 2 + 2 = 5 because the Textbook gives the methods for doing that which are simple and actually work in practice in real life.

(The Textbook has the equivalent of “use ReLUs instead of sigmoids” everywhere, and avoids all the clever-sounding things that will work at subhuman levels and blow up when you run them at superintelligent levels.)

[Ngo][11:36][11:40]

Hmm, so suppose we train an AI to prove mathematical theorems when given them, perhaps via some sort of adversarial setter-solver training process.

By default I have the intuition that this AI could become extremely good at proving theorems—far beyond human level—without having goals about real-world outcomes.

It seems to me that in your model of intelligence, being able to do tasks like mathematics is closely coupled with trying to achieve real-world outcomes. But I’d actually take GPT-3 as some evidence against this position (although still evidence in favour of your position over Hanson’s), since it seems able to do a bunch of reasoning tasks while still not being very agentic.

There’s some alternative world where we weren’t able to train language models to do reasoning tasks without first training them to perform tasks in complex RL environments, and in that world I’d be significantly less optimistic.

[Yudkowsky][11:41]

I put to you that there is a predictable bias in your estimates, where you don’t know about the Deep Stuff that is required to prove theorems, so you imagine that certain cognitive capabilities are more disjoint than they actually are. If you knew about the things that humans are using to reuse their reasoning about chipped handaxes and other humans, to prove math theorems, you would see it as more plausible that proving math theorems would generalize to chipping handaxes and manipulating humans.

GPT-3 is a… complicated story, on my view of it and intelligence. We’re looking at an interaction between tons and tons of memorized shallow patterns. GPT-3 is very unlike the way that natural selection built humans.

[Ngo][11:44]

I agree with that last point. But this is also one of the reasons that I previously claimed that AIs could be more intelligent than humans while being less agentic, because there are systematic differences between the way in which natural selection built humans, and the way in which we’ll train AGIs.

[Yudkowsky][11:45]

My current suspicion is that Stack More Layers alone is not going to take us to GPT-6 which is a true AGI; and this is because of the way that GPT-3 is, in your own terminology, “not agentic”, and which is, in my terminology, not having gradient descent on GPT-3 run across sufficiently deep problem-solving patterns.

[Ngo][11:46]

Okay, that helps me understand your position better.

So here’s one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn’t been so tight, then individual humans would have been significantly less capable of performing novel tasks.

[Yudkowsky][11:50]

I agree.

[Ngo][11:50]

In my terminology, this is a reason that humans are “more agentic” than we otherwise would have been.

[Yudkowsky][11:50]

This seems indisputable.

[Ngo][11:51]

Another important difference: humans were trained in environments where we had to run around surviving all day, rather than solving maths problems etc.

[Yudkowsky][11:51]

I continue to nod.

[Ngo][11:52]

Supposing I agree that reaching a certain level of intelligence will require AIs with the “deep problem-solving patterns” you talk about, which lead AIs to try to achieve real-world goals. It still seems to me that there’s likely a lot of space between that level of intelligence, and human intelligence.

And if that’s the case, then we could build AIs which help us solve the alignment problem before we build AIs which instantiate sufficiently deep problem-solving patterns that they decide to take over the world.

Nor does it seem like the reason humans want to take over the world is because of a deep fact about our intelligence. It seems to me that humans want to take over the world mainly because that’s very similar to things we evolved to do (like taking over our tribe).

[Yudkowsky][11:57]

So here’s the part that I agree with: If there were one theorem only mildly far out of human reach, like proving the ABC Conjecture (if you think it hasn’t already been proven), and providing a machine-readable proof of this theorem would immediately save the world—say, aliens will give us an aligned superintelligence, as soon as we provide them with this machine-readable proof—then there would exist a plausible though not certain road to saving the world, which would be to try to build a shallow mind that proved the ABC Conjecture by memorizing tons of relatively shallow patterns for mathematical proofs learned through self-play; without that system ever abstracting math as deeply as humans do, but the sheer width of memory and sheer depth of search sufficing to do the job. I am not sure, to be clear, that this would work. But my model of intelligence does not rule it out.

[Ngo][11:58]

(I’m actually thinking of a mind which understands maths more deeply than humans—but perhaps only understands maths, or perhaps also a range of other sciences better than humans.)

[Yudkowsky][12:00]

Parts I disagree with: That “help us solve alignment” bears any significant overlap with “provide us a machine-readable proof of the ABC Conjecture without thinking too deeply about it”. That humans want to take over the world only because it resembles things we evolved to do.

[Ngo][12:01]

I definitely agree that humans don’t only want to take over the world because it resembles things we evolved to do.

[Yudkowsky][12:02]

Alas, eliminating 5 reasons why something would go wrong doesn’t help much if there’s 2 remaining reasons something would go wrong that are much harder to eliminate!

[Ngo][12:02]

But if we imagine having a human-level intelligence which hadn’t evolved primarily to do things that reasonably closely resembled taking over the world, then I expect that we could ask that intelligence questions in a fairly safe way.

And that’s also true for an intelligence that is noticeably above human level.

So one question is: how far above human level could we get before a system which has only been trained to do things like answer questions and understand the world will decide to take over the world?

[Yudkowsky][12:04]

I think this is one of the very rare cases where the intelligence difference between “village idiot” and “Einstein”, which I’d usually see as very narrow, makes a structural difference! I think you can get some outputs from a village-idiot-level AGI, which got there by training on domains exclusively like math, and this will proooobably not destroy the world (if you were right about that, about what was going on inside). I have more concern about the Einstein level.

[Ngo][12:05]

Let’s focus on the Einstein level then.

Human brains have been optimised very little for doing science.

This suggests that building an AI which is Einstein-level at doing science is significantly easier than building an AI which is Einstein-level at taking over the world (or other things which humans evolved to do).

[Yudkowsky][12:08]

I think there’s a certain broad sense in which I agree with the literal truth of what you just said. You will systematically overestimate how much easier, or how far you can push the science part without getting the taking-over-the-world part, for as long as your model is ignorant of what they have in common.

[Ngo][12:08]

Maybe this is a good time to dig into the details of what they have in common, then.

[Yudkowsky][12:09][12:11][12:13]

I feel like I haven’t had much luck with trying to explain that on previous occasions. Not to you, to others too.

There are shallow topics like why p-zombies can’t be real and how quantum mechanics works and why science ought to be using likelihood functions instead of p-values, and I can barely explain those to some people, but then there are some things that are apparently much harder to explain than that and which defeat my abilities as an explainer.

That’s why I’ve been trying to point out that, even if you don’t know the specifics, there’s an estimation bias that you can realize should exist in principle.

Of course, I also haven’t had much luck in saying to people, “Well, even if you don’t know the truth about X that would let you see Y, can you not see by abstract reasoning that knowing any truth about X would predictably cause you to update in the direction of Y”—people don’t seem to actually internalize that much either. Not you, other discussions.

[Ngo][12:10][12:11][12:13]

Makes sense. Are there ways that I could try to make this easier? E.g. I could do my best to explain what I think your position is.

Given what you’ve said I’m not optimistic about this helping much.

But insofar as this is the key set of intuitions which has been informing your responses, it seems worth a shot.

Another approach would be to focus on our predictions for how AI capabilities will play out over the next few years.

I take your point about my estimation bias. To me it feels like there’s also a bias going the other way, which is that as long as we don’t know the mechanisms by which different human capabilities work, we’ll tend to lump them together as one thing.

[Yudkowsky][12:14]

Yup. If you didn’t know about visual cortex and auditory cortex, or about eyes and ears, you would assume much more that any sentience ought to both see and hear.

[Ngo][12:16]

So then my position is something like: human pursuit of goals is driven by emotions and reward signals which are deeply evolutionarily ingrained, and without those we’d be much safer but not that much worse at pattern recognition.

[Yudkowsky][12:17]

If there’s a pivotal act you can get just by supreme acts of pattern recognition, that’s right up there with “pivotal act composed solely of math” for things that would obviously instantly become the prime direction of research.

[Ngo][12:18]

To me it seems like maths is much more about pattern recognition than, say, being a CEO. Being a CEO requires coherence over long periods of time; long-term memory; motivation; metacognition; etc.

[Yudkowsky][12:18][12:23]

(One occasionally-argued line of research can be summarized from a certain standpoint as “how about a pivotal act composed entirely of predicting text” and to this my reply is “you’re trying to get fully general AGI capabilities by predicting text that is about deep / ‘agentic’ reasoning, and that doesn’t actually help”.)

Human math is very much about goals. People want to prove subtheorems on the way to proving theorems. We might be able to make a different kind of mathematician that works more like GPT-3 in the dangerously inscrutable parts that are all noninspectable vectors of floating-point numbers, but even there you’d need some Alpha-Zero-like outer framework to supply the direction of search.

That outer framework might be able to be powerful enough without being reflective, though. So it would plausibly be much easier to build a mathematician that was capable of superhuman formal theorem-proving but not agentic. The reality of the world might tell us “lolnope” but my model of intelligence doesn’t mandate that. That’s why, if you gave me a pivotal act composed entirely of “output a machine-readable proof of this theorem and the world is saved”, I would pivot there! It actually does seem like it would be a lot easier!
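An illustrative sketch, not part of the original conversation: the division of labor described above, where a pattern-recognition component scores positions and a separate outer framework supplies the direction of search, can be shown with a toy greedy best-first search. This is a much simpler technique than AlphaZero's Monte Carlo tree search, and the hand-written `heuristic` below is only a stand-in for a learned network. The integer "proof states", the `moves` function, and all names are assumptions made for the example.

```python
import heapq
from typing import Callable, Dict, List, Optional


def best_first_search(
    start: int,
    goal: int,
    moves: Callable[[int], List[int]],
    heuristic: Callable[[int], float],
    max_expansions: int = 10_000,
) -> Optional[List[int]]:
    """Outer search loop: always expand the state the heuristic likes best."""
    frontier = [(heuristic(start), start)]
    parent: Dict[int, Optional[int]] = {start: None}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, state = heapq.heappop(frontier)
        expansions += 1
        if state == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            node: Optional[int] = state
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in moves(state):
            if nxt not in parent:  # skip states we have already reached
                parent[nxt] = state
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return None  # gave up within the expansion budget


GOAL = 37


def moves(n: int) -> List[int]:
    """Toy 'inference rules': add one, or double."""
    return [n + 1, n * 2]


def heuristic(n: int) -> float:
    """Stand-in for a learned scorer: distance from the goal."""
    return abs(GOAL - n)


path = best_first_search(1, GOAL, moves, heuristic)
```

The search loop knows nothing about the domain beyond `moves` and `heuristic`, so the scoring component can be replaced without changing the loop.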

[Ngo][12:21][12:25]

Okay, so if I attempt to rephrase your argument:

Your position: There’s a set of fundamental similarities between tasks like doing maths, doing alignment research, and taking over the world. In all of these cases, agents based on techniques similar to modern ML which are very good at them will need to make use of deep problem-solving patterns which include goal-oriented reasoning. So while it’s possible to beat humans at some of these tasks without those core competencies, people usually overestimate the extent to which that’s possible.

[Yudkowsky][12:25]

Remember, a lot of my concern is about what happens first, especially if it happens soon enough that future AGI bears any resemblance whatsoever to modern ML; not about what can be done in principle.

[Soares][12:26]

(Note: it’s been 85 min, and we’re planning to take a break at 90min, so this seems like a good point for a little bit more clarifying back-and-forth on Richard’s summary before a break.)

[Ngo][12:26]

I’ll edit to say “plausible for ML techniques”?

(and “extent to which that’s plausible”)

[Yudkowsky][12:28]

I think that obvious-to-me future outgrowths of modern ML paradigms are extremely liable to, if they can learn how to do sufficiently superhuman X, generalize to taking over the world. How fast this happens does depend on X. It would plausibly happen relatively slower (at higher levels) with theorem-proving as the X, and with architectures that carefully stuck to gradient-descent-memorization over shallow network architectures to do a pattern-recognition part with search factored out (sort of, this is not generally safe, this is not a general formula for safe things!); rather than imposing anything like the genetic bottleneck you validly pointed out as a reason why humans generalize. Profitable X, and all X I can think of that would actually save the world, seem much more problematic.

[Ngo][12:30]

Okay, happy to take a break here.

[Soares][12:30]

Great timing!

[Ngo][12:30]

We can do a bit of meta discussion afterwards; my initial instinct is to push on the question of how similar Eliezer thinks alignment research is to theorem-proving.

[Yudkowsky][12:30]

Yup. This is my lunch break (actually my first-food-of-day break on a 600-calorie diet) so I can be back in 45min if you’re still up for that.

[Ngo][12:31]

Sure.

Also, if any of the spectators are reading in real time, and have suggestions or comments, I’d be interested in hearing them.

[Yudkowsky][12:31]

I’m also cheerful about spectators posting suggestions or comments during the break.

[Soares][12:32]

Sounds good. I declare us on a break for 45min, at which point we’ll reconvene (for another 90, by default).

Floor’s open to suggestions & commentary.

1.2. Requirements for science

[Yudkowsky][12:50]

I seem to be done early if people (mainly Richard) want to resume in 10min (30m break)

[Ngo][12:51]

Yepp, happy to do so

[Soares][12:57]

Some quick commentary from me:

  • It seems to me like we’re exploring a crux in the vicinity of “should we expect that systems capable of executing a pivotal act would, by default in lieu of significant technical alignment effort, be using their outputs to optimize the future”.

  • I’m curious whether you two agree that this is a crux (but plz don’t get side-tracked answering me).

  • The general discussion seems to be going well to me.

    • In particular, huzzah for careful and articulate efforts to zero in on cruxes.

[Ngo][13:00]

I think that’s a crux for the specific pivotal act of “doing better alignment research”, and maybe some other pivotal acts, but not all (or necessarily most) of them.

[Yudkowsky][13:01]

I should also say out loud that I’ve been working a bit with Ajeya on making an attempt to convey the intuitions behind there being deep patterns that generalize and are liable to be learned, which covered a bunch of ground, taught me how much ground there was, and made me relatively more reluctant to try to re-cover the same ground in this modality.

[Ngo][13:02]

Going forward, a couple of things I’d like to ask Eliezer about:

  1. In what ways are the tasks that are most useful for alignment similar or different to proving mathematical theorems (which we agreed might generalise relatively slowly to taking over the world)?

  2. What are the deep problem-solving patterns underlying these tasks?

  3. Can you summarise my position?

I was going to say that I was most optimistic about #2 in order to get these ideas into a public format

But if that’s going to happen anyway based on Ajeya’s work, then that seems less important

[Yudkowsky][13:03]

I could still try briefly and see what happens.

[Ngo][13:03]

That seems valuable to me, if you’re up for it.

At the same time, I’ll try to summarise some of my own intuitions about intelligence which I expect to be relevant.

[Yudkowsky][13:04]

I’m not sure I could summarize your position in a non-straw way. To me there’s a huge visible distance between “solve alignment for us” and “output machine-readable proofs of theorems” where I can’t give a good account of why you think talking about the latter would tell us much about the former. I don’t know what other pivotal act you think might be easier.

[Ngo][13:06]

I see. I was considering “solving scientific problems” as an alternative to “proving theorems”, with alignment being one (particularly hard) example of a scientific problem.

But decided to start by discussing theorem-proving since it seemed like a clearer-cut case.

[Yudkowsky][13:07]

Can you predict in advance why Eliezer thinks “solving scientific problems” is significantly thornier? (Where alignment is like totally not “a particularly hard example of a scientific problem” except in the sense that it has science in it at all; which is maybe the real crux; but also a more difficult issue.)

[Ngo][13:09]

Based on some of your earlier comments, I’m currently predicting that you think the step where the solutions need to be legible to and judged by humans makes science much thornier than theorem-proving, where the solutions are machine-checkable.

[Yudkowsky][13:10]

That’s one factor. Should I state the other big one or would you rather try to state it first?

[Ngo][13:10]

Requiring a lot of real-world knowledge for science?

If it’s not that, go ahead and say it.

[Yudkowsky][13:11]

That’s one way of stating it. The way I’d put it is that it’s about making up hypotheses about the real world.

Like, the real world is then a thing that the AI is modeling, at all.

Factor 3: On many interpretations of doing science, you would furthermore need to think up experiments. That’s planning, value-of-information, search for an experimental setup whose consequences distinguish between hypotheses (meaning you’re now searching for initial setups that have particular causal consequences).

[Ngo][13:12]

To me “modelling the real world” is a very continuous variable. At one end you have physics equations that are barely separable from maths problems, at the other end you have humans running around in physical bodies.

To me it seems plausible that we could build an agent which solves scientific problems but has very little self-awareness (in the sense of knowing that it’s an AI, knowing that it’s being trained, etc).

I expect that your response to this is that modelling oneself is part of the deep problem-solving patterns which AGIs are very likely to have.

[Yudkowsky][13:15]

There’s a problem of inferring the causes of sensory experience in cognition-that-does-science. (Which, in fact, also appears in the way that humans do math, and is possibly inextricable from math in general; but this is an example of the sort of deep model that says “Whoops I guess you get science from math after all”, not a thing that makes science less dangerous because it’s more like just math.)

You can build an AI that only ever drives red cars, and which, at no point in the process of driving a red car, ever needs to drive a blue car in order to drive a red car. That doesn’t mean its red-car-driving capabilities won’t be extremely close to blue-car-driving capabilities if at any point the internal cognition happens to get pointed towards driving a blue car.

The fact that there’s a deep car-driving pattern which is the same across red cars and blue cars doesn’t mean that the AI has ever driven a blue car, per se, or that it has to drive blue cars to drive red cars. But if blue cars are fire, you sure are playing with that fire.

[Ngo][13:18]

To me, “sensory experience” as in “the video and audio coming in from this body that I’m piloting” and “sensory experience” as in “a file containing the most recent results of the large hadron collider” are very very different.

(I’m not saying we could train an AI scientist just from the latter—but plausibly from data that’s closer to the latter than the former)

[Yudkowsky][13:19]

So there’s separate questions about “does an AGI inseparably need to model itself inside the world to do science” and “did we build something that would be very close to modeling itself, and could easily stumble across that by accident somewhere in the inscrutable floating-point numbers, especially if that was even slightly useful for solving the outer problems”.

[Ngo][13:19]

Hmm, I see

[Yudkowsky][13:20][13:21][13:21]

If you’re trying to build an AI that literally does science only to observations collected without the AI having had a causal impact on those observations, that’s legitimately “more dangerous than math but maybe less dangerous than active science”.

You might still stumble across an active scientist because it was a simple internal solution to something, but the outer problem would be legitimately stripped of an important structural property the same way that pure math not describing Earthly objects is stripped of important structural properties.

And of course my reaction again is, “There is no pivotal act which uses only that cognitive capability.”

[Ngo][13:20][13:21][13:26]

I guess that my (fairly strong) prior here is that something like self-modelling, which is very deeply built into basically every organism, is a very hard thing for an AI to stumble across by accident without significant optimisation pressure in that direction.

But I’m not sure how to argue this except by digging into your views on what the deep problem-solving patterns are. So if you’re still willing to briefly try and explain those, that’d be useful to me.

“Causal impact” again seems like a very continuous variable—it seems like the amount of causal impact you need to do good science is much less than the amount which is needed to, say, be a CEO.

[Yudkowsky][13:26]

The amount doesn’t seem like the key thing, nearly so much as what underlying facilities you need to do whatever amount of it you need.

[Ngo][13:27]

Agreed.

[Yudkowsky][13:27]

If you go back to the 16th century and ask for just one mRNA vaccine, that’s not much of a difference from asking for a hundred of them.

[Ngo][13:28]

Right, so the additional premise which I’m using here is that the ability to reason about causally impacting the world in order to achieve goals is something that you can have a little bit of.

Or a lot of, and that the difference between these might come down to the training data used.

Which at this point I don’t expect you to agree with.

[Yudkowsky][13:29]

If you have reduced a pivotal act to “look over the data from this hadron collider you neither built nor ran yourself”, that really is a structural step down from “do science” or “build a nanomachine”. But I can’t see any pivotal acts like that, so is that question much of a crux?

If there’s intermediate steps they might be described in my native language like “reason about causal impacts across only this one preprogrammed domain which you didn’t learn in a general way, in only this part of the cognitive architecture that is separable from the rest of the cognitive architecture”.

[Ngo][13:31]

Perhaps another way of phrasing this intermediate step is that the agent has a shallow understanding of how to induce causal impacts.

[Yudkowsky][13:31]

What is “shallow” to you?

[Ngo][13:31]

In a similar way to how you claim that GPT-3 has a shallow understanding of language.

[Yudkowsky][13:32]

So it’s memorized a ton of shallow causal-impact-inducing patterns from a large dataset, and this can be verified by, for example, presenting it with an example mildly outside the dataset and watching it fail, which we think will confirm our hypothesis that it didn’t learn any deep ways of solving that dataset.

[Ngo][13:33]

Roughly speaking, yes.

[Yudkowsky][13:34]

Eg, it wouldn’t surprise us at all if GPT-4 had learned to predict “27 * 18” but not “what is the area of a rectangle 27 meters by 18 meters”… is what I’d like to say, but Codex sure did demonstrate those two were kinda awfully proximal.

[Ngo][13:34]

Here’s one way we could flesh this out. Imagine an agent that loses coherence quickly when it’s trying to act in the world.

So for example, we’ve trained it to do scientific experiments over a period of a few hours or days

And then it’s very good at understanding the experimental data and extracting patterns from it

But upon running it for a week or a month, it loses coherence in a similar way to how GPT-3 loses coherence—e.g. it forgets what it’s doing.

My story for why this might happen is something like: there is a specific skill of having long-term memory, and we never trained our agent to have this skill, and so it has not acquired that skill (even though it can reason in very general and powerful ways in the short term).

This feels similar to the argument I was making before about how an agent might lack self-awareness, if we haven’t trained it specifically to have that.

[Yudkowsky][13:39]

There’s a set of obvious-to-me tactics for doing a pivotal act with minimal danger, which I do not think collectively make the problem safe, and one of these sets of tactics is indeed “Put a limit on the ‘attention window’ or some other internal parameter, ramp it up slowly, don’t ramp it any higher than you needed to solve the problem.”

[Ngo][13:41]

You could indeed do this manually, but my expectation is that you could also do this automatically, by training agents in environments where they don’t benefit from having long attention spans.

[Yudkowsky][13:42]

(Any time one imagines a specific tactic of this kind, if one has the security mindset, one can also imagine all sorts of ways it might go wrong; for example, an attention window can be defeated if there’s any aspect of the attended data or the internal state that ended up depending on past events in a way that leaked info about them. But, depending on how much superintelligence you were throwing around elsewhere, you could maybe get away with that, some of the time.)
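A minimal sketch of the leak described above, under toy assumptions: the "agent" is a hypothetical `WindowedAgent` class whose intended limit is a fixed-size window of recent observations, and whose only other internal state is a running total. Nothing here models a real architecture; it only shows how state that depends on past events can carry information out of the window.

```python
from collections import deque


class WindowedAgent:
    """Toy agent (hypothetical) that is meant to see only the last `window` observations."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)  # the intended "attention window"
        self.running_total = 0              # internal state that is never reset

    def observe(self, value):
        self.recent.append(value)           # old items silently fall out of the window
        self.running_total += value         # but this state still depends on them

    def leaked_sum(self):
        # Sum of everything that has already left the window,
        # recovered purely from internal state.
        return self.running_total - sum(self.recent)


agent = WindowedAgent(window=3)
for value in [7, 1, 2, 3]:
    agent.observe(value)

print(list(agent.recent))  # the 7 is no longer in the window
print(agent.leaked_sum())  # yet its value is still recoverable
```

In this toy the window limit holds exactly as specified, and the information about the earlier observation survives anyway, which is the failure mode the parenthetical points at.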

[Ngo][13:43]

And that if you put agents in environments where they answer questions but don’t interact much with the physical world, then there will be many different traits which are necessary for achieving goals in the real world which they will lack, because there was little advantage to the optimiser of building those traits in.

[Yudkowsky][13:43]

I’ll observe that TransformerXL built an attention window that generalized, trained it on I think 380 tokens or something like that, and then found that it generalized to 4000 tokens or something like that.

[Ngo][13:43]

Yeah, an order of magnitude of generalisation is not surprising to me.

[Yudkowsky][13:44]

Having observed one order of magnitude, I would personally not be surprised by two orders of magnitude either, after seeing that.

[Ngo][13:45]

I’d be a little surprised, but I assume it would happen eventually.

1.3. Capability dials

[Yudkowsky][13:46]

I have a sense that this is all circling back to the question, “But what is it we do with the intelligence thus weakened?” If you can save the world using a rock, I can build you a very safe rock.

[Ngo][13:46]

Right.

So far I’ve said “alignment research”, but I haven’t been very specific about it.

I guess some context here is that I expect that the first things we do with intelligence similar to this is create great wealth, produce a bunch of useful scientific advances, etc.

And that we’ll be in a world where people take the prospect of AGI much more seriously

[Yudkowsky][13:48]

I mostly expect—albeit with some chance that reality says “So what?” to me and surprises me, because it is not as solidly determined as some other things—that we do not hang around very long in the “weirdly ~human AGI” phase before we get into the “if you crank up this AGI it destroys the world” phase. Less than 5 years, say, to put numbers on things.

It would not surprise me in the least if the world ends before self-driving cars are sold on the mass market. On some quite plausible scenarios which I think have >50% of my probability mass at the moment, research AGI companies would be able to produce prototype car-driving AIs if they spent time on that, given the near-world-ending tech level; but there will be Many Very Serious Questions about this relatively new unproven advancement in machine learning being turned loose on the roads. And their AGI tech will gain the property “can be turned up to destroy the world” before Earth gains the property “you’re allowed to sell self-driving cars on the mass market” because there just won’t be much time.

[Ngo][13:52]

Then I expect that another thing we do with this is produce a very large amount of data which rewards AIs for following human instructions.

[Yudkowsky][13:52]

On other scenarios, of course, self-driving becomes possible by limited AI well before things start to break (further) on AGI. And on some scenarios, the way you got to AGI was via some breakthrough that is already scaling pretty fast, so by the time you can use the tech to get self-driving cars, that tech already ends the world if you turn up the dial, or that event follows very swiftly.

[Ngo][13:53]

When you talk about “cranking up the AGI”, what do you mean?

Using more compute on the same data?

[Yudkowsky][13:53]

Running it with larger bounds on the for loops, over more GPUs, to be concrete about it.

[Ngo][13:53]

In a RL setting, or a supervised, or unsupervised learning setting?

Also: can you elaborate on the for loops?

[Yudkowsky][13:56]

I do not quite think that gradient descent on Stack More Layers alone—as used by OpenAI for GPT-3, say, and as opposed to Deepmind which builds more complex artifacts like Mu Zero or AlphaFold 2—is liable to be the first path taken to AGI. I am reluctant to speculate more in print about clever ways to AGI, and I think any clever person out there will, if they are really clever and not just a fancier kind of stupid, not talk either about what they think is missing from Stack More Layers or how you would really get AGI. That said, the way that you cannot just run GPT-3 at a greater search depth, the way you can run Mu Zero at a greater search depth, is part of why I think that AGI is not likely to look exactly like GPT-3; the thing that kills us is likely to be a thing that can get more dangerous when you turn up a dial on it, not a thing that intrinsically has no dials that can make it more dangerous.
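As a rough illustration of what "a dial" in the sense of "larger bounds on the for loops" could look like, here is a toy depth-limited search. It is not Mu Zero's algorithm or anyone's actual system; the function name `best_reachable`, the move set (+1 or *2), and the target are all assumptions made up for the sketch. The only point is that the same code, with a larger loop bound, solves problems it could not solve before.

```python
def best_reachable(start, target, depth):
    """Closest value to `target` reachable from `start` in at most `depth` moves.

    A move is either +1 or *2. `depth` is the dial: the bound on the search loop.
    """
    best = start
    frontier = {start}
    for _ in range(depth):  # turning up `depth` is the only change between runs
        frontier = {n + 1 for n in frontier} | {n * 2 for n in frontier}
        candidate = min(frontier, key=lambda n: (abs(target - n), n))
        if abs(target - candidate) < abs(target - best):
            best = candidate
    return best


for depth in (1, 3, 6):
    print(depth, best_reachable(1, 20, depth))
```

With these assumed moves, the shortest route from 1 to 20 takes five steps, so the lower settings fall short and the setting of 6 reaches the target exactly. A system with no such parameter has nothing analogous to turn up.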

1.4. Consequentialist goals vs. deontologist goals

[Ngo][13:59]

Hmm, okay. Let’s take a quick step back and think about what would be useful for the last half hour.

I want to flag that my intuitions about pivotal acts are not very specific; I’m quite uncertain about how the geopolitics of that situation would work, as well as the timeframe between somewhere-near-human-level AGI and existential risk AGI.

So we could talk more about this, but I expect there’d be a lot of me saying “well we can’t rule out that X happens”, which is perhaps not the most productive mode of discourse.

A second option is digging into your intuitions about how cognition works.

[Yudkowsky][14:03]

Well, obviously, in the limit of alignment not being accessible to our civilization, and my successfully building a model weaker than reality which nonetheless correctly rules out alignment being accessible to our civilization, I could spend the rest of my short remaining lifetime arguing with people whose models are weak enough to induce some area of ignorance where for all they know you could align a thing. But that is predictably how conversations go in possible worlds where the Earth is doomed; so somebody wiser on the meta-level, though also ignorant on the object-level, might prefer to ask: “Where do you think your knowledge, rather than your ignorance, says that alignment ought to be doable and you will be surprised if it is not?”

[Ngo][14:07]

That’s a fair point. Although it seems like a structural property of the “pivotal act” framing, which builds in doom by default.

[Yudkowsky][14:08]

We could talk about that, if you think it’s a crux. Though I’m also not thinking that this whole conversation gets done in a day, so maybe for publishability reasons we should try to focus more on one line of discussion?

But I do think that lots of people get their optimism by supposing that the world can be saved by doing less dangerous things with an AGI. So it’s a big ol’ crux of mine on priors.

[Ngo][14:09]

Agreed that one line of discussion is better; I’m happy to work within the pivotal act framing for current purposes.

A third option is that I make some claims about how cognition works, and we see how much you agree with them.

[Yudkowsky][14:12]

(Though it’s something of a restatement, a reason I’m not going into “my intuitions about how cognition works” is that past experience has led me to believe that conveying this info in a form that the Other Mind will actually absorb and operate, is really quite hard and takes a long discussion, relative to my current abilities to Actually Explain things; it is the sort of thing that might take doing homework exercises to grasp how one structure is appearing in many places, as opposed to just being flatly told that to no avail, and I have not figured out the homework exercises.)

I’m cheerful about hearing your own claims about cognition and disagreeing with them.

[Ngo][14:12]

Great

Okay, so one claim is that something like deontology is a fairly natural way for minds to operate.

[Yudkowsky][14:14]

(“If that were true,” he thought at once, “bureaucracies and books of regulations would be a lot more efficient than they are in real life.”)

[Ngo][14:14]

Hmm, although I think this was probably not a very useful phrasing, let me think about how to rephrase it.

Okay, so in our earlier email discussion, we talked about the concept of “obedience”.

To me it seems like it is just as plausible for a mind to have a concept like “obedience” as its rough goal, as a concept like maximising paperclips.

If we imagine training an agent on a large amount of data which pointed in the rough direction of rewarding obedience, for example, then I imagine that by default obedience would be a constraint of comparable strength to, say, the human survival instinct.

(Which is obviously not strong enough to stop humans doing a bunch of things that contradict it—but it’s a pretty good starting point.)

[Yudkowsky][14:18]

Heh. You mean of comparable strength to the human instinct to explicitly maximize inclusive genetic fitness?

[Ngo][14:19]

Genetic fitness wasn’t a concept that our ancestors were able to understand, so it makes sense that they weren’t pointed directly towards it.

(And nor did they understand how to achieve it.)

[Yudkowsky][14:19]

Even in that paradigm, except insofar as you expect gradient descent to work very differently from gene-search optimization—which, admittedly, it does—when you optimize really hard on a thing, you get contextual correlates to it, not the thing you optimized on.

This is of course one of the Big Fundamental Problems that I expect in alignment.

[Ngo][14:20]

Right, so the main correlate that I’ve seen discussed is “do what would make the human give you a high rating, not what the human actually wants”

One thing I’m curious about is the extent to which you’re concerned about this specific correlate, versus correlates in general.

[Yudkowsky][14:21]

That said, I also see basic structural reasons why paperclips would be much easier to train than “obedience”, even if we could magically instill simple inner desires that perfectly reflected the simple outer algorithm we saw ourselves as running over many particular instances of a loss function.

[Ngo][14:22]

I’d be interested in hearing what those are.

[Yudkowsky][14:22]

well, first of all, why is a book of regulations so much more unwieldy than a hunter-gatherer?

if deontology is just as good as consequentialism, y’know.

(do you want to try replying or should I just say?)

[Ngo][14:23]

Go ahead

I should probably clarify that I agree that you can’t just replace consequentialism with deontology

The claim is more like: when it comes to high-level concepts, it’s not clear to me why high-level consequentialist goals are more natural than high-level deontological goals.

[Yudkowsky][14:24]

I reply that reality is complicated, so when you pump a simple goal through complicated reality you get complicated behaviors required to achieve the goal. If you think of reality as a complicated function Input->Probability(Output), then even to get a simple Output or a simple partition on Output or a high expected score in a simple function over Output, you may need very complicated Input.

Humans don’t trust each other. They imagine, “Well, if I just give this bureaucrat a goal, perhaps they won’t reason honestly about what it takes to achieve that goal! Oh no! Therefore I will instead, being the trustworthy and accurate person that I am, reason myself about constraints and requirements on the bureaucrat’s actions, such that, if the bureaucrat obeys these regulations, I expect the outcome of their action will be what I want.”

But (compared to a general intelligence that observes and models complicated reality and does its own search to pick actions) an actually-effective book of regulations (implemented by some nonhuman mind with a large enough and perfect enough memory to memorize it) would tend to involve a (physically unmanageable) vast number of rules saying “if you observe this, do that” to follow all the crinkles of complicated reality as it can be inferred from observation.
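A minimal sketch of this contrast, under heavy simplifying assumptions: the `world` function below is a deterministic stand-in for "complicated reality" (the message above describes a probabilistic Input->Probability(Output); that is dropped here for brevity), and the names `goal`, `consequentialist`, `rule_book`, and `deontologist` are hypothetical labels chosen for the example. The goal is a one-line test on the outcome, while the rule book needs one "if you observe this, do that" entry per anticipated observation.

```python
ACTIONS = range(10)


def world(observation, action):
    """Toy stand-in for reality: maps (observation, action) to an outcome."""
    return (observation * 7 + action * 3) % 10


def goal(outcome):
    """A simple goal: a one-line test on the outcome."""
    return outcome == 0


def consequentialist(observation):
    """Search over actions with a model of the world; pick one that meets the goal."""
    for action in ACTIONS:
        if goal(world(observation, action)):
            return action
    return None


# A "book of regulations": one rule per observation its author anticipated
# (here, observations 0..99), each rule derived by reasoning about the goal.
rule_book = {obs: consequentialist(obs) for obs in range(100)}


def deontologist(observation):
    """Follow the book; returns None when no rule covers the case."""
    return rule_book.get(observation)


print(len(rule_book))                              # size grows with cases covered
print(consequentialist(1234), deontologist(1234))  # unanticipated observation
```

Inside the anticipated range the two policies behave identically; outside it, the searcher still finds an action that meets the goal and the rule follower has nothing to say. Covering more of the world means adding more rules, whereas the goal itself stays one line.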

[Ngo][14:28]

(As a side note: do you have a rough guess for when your work with Ajeya will be made public? If it’s still a while away, I’m wondering whether it’s still useful to have a rough outline of these intuitions even if it’s in a form that very few people will internalise)

[Yudkowsky][14:30]

Plausibly useful, but not to be attempted today, I think?

[Ngo][14:30]

Agreed.

[Yudkowsky][14:30]

(We are now theoretically in overtime, which is okay for me, but for you it is 11:30pm (I think?) and so it is on you to call when to halt, now or later.)

[Ngo][14:32]

Yeah, it’s 11.30 for me. I think probably best to halt here. I agree with all the things you just said about reality being complicated, and why consequentialism is therefore valuable. My “deontology” claim (which was, in its original formulation, far too general—apologies for that) was originally intended as a way of poking into your intuitions about which types of cognition are natural or unnatural, which I think is the topic we’ve been circling around for a while.

[Yudkowsky][14:33]

Yup, and a place to resume next time might be why I think “obedience” is unnatural compared to “paperclips”—though that is a thing that probably requires taking that stab at what underlies surface competencies.

[Ngo][14:34]

Right. I do think that even a vague gesture at that would be reasonably helpful (assuming that this doesn’t already exist online?)

[Yudkowsky][14:34]

Not yet afaik, and I don’t want to point you to Ajeya’s stuff even if she were ok with that, because then this in-context conversation won’t make sense to others.

[Ngo][14:35]

For my part I should think more about pivotal acts that I’d be willing to specifically defend.

In any case, thanks for the discussion 🙂

Let me know if there’s a particular time that suits you for a follow-up; otherwise we can sort it out later.

[Soares][14:37]

(y’all are doing all my jobs for me)

[Yudkowsky][14:37]

could try Tuesday at this same time—though I may be in worse shape for dietary reasons, still, seems worth trying.

[Soares][14:37]

(wfm)

[Ngo][14:39]

Tuesday not ideal, any others work?

[Yudkowsky][14:39]

Wednesday?

[Ngo][14:40]

Yes, Wednesday would be good

[Yudkowsky][14:40]

let’s call it tentatively for that

[Soares][14:41]

Great! Thanks for the chats.

[Ngo][14:41]

Thanks both!

[Yudkowsky][14:41]

Thanks, Richard!

2. Follow-ups

2.1. Richard Ngo’s summary

[Tallinn][0:35] (Sep. 6)

just caught up here & wanted to thank nate, eliezer and (especially) richard for doing this! it’s great to see eliezer’s model being probed so intensively. i’ve learned a few new things (such as the genetic bottleneck being plausibly a big factor in human cognition). FWIW, a minor comment re deontology (as that’s fresh on my mind): in my view deontology is more about coordination than optimisation: deontological agents are more trustworthy, as they’re much easier to reason about (in the same way how functional/​declarative code is easier to reason about than imperative code). hence my steelman of bureaucracies (as well as social norms): humans just (correctly) prefer their fellow optimisers (including non-human optimisers) to be deontological for trust/​coordination reasons, and are happy to pay the resulting competence tax.

[Ngo][3:10] (Sep. 8)

Thanks Jaan! I agree that greater trust is a good reason to want agents which are deontological at some high level.

I’ve attempted a summary of the key points so far; comments welcome: [GDocs link]

[Ngo] (Sep. 8 Google Doc)

1st discussion

(Mostly summaries not quotations)

Eliezer, summarized by Richard: “To avoid catastrophe, whoever builds AGI first will have to a) align it to some extent, and b) decide not to scale it up beyond the point where their alignment techniques fail, and c) do some pivotal act that prevents others from scaling it up to that level. But on our current trajectory, our alignment techniques will be very far from adequate to create an AI that safely performs any such pivotal act.”

[Yudkowsky][11:05] (Sep. 8 comment)

will not be good enough

Are not presently on course to be good enough, missing by not a little. “Will not be good enough” is literally declaring for lying down and dying.

[Yudkowsky][16:03] (Sep. 9 comment)

will [be very far from adequate]

Same problem as the last time I commented. I am not making an unconditional prediction about future failure as would be implied by the word “will”. Conditional on current courses of action or their near neighboring courses, we seem to be well over an order of magnitude away from surviving, unless a miracle occurs. It’s still in the end a result of people doing what they seem to be doing, not an inevitability.

[Ngo][5:10] (Sep. 10 comment)

Ah, I see. Does adding “on our current trajectory” fix this?

[Yudkowsky][10:46] (Sep. 10 comment)

Yes.

[Ngo] (Sep. 8 Google Doc)

Richard, summarized by Richard: “Consider the pivotal act of ‘make a breakthrough in alignment research’. It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes).”

Eliezer, summarized by Richard: “There’s a deep connection between solving intellectual problems and taking over the world—the former requires a powerful mind to think about domains that, when solved, render very cognitively accessible strategies that can do dangerous things. Even mathematical research is a goal-oriented task which involves identifying then pursuing instrumental subgoals—and if brains which evolved to hunt on the savannah can quickly learn to do mathematics, then it’s also plausible that AIs trained to do mathematics could quickly learn a range of other skills. Since almost nobody understands the deep similarities in the cognition required for these different tasks, the distance between AIs that are able to perform fundamental scientific research, and dangerously agentic AGIs, is smaller than almost anybody expects.”

[Yudkowsky][11:05] (Sep. 8 comment)

There’s a deep connection between solving intellectual problems and taking over the world

There’s a deep connection by default between chipping flint handaxes and taking over the world, if you happen to learn how to chip handaxes in a very general way. “Intellectual” problems aren’t special in this way. And maybe you could avert the default, but that would take some work and you’d have to do it before easier default ML techniques destroyed the world.

[Ngo] (Sep. 8 Google Doc)

Richard, summarized by Richard: “Our lack of understanding about how intelligence works also makes it easy to assume that traits which co-occur in humans will also co-occur in future AIs. But human brains are badly-optimised for tasks like scientific research, and well-optimised for seeking power over the world, for reasons including a) evolving while embodied in a harsh environment; b) the genetic bottleneck; c) social environments which rewarded power-seeking. By contrast, training neural networks on tasks like mathematical or scientific research optimises them much less for seeking power. For example, GPT-3 has knowledge and reasoning capabilities but little agency, and loses coherence when run for longer timeframes.”

[Tallinn][4:19] (Sep. 8 comment)

[well-optimised for] seeking power

male-female differences might be a datapoint here (annoying as it is to lean on pinker’s point :))

[Yudkowsky][11:31] (Sep. 8 comment)

I don’t think a female Eliezer Yudkowsky doesn’t try to save /​ optimize /​ takeover the world. Men may do that for nonsmart reasons; smart men and women follow the same reasoning when they are smart enough. Eg Anna Salamon and many others.

[Ngo] (Sep. 8 Google Doc)

Eliezer, summarized by Richard: “Firstly, there’s a big difference between most scientific research and the sort of pivotal act that we’re talking about—you need to explain how AIs with a given skill can be used to actually prevent dangerous AGIs from being built. Secondly, insofar as GPT-3 has little agency, that’s because it has memorised many shallow patterns in a way which won’t directly scale up to general intelligence. Intelligence instead consists of deep problem-solving patterns which link understanding and agency at a fundamental level.”

3. September 8 conversation

3.1. The Brazilian university anecdote

[Yudkowsky][11:00]

(I am here.)

[Ngo][11:01]

Me too.

[Soares][11:01]

Welcome back!

(I’ll mostly stay out of the way again.)

[Ngo][11:02]

Cool. Eliezer, did you read the summary—and if so, do you roughly endorse it?

Also, I’ve been thinking about the best way to approach discussing your intuitions about cognition. My guess is that starting with the obedience vs paperclips thread is likely to be less useful than starting somewhere else—e.g. the description you gave near the beginning of the last discussion, about “searching for states that get fed into a result function and then a result-scoring function”.

[Yudkowsky][11:06]

made a couple of comments about phrasings in the doc

So, from my perspective, there’s this thing where… it’s really quite hard to teach certain general points by talking at people, as opposed to more specific points. Like, they’re trying to build a perpetual motion machine, and even if you can manage to argue them into believing their first design is wrong, they go looking for a new design, and the new design is complicated enough that they can no longer be convinced that they’re wrong because they managed to make a more complicated error whose refutation they couldn’t keep track of anymore.

Teaching people to see an underlying structure in a lot of places is a very hard thing to teach in this way. Richard Feynman gave an example of the mental motion in his story that ends “Look at the water!”, where people learned in classrooms about how “a medium with an index” is supposed to polarize light reflected from it, but they didn’t realize that sunlight coming off of water would be polarized. My guess is that doing this properly requires homework exercises; and that, unfortunately from my own standpoint, it happens to be a place where I have extra math talent, the same way that eg Marcello is more talented at formally proving theorems than I happen to be; and that people without the extra math talent, have to do a lot more exercises than I did, and I don’t have a good sense of which exercises to give them.

[Ngo][11:13]

I’m sympathetic to this, and can try to turn off skeptical-discussion-mode and turn on learning-mode, if you think that’ll help.

[Yudkowsky][11:14]

There’s a general insight you can have about how arithmetic is commutative, and for some people you can show them 1 + 2 = 2 + 1 and their native insight suffices to generalize over the 1 and the 2 to any other numbers you could put in there, and they realize that strings of numbers can be rearranged and all end up equivalent. For somebody else, when they’re a kid, you might have to show them 2 apples and 1 apple being put on the table in a different order but ending up with the same number of apples, and then you might have to show them again with adding up bills in different denominations, in case they didn’t generalize from apples to money. I can actually remember being a child young enough that I tried to add 3 to 5 by counting “5, 6, 7” and I thought there was some clever enough way to do that to actually get 7, if you tried hard.

Being able to see “consequentialism” is like that, from my perspective.

[Ngo][11:15]

Another possibility: can you trace the origins of this belief, and how it came out of your previous beliefs?

[Yudkowsky][11:15]

I don’t know what homework exercises to give people to make them able to see “consequentialism” all over the place, instead of inventing slightly new forms of consequentialist cognition and going “Well, now that isn’t consequentialism, right?”

Trying to say “searching for states that get fed into an input-result function and then a result-scoring function” was one attempt of mine to describe the dangerous thing in a way that would maybe sound abstract enough that people would try to generalize it more.
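
(Editorial illustration, not part of the original chat.) The phrase "searching for states that get fed into an input-result function and then a result-scoring function" can be sketched as a few lines of code. This is only a toy under stated assumptions: the names `search`, `result_fn`, and `score_fn`, and the number-line world, are hypothetical and do not come from the conversation.

```python
from itertools import product

def search(candidates, result_fn, score_fn):
    """Return the candidate input whose result scores highest."""
    return max(candidates, key=lambda c: score_fn(result_fn(c)))

# Hypothetical toy world: an input is a sequence of three moves on a number line.
def result_fn(moves, start=0):
    """Input-result function: where does this sequence of moves end up?"""
    return start + sum(moves)

def score_fn(position, target=2):
    """Result-scoring function: closer to the target is better."""
    return -abs(position - target)

candidates = list(product((-1, 0, 1), repeat=3))
best = search(candidates, result_fn, score_fn)
print(best, result_fn(best), score_fn(result_fn(best)))
```

Nothing in `search` depends on what the inputs, results, or scores mean, which is roughly the level of abstraction the description is aiming at.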

[Ngo][11:17]

Another possibility: can you describe the closest thing to real consequentialism in humans, and how it came about in us?

[Yudkowsky][11:18][11:21]

Ok, so, part of the problem is that… before you do enough homework exercises for whatever your level of talent is (and even I, at one point, had done little enough homework that I thought there might be a clever way to add 3 and 5 in order to get to 7), you tend to think that only the very crisp formal thing that’s been presented to you, is the “real” thing.

Why would your engine have to obey the laws of thermodynamics? You’re not building one of those Carnot engines you saw in the physics textbook!

Humans contain fragments of consequentialism, or bits and pieces whose interactions add up to partially imperfectly shadow consequentialism, and the critical thing is being able to see that the reason why humans’ outputs ‘work’, in a sense, is because these structures are what is doing the work, and the work gets done because of how they shadow consequentialism and only insofar as they shadow consequentialism.

Put a human in one environment, it gets food. Put a human in a different environment, it gets food again. Wow, different initial conditions, same output! There must be things inside the human that, whatever else they do, are also along the way somehow effectively searching for motor signals such that food is the end result!
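
(Editorial illustration, not part of the original chat.) The "different initial conditions, same output" observation can be shown with a toy agent, assuming a one-dimensional world where motor signals shift position by one step. All names here (`final_position`, `find_motor_plan`, `MOVES`) are hypothetical.

```python
from itertools import product

MOVES = (-1, 0, 1)

def final_position(start, motor_signals):
    """Toy environment: a 1-D line; each signal shifts position by one step."""
    pos = start
    for m in motor_signals:
        pos += m
    return pos

def find_motor_plan(start, food, horizon=4):
    """Search motor sequences for one that ends as close to the food as possible."""
    return min(product(MOVES, repeat=horizon),
               key=lambda plan: abs(final_position(start, plan) - food))

food = 3
# Vary the initial condition; record where the chosen plan ends up each time.
outcomes = {start: final_position(start, find_motor_plan(start, food))
            for start in (0, 1, 5, 7)}
print(outcomes)
```

The motor plans differ from one starting point to the next, but the end state does not. That regularity in outcomes is what the search inside the agent is responsible for.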

[Ngo][11:20]

To me it feels like you’re trying to nudge me (and by extension whoever reads this transcript) out of a specific failure mode. If I had to guess, something like: “I understand what Eliezer is talking about so now I’m justified in disagreeing with it”, or perhaps “Eliezer’s explanation didn’t make sense to me and so I’m justified in thinking that his concepts don’t make sense”. Is that right?

[Yudkowsky][11:22]

More like… from my perspective, even after I talk people out of one specific perpetual motion machine being possible, they go off and try to invent a different, more complicated perpetual motion machine.

And I am not sure what to do about that. It has been going on for a very long time from my perspective.

In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to—they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they’d spent a lot of time being exposed to over and over and over again in lots of blog posts.

Maybe there’s no way to make somebody understand why corrigibility is “unnatural” except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell’s attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.

Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, “Oh, well, I’ll just build an agent that’s good at optimizing things but doesn’t use these explicit expected utilities that are the source of the problem!”

And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.

And I have tried to write that page once or twice (eg “coherent decisions imply consistent utilities”) but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they’d have to do because this is in fact a place where I have a particular talent.

I don’t know how to solve this problem, which is why I’m falling back on talking about it at the meta-level.

[Ngo][11:30]

I’m reminded of a LW post called “Write a thousand roads to Rome”, which iirc argues in favour of trying to explain the same thing from as many angles as possible in the hope that one of them will stick.

[Soares][11:31]

(Suggestion, not-necessarily-good: having named this problem on the meta-level, attempt to have the object-level debate, while flagging instances of this as it comes up.)

[Ngo][11:31]

I endorse Nate’s suggestion.

And will try to keep the difficulty of the meta-level problem in mind and respond accordingly.

[Yudkowsky][11:33]

That (Nate’s suggestion) is probably the correct thing to do. I name it out loud because sometimes being told about the meta-problem actually does help on the object problem. It seems to help me a lot and others somewhat less, but it does help others at all, for many others.

3.2. Brain functions and outcome pumps

[Yudkowsky][11:34]

So, do you have a particular question you would ask about input-seeking cognitions? I did try to say why I mentioned those at all (it’s a different road to Rome on “consequentialism”).

[Ngo][11:36]

Let’s see. So the visual cortex is an example of quite impressive cognition in humans and many other animals. But I’d call this “pattern-recognition” rather than “searching for high-scoring results”.

[Yudkowsky][11:37]

Yup! And it is no coincidence that there are no whole animals formed entirely out of nothing but a visual cortex!

[Ngo][11:37]

Okay, cool. So you’d agree that the visual cortex is doing something that’s qualitatively quite different from the thing that animals overall are doing.

Then another question is: can you characterise searching for high-scoring results in non-human animals? Do they do it? Or are you mainly talking about humans and AGIs?

[Yudkowsky][11:39]

Also by the time you get to like the temporal lobes or something, there is probably some significant amount of “what could I be seeing that would produce this visual field?” that is searching through hypothesis-space for hypotheses with high plausibility scores, and for sure at the human level, humans will start to think, “Well, could I be seeing this? No, that theory has the following problem. How could I repair that theory?” But it is plausible that there is no low-level analogue of this in a monkey’s temporal cortex; and even more plausible that the parts of the visual cortex, if any, which do anything analogous to this, are doing it in a relatively local and definitely very domain-specific way.

Oh, that’s the cerebellum and motor cortex and so on, if we’re talking about a cat or whatever. They have to find motor plans that result in their catching the mouse.

Just because the visual cortex isn’t (obviously) running a search doesn’t mean the rest of the animal isn’t running any searches.

(On the meta-level, I notice myself hiccuping “But how could you not see that when looking at a cat?” and wondering what exercises would be required to teach that.)

[Ngo][11:41]

Well, I see something when I look at a cat, but I don’t know how well it corresponds to the concepts you’re using. So just taking it slowly for now.

I have the intuition, by the way, that the motor cortex is in some sense doing a similar thing to the visual cortex—just in reverse. So instead of taking low-level inputs and producing high-level outputs, it’s taking high-level inputs and producing low-level outputs. Would you agree with that?

[Yudkowsky][11:43]

It doesn’t directly parse in my ontology because (a) I don’t know what you mean by ‘high-level’ and (b) whole Cartesian agents can be viewed as functions, that doesn’t mean all agents can be viewed as non-searching pattern-recognizers.

That said, all parts of the cerebral cortex have surprisingly similar morphology, so it wouldn’t be at all surprising if the motor cortex is doing something similar to visual cortex. (The cerebellum, on the other hand...)

[Ngo][11:44]

The signal from the visual cortex saying “that is a cat”, and the signal to the motor cortex saying “grab that cup”, are things I’d characterise as high-level.

[Yudkowsky][11:45]

Still less of a native distinction in my ontology, but there’s an informal thing it can sort of wave at, and I can hopefully take that as understood and run with it.

[Ngo][11:45]

The firing of cells in the retina, and firing of motor neurons, are the low-level parts.

Cool. So to a first approximation, we can think about the part in between the cat recognising a mouse, and the cat’s motor cortex producing the specific neural signals required to catch the mouse, as the part where the consequentialism happens?

[Yudkowsky][11:49]

The part between the cat’s eyes seeing the mouse, and the part where the cat’s limbs move to catch the mouse, is the whole cat-agent. The whole cat agent sure is a baby consequentialist /​ searches for mouse-catching motor patterns /​ gets similarly high-scoring end results even as you vary the environment.

The visual cortex is a particular part of this system-viewed-as-a-feedforward-function that is, plausibly, by no means surely, either not very searchy, or does only small local visual-domain-specific searches not aimed per se at catching mice; it has the epistemic nature rather than the planning nature.

Then from one perspective you could reason that “well, most of the consequentialism is in the remaining cat after visual cortex has sent signals onward”. And this is in general a dangerous mode of reasoning that is liable to fail in, say, inspecting every particular neuron for consequentialism and not finding it; but in this particular case, there are significantly more consequentialist parts of the cat than the visual cortex, so I am okay running with it.

[Ngo][11:50]

Ah, the more specific thing I meant to say is: most of the consequentialism is strictly between the visual cortex and the motor cortex. Agree/​disagree?

[Yudkowsky][11:51]

Disagree, I’m rusty on my neuroanatomy but I think the motor cortex may send signals on to the cerebellum rather than the other way around.

(I may also disagree with the actual underlying notion you’re trying to hint at, so possibly not just a “well include the cerebellum then” issue, but I think I should let you respond first.)

[Ngo][11:53]

I don’t know enough neuroanatomy to chase that up, so I was going to try a different tack.

But actually, maybe it’s easier for me to say “let’s include the cerebellum” and see where you think the disagreement ends up.

[Yudkowsky][11:56]

So since cats are not (obviously) (that I have read about) cross-domain consequentialists with imaginations, their consequentialism is in bits and pieces of consequentialism embedded in them all over by the more purely pseudo-consequentialist genetic optimization loop that built them.

A cat who fails to catch a mouse may then get little bits and pieces of catbrain adjusted all over.

And then those adjusted bits and pieces get a pattern lookup later.

Why do these pattern-lookups, with no obvious immediate search element, all happen to point in the same direction of catching the mouse? Because of the past causal history of what gets looked up, which was tweaked to catch the mouse.

So it is legit harder to point out “the consequentialist parts of the cat” by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it’s not part of that consequentialist loop either.

And yes, the same applies to humans, but humans also do more explicitly searchy things and this is part of the story for why humans have spaceships and cats do not.

[Ngo][12:00]

Okay, this is interesting. So in biological agents we’ve got these three levels of consequentialism: evolution, reinforcement learning, and planning.

[Yudkowsky][12:01]

In biological agents we’ve got evolution + local evolved system-rules that in the past promoted genetic fitness. Two kinds of local rules like this are “operant-conditioning updates from success or failure” and “search through visualized plans”. I wouldn’t characterize these two kinds of rules as “levels”.

[Ngo][12:02]

Okay, I see. And when you talk about searching through visualised plans (the type of thing that humans do) can you say more about what it means for that to be a “search”?

For example, if I imagine writing a poem line-by-line, I may only be planning a few words ahead. But somehow the whole poem, which might be quite long, ends up a highly-optimised product. Is that a central example of planning?

[Yudkowsky][12:04][12:07]

Planning is one way to succeed at search. I think for purposes of understanding alignment difficulty, you want to be thinking on the level of abstraction where you see that in some sense it is the search itself that is dangerous when it’s a strong enough search, rather than the danger seeming to come from details of the planning process.

One of my early experiences in successfully generalizing my notion of intelligence, what I’d later verbalize as “computationally efficient finding of actions that produce outcomes high in a preference ordering”, was in writing an (unpublished) story about time-travel in which the universe was globally consistent.

The requirement of global consistency, the way in which all events between Paradox start and Paradox finish had to map the Paradox’s initial conditions onto the endpoint that would go back and produce those exact initial conditions, ended up imposing strong complicated constraints on reality that the Paradox in effect had to navigate using its initial conditions. The time-traveler needed to end up going through certain particular experiences that would produce the state of mind in which he’d take the actions that would end up prodding his future self elsewhere into having those experiences.

The Paradox ended up killing the people who built the time machine, for example, because they would not otherwise have allowed that person to go back in time, or kept the temporal loop open that long for any other reason if they were still alive.

Just having two examples of strongly consequentialist general optimization in front of me—human intelligence, and evolutionary biology—hadn’t been enough for me to properly generalize over a notion of optimization. Having three examples of homework problems I’d worked—human intelligence, evolutionary biology, and the fictional Paradox—caused it to finally click for me.

[Ngo][12:07]

Hmm. So to me, one of the central features of search is that you consider many possibilities. But in this poem example, I may only have explicitly considered a couple of possibilities, because I was only looking ahead a few words at a time. This seems related to the distinction Abram drew a while back between selection and control (https://www.alignmentforum.org/posts/ZDZmopKquzHYPRNxq/selection-vs-control). Do you distinguish between them in the same way as he does? Or does “control” of a system (e.g. a football player dribbling a ball down the field) count as search too in your ontology?

[Yudkowsky][12:10][12:11]

I would later try to tell people to “imagine a paperclip maximizer as not being a mind at all, imagine it as a kind of malfunctioning time machine that spits out outputs which will in fact result in larger numbers of paperclips coming to exist later”. I don’t think it clicked because people hadn’t done the same homework problems I had, and didn’t have the same “Aha!” of realizing how part of the notion and danger of intelligence could be seen in such purely material terms.

But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it’s the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!

[Ngo][12:11]

Right, I remember a very similar idea in your writing about Outcome Pumps (https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes).

[Yudkowsky][12:12]

Yup! Alas, the story was written in 2002-2003 when I was a worse writer and the real story that inspired the Outcome Pump never did get published.

[Ngo][12:14]

Okay, so I guess the natural next question is: what is it that makes you think that a strong, effective search isn’t likely to be limited or constrained in some way?

What is it about search processes (like human brains) that makes it hard to train them with blind spots, or deontological overrides, or things like that?

Hmmm, although it feels like this is a question I can probably predict your answer to. (Or maybe not, I wasn’t expecting the time travel.)

[Yudkowsky][12:15]

In one sense, they are! A paperclip-maximizing superintelligence is nowhere near as powerful as a paperclip-maximizing time machine. The time machine can do the equivalent of buying winning lottery tickets from lottery machines that have been thermodynamically randomized; a superintelligence can’t, at least not directly without rigging the lottery or whatever.

But a paperclip-maximizing strong general superintelligence is epistemically and instrumentally efficient, relative to you, or to me. Any time we see it can get at least X paperclips by doing Y, we should expect that it gets X or more paperclips by doing Y or something that leads to even more paperclips than that, because it’s not going to miss the strategy we see.

So in that sense, searching our own brains for how a time machine would get paperclips, asking ourselves how many paperclips are in principle possible and how they could be obtained, is a way of getting our own brains to consider lower bounds on the problem without the implicit stupidity assertions that our brains unwittingly use to constrain story characters. Part of the point of telling people to think about time machines instead of superintelligences was to get past the ways they imagine superintelligences being stupid. Of course that didn’t work either, but it was worth a try.

I don’t think that’s quite what you were asking about, but I want to give you a chance to see if you want to rephrase anything before I try to answer your me-reformulated questions.

[Ngo][12:20]

Yeah, I think what I wanted to ask is more like: why should we expect that, out of the space of possible minds produced by optimisation algorithms like gradient descent, strong general superintelligences are more common than other types of agents which score highly on our loss functions?

[Yudkowsky][12:20][12:23][12:24]

It depends on how hard you optimize! And whether gradient descent on a particular system can even successfully optimize that hard! Many current AIs are trained by gradient descent and yet not superintelligences at all.

But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way to solve all those subproblems is to use patterns which collectively have some coherence and overlap, and the coherence within them generalizes across all the subproblems. Lots of search orderings will stumble across something like that before they stumble across separate solutions for lots of different problems.

I suspect that you cannot get this out of large amounts of gradient descent on large layered transformers, and therefore I suspect that GPT-N does not approach superintelligence before the world is ended by systems that look differently, but I could be wrong about that.

[Ngo][12:22][12:23]

Suppose that we optimise hard enough to produce an epistemic subsystem that can make plans much better than any human’s.

My guess is that you’d say that this is possible, but that we’re much more likely to first produce a consequentialist agent which does this (rather than a purely epistemic agent which does this).

[Yudkowsky][12:24]

I am confused by what you think it means to have an “epistemic subsystem” that “makes plans much better than any human’s”. If it searches paths through time and selects high-scoring ones for output, what makes it “epistemic”?

[Ngo][12:25]

Suppose, for instance, that it doesn’t actually carry out the plans, it just writes them down for humans to look at.

[Yudkowsky][12:25]

If it can in fact do the thing that a paperclipping time machine does, what makes it any safer than a paperclipping time machine because we called it “epistemic” or by some other such name?

By what criterion is it selecting the plans that humans look at?

Why did it make a difference that its output was fed through the causal systems called humans on the way to the causal systems called protein synthesizers or the Internet or whatever? If we build a superintelligence to design nanomachines, it makes no obvious difference to its safety whether it sends DNA strings directly to a protein synthesis lab, or humans read the output and retype it manually into an email. Presumably you also don’t think that’s where the safety difference comes from. So where does the safety difference come from?

(note: lunchtime for me in 2 minutes, propose to reconvene in 30m after that)

[Ngo][12:28]

(break for half an hour sounds good)

If we consider the visual cortex at a given point in time, how does it decide which objects to recognise?

Insofar as the visual cortex can be non-consequentialist about which objects it recognises, why couldn’t a planning system be non-consequentialist about which plans it outputs?

[Yudkowsky][12:32]

This does feel to me like another “look at the water” moment, so what do you predict I’ll say about that?

[Ngo][12:34]

I predict that you say something like: in order to produce an agent that can create very good plans, we need to apply a lot of optimisation power to that agent. And if the channel through which we’re applying that optimisation power is “giving feedback on its plans”, then we don’t have a mechanism to ensure that the agent actually learns to optimise for creating really good plans, as opposed to creating plans that receive really good feedback.

[Soares][12:35]

Seems like a fine cliffhanger?

[Ngo][12:35]

Yepp.

[Soares][12:35]

Great. Let’s plan to reconvene in 30min.

3.3. Hypothetical-planning systems, nanosystems, and evolving generality

[Yudkowsky][13:03][13:11]

So the answer you expected from me, translated into my terms, would be, “If you select for the consequence of the humans hitting ‘approve’ on the plan, you’re still navigating the space of inputs for paths through time to probable outcomes (namely the humans hitting ‘approve’), so you’re still doing consequentialism.”

But suppose you manage to avoid that. Suppose you get exactly what you ask for. Then the system is still outputting plans such that, when humans follow them, they take paths through time and end up with outcomes that score high in some scoring function.

My answer is, “What the heck would it mean for a planning system to be non-consequentialist? You’re asking for nonwet water! What’s consequentialist isn’t the system that does the work, it’s the work you’re trying to do! You could imagine it being done by a cognition-free material system like a time machine and it would still be consequentialist because the output is a plan, a path through time!”

And this indeed is a case where I feel a helpless sense of not knowing how I can rephrase things, which exercises you have to get somebody to do, what fictional experience you have to walk somebody through, before they start to look at the water and see a material with an index, before they start to look at the phrase “why couldn’t a planning system be non-consequentialist about which plans it outputs” and go “um”.

My imaginary listener now replies, “Ah, but what if we have plans that don’t end up with outcomes that score high in some function?” and I reply “Then you lie on the ground randomly twitching because any outcome you end up with which is not that is one that you wanted more than that meaning you preferred it more than the outcome of random motor outputs which is optimization toward higher in the preference function which is taking a path through time that leads to particular destinations more than it leads to random noise.”

[Ngo][13:09][13:11]

Yeah, this does seem like a good example of the thing you were trying to explain at the beginning.

It still feels like there’s some sort of levels distinction going on here though, let me try to tease out that intuition.

Okay, so suppose I have a planning system that, given a situation and a goal, outputs a plan that leads from that situation to that goal.

And then suppose that we give it, as input, a situation that we’re not actually in, and it outputs a corresponding plan.

It seems to me that there’s a difference between the sense in which that planning system is consequentialist by virtue of making consequentialist plans (as in: if that plan were used in the situation described in its inputs, it would lead to some goal being achieved) versus another hypothetical agent that is just directly trying to achieve goals in the situation it’s actually in.

[Yudkowsky][13:18]

So I’d preface by saying that, if you could build such a system, which is indeed a coherent thing (it seems to me) to describe for the purpose of building it, then there would possibly be a safety difference on the margins, it would be noticeably less dangerous though still dangerous. It would need a special internal structural property that you might not get by gradient descent on a loss function with that structure, just like natural selection on inclusive genetic fitness doesn’t get you explicit fitness optimizers; you could optimize for planning in hypothetical situations, and get something that didn’t explicitly care only and strictly about hypothetical situations. And even if you did get that, the outputs that would kill or brain-corrupt the operators in hypothetical situations might also be fatal to the operators in actual situations. But that is a coherent thing to describe, and the fact that it was not optimizing our own universe, might make it safer.

With that said, I would worry that somebody would think there was some bone-deep difference of agentiness, of something they were empathizing with like personhood, of imagining goals and drives being absent or present in one case or the other, when they imagine a planner that just solves “hypothetical” problems. If you take that planner and feed it the actual world as its hypothetical, tada, it is now that big old dangerous consequentialist you were imagining before, without it having acquired some difference of psychological agency or ‘caring’ or whatever.

So I think there is an important homework exercise to do here, which is something like, “Imagine that safe-seeming system which only considers hypothetical problems. Now see that if you take that system, don’t make any other internal changes, and feed it actual problems, it’s very dangerous. Now meditate on this until you can see how the hypothetical-considering planner was extremely close in the design space to the more dangerous version, had all the dangerous latent properties, and would probably have a bunch of actual dangers too.”

“See, you thought the source of the danger was this internal property of caring about actual reality, but it wasn’t that, it was the structure of planning!”
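
(Editorial illustration, not part of the original chat.) The homework exercise above can be made concrete with a toy planner. This is a minimal sketch, assuming a tiny world where states are integers and the only moves are "inc" and "double"; `plan`, `MOVES`, and `observe_world` are hypothetical names.

```python
from collections import deque

def plan(situation, goal, moves, max_steps=10):
    """Breadth-first search for a shortest move sequence from situation to goal.

    The planner has no notion of whether `situation` is real or imagined.
    """
    frontier = deque([(situation, [])])
    seen = {situation}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        if len(path) >= max_steps:
            continue  # do not expand plans longer than max_steps
        for name, step in moves.items():
            nxt = step(state)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [name]))
    return None

# Hypothetical toy world: states are integers, two available moves.
MOVES = {"inc": lambda s: s + 1, "double": lambda s: s * 2}

def observe_world():
    """Stand-in for reading the actual situation; here it happens to be 3."""
    return 3

hypothetical_plan = plan(3, 14, MOVES)            # "a situation we're not actually in"
actual_plan = plan(observe_world(), 14, MOVES)    # same planner, fed the actual world
print(hypothetical_plan, actual_plan)
```

The two calls differ only in where the `situation` argument comes from. Nothing inside `plan` changes, which is the sense in which the hypothetical planner sits very close in design space to the version aimed at the real world.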

[Ngo][13:22]

I think we’re getting closer to the same page now.

Let’s consider this hypothetical planner for a bit. Suppose that it was trained in a way that minimised the, let’s say, adversarial component of its plans.

For example, let’s say that the plans it outputs for any situation are heavily regularised so only the broad details get through.

Hmm, I’m having a bit of trouble describing this, but basically I have an intuition that in this scenario there’s a component of its plan which is cooperative with whoever executes the plan, and a component that’s adversarial.

And I agree that there’s no fundamental difference in type between these two things.

[Yudkowsky][13:27]

“What if this potion we’re brewing has a Good Part and a Bad Part, and we could just keep the Good Parts...”

[Ngo][13:27]

Nor do I think they’re separable. But in some cases, you might expect one to be much larger than the other.

[Soares][13:29]

(I observe that my model of some other listeners, at this point, protest “there is yet a difference between the hypothetical-planner applied to actual problems, and the Big Scary Consequentialist, which is that the hypothetical planner is emitting descriptions of plans that would work if executed, whereas the big scary consequentialist is executing those plans directly.”)

(Not sure that’s a useful point to discuss, or if it helps Richard articulate, but it’s at least a place I expect some readers’ minds to go if/when this is published.)

[Yudkowsky][13:30]

(That is in fact a difference! The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.)

[Ngo][13:31]

To me it seems that Eliezer’s position is something like: “actually, in almost no training regimes do we get agents that decide which plans to output by spending almost all of their time thinking about the object-level problem, and very little of their time thinking about how to manipulate the humans carrying out the plan”.

[Yudkowsky][13:32]

My position is that the AI does not neatly separate its internals into a Part You Think Of As Good and a Part You Think Of As Bad, because that distinction is sharp in your map but not sharp in the territory or the AI’s map.

From the perspective of a paperclip-maximizing-action-outputting-time-machine, its actions are not “object-level making paperclips” or “manipulating the humans next to the time machine to deceive them about what the machine does”, they’re just physical outputs that go through time and end up with paperclips.

[Ngo][13:34]

@Nate, yeah, that’s a nice way of phrasing one point I was trying to make. And I do agree with Eliezer that these things can be very similar. But I’m claiming that in some cases these things can also be quite different—for instance, when we’re training agents that only get to output a short high-level description of the plan.

[Yudkowsky][13:35]

The danger is in how hard the agent has to work to come up with the plan. I can, for instance, build an agent that very safely outputs a high-level plan for saving the world:

echo "Hey Richard, go save the world!"

So I do have to ask what kind of “high-level” planning output, that saves the world, you are envisioning, and why it was hard to cognitively come up with such that we didn’t just make that high-level plan right now, if humans could follow it. Then I’ll look at the part where the plan was hard to come up with, and say how the agent had to understand lots of complicated things in reality and accurately navigate paths through time for those complicated things, in order to even invent the high-level plan, and hence it was very dangerous if it wasn’t navigating exactly where you hoped. Or, alternatively, I’ll say, “That plan couldn’t save the world: you’re not postulating enough superintelligence to be dangerous, and you’re also not using enough superintelligence to flip the tables on the currently extremely doomed world.”

[Ngo][13:39]

At this point I’m not envisaging a particular planning output that saves the world, I’m just trying to get more clarity on the issue of consequentialism.

[Yudkowsky][13:40]

Look at the water; it’s not the way you’re doing the work that’s dangerous, it’s the work you’re trying to do. What work are you trying to do, never mind how it gets done?

[Ngo][13:41]

I think I agree with you that, in the limit of advanced capabilities, we can’t say much about how the work is being done, we have to primarily reason from the work that we’re trying to do.

But here I’m only talking about systems that are intelligent enough to come up with plans and do research that are beyond the capability of humanity.

And for me the question is: for those systems, can we tilt the way they do the work so they spend 99% of their time trying to solve the object-level problem, and 1% of their time trying to manipulate the humans who are going to carry out the plan? (Where these are not fundamental categories for the AI, they’re just a rough categorisation that emerges after we’ve trained it—the same way that the categories of “physically moving around” and “thinking about things” aren’t fundamentally different categories of action for humans, but the way we’ve evolved means there’s a significant internal split between them.)

[Soares][13:43]

(I suspect Eliezer is not trying to make a claim of the form “in the limit of advanced capabilities, we are relegated to reasoning about what work gets done, not about how it was done”. I suspect some miscommunication. It might be a reasonable time for Richard to attempt to paraphrase Eliezer’s argument?)

(Though it also seems to me like Eliezer responding to the 99%/1% point may help shed light.)

[Yudkowsky][13:46]

Well, for one thing, I’d note that a system which is designing nanosystems, and spending 1% of its time thinking about how to kill the operators, is lethal. It has to be such a small fraction of thinking that it, like, never completes the whole thought about “well, if I did X, that would kill the operators!”

[Ngo][13:46]

Thanks for that, Nate. I’ll try to paraphrase Eliezer’s argument now.

Eliezer’s position (partly in my own terminology): we’re going to build AIs that can perform very difficult tasks using cognition which we can roughly describe as “searching over many options to find one that meets our criteria”. An AI that can solve these difficult tasks will need to be able to search in a very general and flexible way, and so it will be very difficult to constrain that search into a particular region.

Hmm, that felt like a very generic summary, let me try and think about the more specific claims he’s making.

[Yudkowsky][13:54]

An AI that can solve these difficult tasks will need to be able to

Very very little is universally necessary over the design space. The first AGI that our tech becomes able to build is liable to work in certain easier and simpler ways.

[Ngo][13:55]

Point taken; thanks for catching this misphrasing (this and previous times).

[Yudkowsky][13:56]

Can you, in principle, build a red-car-driver that is totally incapable of driving blue cars? In principle, sure! But the first red-car-driver that gradient descent stumbles over is liable to be a blue-car-driver too.

[Ngo][13:57]

Eliezer, I’m wondering how much of our disagreement is about how high the human level is here.

Or, to put it another way: we can build systems that outperform humans at quite a few tasks by now, without having search abilities that are general enough to even try to take over the world.

[Yudkowsky][13:58]

Indubitably and indeed, this is so.

[Ngo][13:59]

Putting aside for a moment the question of which tasks are pivotal enough to save the world, which parts of your model draw the line between human-level chess players and human-level galaxy-colonisers?

And say that we’ll be able to align ones that outperform us on these tasks before taking over the world, but not on these other tasks?

[Yudkowsky][13:59][14:01]

That doesn’t have a very simple answer, but one aspect there is domain generality which in turn is achieved through novel domain learning.

Humans, you will note, were not aggressively optimized by natural selection to be able to breathe underwater or fly into space. In terms of obvious outer criteria, there is not much outer sign that natural selection produced these creatures much more general than chimpanzees, by training on a much wider range of environments and loss functions.

[Soares][14:00]

(Before we drift too far from it: thanks for the summary! It seemed good to me, and I updated towards the miscommunication I feared not-having-happened.)

[Ngo][14:03]

(Before we drift too far from it: thanks for the summary! It seemed good to me, and I updated towards the miscommunication I feared not-having-happened.)

(Good to know, thanks for keeping an eye out. To be clear, I didn’t ever interpret Eliezer as making a claim explicitly about the limit of advanced capabilities; instead it just seemed to me that he was thinking about AIs significantly more advanced than the ones I’ve been thinking of. I think I phrased my point poorly.)

[Yudkowsky][14:05][14:10]

There are complicated aspects of this story where natural selection may metaphorically be said to have “had no idea of what it was doing”, eg, after early rises in intelligence possibly produced by sexual selection on neatly chipped flint handaxes or whatever, all the cumulative brain-optimization on chimpanzees reached a point where there was suddenly a sharp selection gradient on relative intelligence at Machiavellian planning against other humans (even more so than in the chimp domain) as a subtask of inclusive genetic fitness, and so continuing to optimize on “inclusive genetic fitness” in the same old savannah, turned out to happen to be optimizing hard on the subtask and internal capability of “outwit other humans”, which optimized hard on “model other humans”, which was a capability that could be reused for modeling the chimp-that-is-this-chimp, which turned the system on itself and made it reflective, which contributed greatly to its intelligence being generalized, even though it was just grinding the same loss function on the same savannah; the system being optimized happened to go there in the course of being optimized even harder for the same thing.

So one can imagine asking the question: Is there a superintelligent AGI that can quickly build nanotech, which has a kind of passive safety in some if not all respects, in virtue of it solving problems like “build a nanotech system which does X” the way that a beaver solves building dams, in virtue of having a bunch of specialized learning abilities without it ever having a cross-domain general learning ability?

And in this regard one does note that there are many, many, many things that humans do which no other animal does, which you might think would contribute a lot to that animal’s fitness if there were animalistic ways to do it. They don’t make iron claws for themselves. They never did evolve a tendency to search for iron ore, and burn wood into charcoal that could be used in hardened-clay furnaces.

No animal plays chess, but AIs do, so we can obviously make AIs to do things that animals don’t do. On the other hand, the environment didn’t exactly present any particular species with a challenge of chess-playing either.

Even so, though, even if some animal had evolved to play chess, I fully expect that current AI systems would be able to squish it at chess, because the AI systems are on chips that run faster than neurons and doing crisp calculations and there are things you just can’t do with noisy slow neurons. So that again is not a generally reliable argument about what AIs can do.

[Ngo][14:09][14:11]

Yes, although I note that challenges which are trivial from a human-engineering perspective can be very challenging from an evolutionary perspective (e.g. spinning wheels).

And so the evolution of animals-with-a-little-bit-of-help-from-humans might end up in very different places from the evolution of animals-just-by-themselves. And analogously, the ability of humans to fill in the gaps to help less general AIs achieve more might be quite significant.

[Yudkowsky][14:11]

So we can again ask: Is there a way to make an AI system that is only good at designing nanosystems, which can achieve some complicated but hopefully-specifiable real-world outcomes, without that AI also being superhuman at understanding and manipulating humans?

And I roughly answer, “Perhaps, but not by default, there’s a bunch of subproblems, I don’t actually know how to do it right now, it’s not the easiest way to get an AGI that can build nanotech (and kill you), you’ve got to make the red-car-driver specifically not be able to drive blue cars.” Can I explain how I know that? I’m really not sure I can, in real life where I explain X0 and then the listener doesn’t generalize X0 to X and respecialize it to X1.

It’s like asking me how I could possibly know in 2008, before anybody had observed AlphaFold 2, that superintelligences would be able to crack the protein folding problem on the way to nanotech, which some people did question back in 2008.

Though that was admittedly more of a slam-dunk than this was, and I could not have told you that AlphaFold 2 would become possible at a prehuman level of general intelligence in 2021 specifically, or that it would be synced in time to a couple of years after GPT-2’s level of generality at text.

[Ngo][14:18]

What are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?

[Yudkowsky][14:20]

Definitely, “turns out it’s easier than you thought to use gradient descent’s memorization of zillions of shallow patterns that overlap and recombine into larger cognitive structures, to add up to a consequentialist nanoengineer that only does nanosystems and never does sufficiently general learning to apprehend the big picture containing humans, while still understanding the goal for that pivotal act you wanted to do” is among the more plausible advance-specified miracles we could get.

But it is not what my model says actually happens, and I am not a believer that when your model says you are going to die, you get to start believing in particular miracles. You need to hold your mind open for any miracle and a miracle you didn’t expect or think of in advance, because at this point our last hope is that in fact the future is often quite surprising—though, alas, negative surprises are a tad more frequent than positive ones, when you are trying desperately to navigate using a bad map.

[Ngo][14:22]

Perhaps one metric we could use here is something like: how much extra reward does the consequentialist nanoengineer get from starting to model humans, versus from becoming better at nanoengineering?

[Yudkowsky][14:23]

But that’s not where humans came from. We didn’t get to nuclear power by getting a bunch of fitness from nuclear power plants. We got to nuclear power because if you get a bunch of fitness from chipping flint handaxes and Machiavellian scheming, as found by relatively simple and local hill-climbing, that entrains the same genes that build nuclear power plants.

[Ngo][14:24]

Only in the specific case where you also have the constraint that you keep having to learn new goals every generation.

[Yudkowsky][14:24]

Huh???

[Soares][14:24]

(I think Richard’s saying, “that’s a consequence of the genetic bottleneck”)

[Ngo][14:25]

Right.

Hmm, but I feel like we may have covered this ground before.

Suggestion: I have a couple of other directions I’d like to poke at, and then we could wrap up in 20 or 30 minutes?

[Yudkowsky][14:27]

OK

What are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?

Though I want to mark that this question seemed potentially cruxy to me, though perhaps not for others. I.e., if building protein factories that built nanofactories that built nanomachines that met a certain deep and lofty engineering goal, didn’t involve cognitive challenges different in kind from protein folding, we could maybe just safely go do that using AlphaFold 3, which would be just as safe as AlphaFold 2.

I don’t think we can do that. And I would note to the generic Other that if, to them, these both just sound like thinky things, so why can’t you just do that other thinky thing too using the thinky program, this is a case where having any specific model of why we don’t already have this nanoengineer right now would tell you there were specific different thinky things involved.

3.4. Coherence and pivotal acts

[Ngo][14:31]

In either order:

  • I’m curious how the things we’ve been talking about relate to your opinions about meta-level optimisation from the AI foom debate. (I.e. talking about how wrapping around so that there’s no longer any protected level of optimisation leads to dramatic change.)

  • I’m curious how your claims about the “robustness” of consequentialism (i.e. the difficulty of channeling an agent’s thinking in the directions we want it to go) relate to the reliance of humans on culture, and in particular the way in which humans raised without culture are such bad consequentialists.

On the first: if I were to simplify to the extreme, it seems like there are these two core intuitions that you’ve been trying to share for a long time. One is a certain type of recursive improvement, and another is a certain type of consequentialism.

[Yudkowsky][14:32]

The second question didn’t make much sense in my native ontology? Humans raised without culture don’t have access to environmental constants whose presence their genes assume, so they end up as broken machines and then they’re bad consequentialists.

[Ngo][14:35]

Hmm, good point. Okay, question modification: the ways in which humans reason, act, etc, vary greatly depending on which cultures they’re raised in. (I’m mostly thinking about differences over time—e.g. cavemen vs moderns.) My low-fidelity version of your view about consequentialists says that general consequentialists like humans possess a robust search process which isn’t so easily modified.

(Sorry if this doesn’t make much sense in your ontology, I’m getting a bit tired.)

[Yudkowsky][14:36]

What is it that varies that you think I should predict would stay more constant?

[Ngo][14:37]

Goals, styles of reasoning, deontological constraints, level of conformity.

[Yudkowsky][14:39]

With regards to your first point, my first reaction was, “I just have one view of intelligence, what you see me arguing about reflects which points people have proved weirdly obstinate about. In 2008, Robin Hanson was being weirdly obstinate about how capabilities scaled and whether there was even any point in analyzing AIs differently from ems, so I talked about what I saw as the most slam-dunk case for there being Plenty Of Room Above Biology and for stuff going whoosh once it got above the human level.

“It later turned out that capabilities started scaling a whole lot without self-improvement, which is an example of the kind of weird surprise the Future throws at you, and maybe a case where I missed something by arguing with Hanson instead of imagining how I could be wrong in either direction and not just the direction that other people wanted to argue with me about.

“Later on, people were unable to understand why alignment is hard, and got stuck on generalizing the concept I refer to as consequentialism. A theory of why I talked about both things for related reasons would just be a theory of why people got stuck on these two points for related reasons, and I think that theory would mainly be overexplaining an accident because if Yann LeCun had been running effective altruism I would have been explaining different things instead, after the people who talked a lot to EAs got stuck on a different point.”

Returning to your second point, humans are broken things; if it were possible to build computers while working even worse than humans, we’d be having this conversation at that level of intelligence instead.

[Ngo][14:41]

(Retracted) I entirely agree about humans, but it doesn’t matter that much how broken humans are when the regime of AIs that we’re talking about is the regime that’s directly above humans, and therefore only a bit less broken than humans.

[Yudkowsky][14:41]

Among the things to bear in mind about that, is that we then get tons of weird phenomena that are specific to humans, and you may be very out of luck if you start wishing for the same weird phenomena in AIs. Yes, even if you make some sort of attempt to train it using a loss function.

However, it does seem to me like as we start getting towards the Einstein level instead of the village-idiot level, even though this is usually not much of a difference, we do start to see the atmosphere start to thin already, and the turbulence start to settle down already. Von Neumann was actually a fairly reflective fellow who knew about, and indeed helped generalize, utility functions. The great achievements of von Neumann were not achieved by some very specialized hypernerd who spent all his fluid intelligence on crystallizing math and science and engineering alone, and so never developed any opinions about politics or started thinking about whether or not he had a utility function.

[Ngo][14:44]

I don’t think I’m asking for the same weird phenomena. But insofar as a bunch of the phenomena I’ve been talking about have seemed weird according to your account of consequentialism, then the fact that approximately-human-level-consequentialists have lots of weird things about them is a sign that the phenomena I’ve been talking about are less unlikely than you expect.

[Yudkowsky][14:45][14:46]

I suspect that some of the difference here is that I think you have to be noticeably better than a human at nanoengineering to pull off pivotal acts large enough to make a difference, which is why I am not instead trying to gather the smartest people left alive and doing that pivotal act directly.

I can’t think of anything you can do with somebody just barely smarter than a human, which flips the gameboard, aside of course from “go build a Friendly AI” which I did try to set up to just go do and which would be incredibly hard to align if we wanted an AI to do it instead (full-blown chicken-and-egg, that AI is already fully aligned).

[Ngo][14:45]

Oh, interesting. Actually one more question then: to what extent do you think that explicitly reasoning about utility functions and laws of rationality is what makes consequentialists have the properties you’ve been talking about?

[Yudkowsky][14:47, moved up in log]

Explicit reflection is one possible later stage of the path; an earlier part of the path is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together.

It’s the sort of path that has only one destination at its end, so there will be many ways to get there.

(Modulo various cases where different decision theories seem reflectively consistent and so on; I want to say “you know what I mean” but maybe people don’t.)

[Ngo][14:47, moved down in log]

I suspect that some of the difference here is that I think you have to be noticeably better than a human at nanoengineering to pull off pivotal acts large enough to make a difference, which is why I am not instead trying to gather the smartest people left alive and doing that pivotal act directly.

Yepp, I think there are probably some disagreements about geopolitics driving this too. E.g. in my earlier summary document I mentioned some possible pivotal acts:

  • Monitoring all potential AGI projects to an extent that makes it plausible for the US and China to work on a joint project without worrying that the other is privately racing.

  • Providing arguments/demonstrations/proofs related to impending existential risk that are sufficiently compelling to scare the key global decision-makers into bottlenecking progress.

I predict that you think these would not be pivotal enough; but I don’t think digging into the geopolitical side of things is the best use of our time.

[Yudkowsky][14:49, moved up in log]

Monitoring all AGI projects—either not politically feasible in real life given the actual way that countries behave in history books instead of fantasy; or at politically feasible levels, does not work well enough to prevent the world from ending once the know-how proliferates. The AI isn’t doing much work here either; why not go do this now, if it’s possible? (Note: please don’t try to go do this now, it backfires badly.)

Provide sufficiently compelling arguments = superhuman manipulation, an incredibly dangerous domain that is just about the worst domain to try to align.

[Ngo][14:49, moved down in log]

With regards to your first point, my first reaction was, “I just have one view of intelligence, what you see me arguing about reflects which points people have proved weirdly obstinate about. In 2008, Robin Hanson was being weirdly obstinate about how capabilities scaled and whether there was even any point in analyzing AIs differently from ems, so I talked about what I saw as the most slam-dunk case for there being Plenty Of Room Above Biology and for stuff going whoosh once it got above the human level.

“It later turned out that capabilities started scaling a whole lot without self-improvement, which is an example of the kind of weird surprise the Future throws at you, and maybe a case where I missed something by arguing with Hanson instead of imagining how I could be wrong in either direction and not just the direction that other people wanted to argue with me about.

“Later on, people were unable to understand why alignment is hard, and got stuck on generalizing the concept I refer to as consequentialism. A theory of why I talked about both things for related reasons would just be a theory of why people got stuck on these two points for related reasons, and I think that theory would mainly be overexplaining an accident because if Yann LeCun had been running effective altruism I would have been explaining different things instead, after the people who talked a lot to EAs got stuck on a different point.”

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off—which is that they defer too much to one very high-level abstraction.

[Yudkowsky][14:52]

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off—which is that they defer too much to one very high-level abstraction.

I suppose that is what it could potentially feel like from the inside to not get an abstraction. Robin Hanson kept on asking why I was trusting my abstractions so much, when he was in the process of trusting his worse abstractions instead.

[Ngo][14:51][14:53]

Explicit reflection is one possible later stage of the path; an earlier part of the path is from being optimized to do things difficult enough that you need to stop stepping on your own feet and have different parts of your thoughts work well together.

Can you explain a little more what you mean by “have different parts of your thoughts work well together”? Is this something like the capacity for metacognition; or the global workspace; or self-control; or...?

And I guess there’s no good way to quantify how important you think the explicit reflection part of the path is, compared with other parts of the path—but any rough indication of whether it’s a more or less crucial component of your view?

[Yudkowsky][14:55]

Can you explain a little more what you mean by “have different parts of your thoughts work well together”? Is this something like the capacity for metacognition; or the global workspace; or self-control; or...?

No, it’s like when you don’t, like, pay five apples for something on Monday, sell it for two oranges on Tuesday, and then trade an orange for an apple.

I have still not figured out the homework exercises to convey to somebody the Word of Power which is “coherence” by which they will be able to look at the water, and see “coherence” in places like a cat walking across the room without tripping over itself.

When you do lots of reasoning about arithmetic correctly, without making a misstep, that long chain of thoughts with many different pieces diverging and ultimately converging, ends up making some statement that is… still true and still about numbers! Wow! How do so many different thoughts add up to having this property? Wouldn’t they wander off and end up being about tribal politics instead, like on the Internet?

And one way you could look at this, is that even though all these thoughts are taking place in a bounded mind, they are shadows of a higher unbounded structure which is the model identified by the Peano axioms; all the things being said are true about the numbers. Even though somebody who was missing the point would at once object that the human contained no mechanism to evaluate each of their statements against all of the numbers, so obviously no human could ever contain a mechanism like that, so obviously you can’t explain their success by saying that each of their statements was true about the same topic of the numbers, because what could possibly implement that mechanism which (in the person’s narrow imagination) is The One Way to implement that structure, which humans don’t have?

But though mathematical reasoning can sometimes go astray, when it works at all, it works because, in fact, even bounded creatures can sometimes manage to obey local relations that in turn add up to a global coherence where all the pieces of reasoning point in the same direction, like photons in a laser lasing, even though there’s no internal mechanism that enforces the global coherence at every point.

To the extent that the outer optimizer trains you out of paying five apples on Monday for something that you trade for two oranges on Tuesday and then trading two oranges for four apples, the outer optimizer is training all the little pieces of yourself to be locally coherent in a way that can be seen as an imperfect bounded shadow of a higher unbounded structure, and then the system is powerful though imperfect because of how the power is present in the coherence and the overlap of the pieces, because of how the higher perfect structure is being imperfectly shadowed. In this case the higher structure I’m talking about is Utility, and doing homework with coherence theorems leads you to appreciate that we only know about one higher structure for this class of problems that has a dozen mathematical spotlights pointing at it saying “look here”, even though people have occasionally looked for alternatives.

And when I try to say this, people are like, “Well, I looked up a theorem, and it talked about being able to identify a unique utility function from an infinite number of choices, but if we don’t have an infinite number of choices, we can’t identify the utility function, so what relevance does this have” and this is a kind of mistake I don’t remember even coming close to making so I do not know how to make people stop doing that and maybe I can’t.
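
The apples-and-oranges trades described above can be read as a small "money pump". The following is only an illustrative sketch, not anything from the conversation itself: the names `TRADES`, `run_cycle`, and `net_change` are hypothetical, the traded item is called a "widget" for convenience, and it assumes the one-orange-for-one-apple rate is applied to both oranges. It shows that a trader who accepts each trade locally ends each cycle holding the same kinds of goods as before, minus three apples.

```python
from collections import Counter

# Hypothetical encoding of the trades in the example. Each entry maps the
# goods given up to the goods received. Applying the 1:1 orange-for-apple
# rate to both oranges is an assumption made for illustration.
TRADES = [
    ({"apple": 5}, {"widget": 1}),   # Monday: pay five apples for something
    ({"widget": 1}, {"orange": 2}),  # Tuesday: sell it for two oranges
    ({"orange": 2}, {"apple": 2}),   # then trade each orange for an apple
]


def run_cycle(holdings):
    """Apply every trade once, in order, and return the new holdings."""
    result = Counter(holdings)
    for gives, gets in TRADES:
        for good, amount in gives.items():
            if result[good] < amount:
                raise ValueError(f"not enough {good} to trade")
            result[good] -= amount
        for good, amount in gets.items():
            result[good] += amount
    return result


def net_change(before, after):
    """Return only the goods whose quantity differs between two holdings."""
    goods = set(before) | set(after)
    return {g: after[g] - before[g] for g in goods if after[g] != before[g]}


start = Counter({"apple": 10})
end = run_cycle(start)
print(net_change(start, end))  # {'apple': -3}
```

Each individual trade looks acceptable in isolation; the loss only shows up when the trades are composed, which is the sense in which local incoherence is exploitable. Repeating the cycle drains the apples until the first trade can no longer be paid for.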

[Soares][15:07]

We’re already pushing our luck on time, so I nominate that we wrap up (after, perhaps, a few more Richard responses if he’s got juice left.)

[Yudkowsky][15:07]

Yeah, was thinking the same.

[Soares][15:07]

As a proposed cliffhanger to feed into the next discussion, my take is that Richard’s comment:

On my first point, it seems to me that your claims about recursive self-improvement were off in a fairly similar way to how I think your claims about consequentialism are off—which is that they defer too much to one very high-level abstraction.

probably contains some juicy part of the disagreement, and I’m interested in Eliezer understanding Richard’s claim to the point of being able to paraphrase it to Richard’s satisfaction.

[Ngo][15:08]

Wrapping up here makes sense.

I endorse the thing Nate just said.

I also get the sense that I have a much better outline now of Eliezer’s views about consequentialism (if not the actual details and texture).

On a meta level, I personally tend to focus more on things like “how should we understand cognition” and not “how should we understand geopolitics and how it affects the level of pivotal action required”.

If someone else were trying to prosecute this disagreement they might say much more about the latter. I’m uncertain how useful it is for me to do so, given that my comparative advantage compared with the rest of the world (and probably Eliezer’s too) is the cognition part.

[Yudkowsky][15:12]

Reconvene… tomorrow? Monday of next week?

[Ngo][15:12]

Monday would work better for me.

You okay with me summarising the discussion so far to [some people — redacted for privacy reasons]?

[Yudkowsky][15:13]

Nate, take a minute to think of your own thoughts there?

[Soares: 👍 👌]

[Soares][15:15]

My take: I think it’s fine to summarize, though generally virtuous to mark summaries as summaries (rather than asserting that your summaries are Eliezer-endorsed or w/e).

[Ngo: 👍]

[Yudkowsky][15:16]

I think that broadly matches my take. I’m also a bit worried about biases in the text summarizer, and about whether I managed to say anything that Rob or somebody will object to pre-publication, but we ultimately intended this to be seen and I was keeping that in mind, so, yeah, go ahead and summarize.

[Ngo][15:17]

Great, thanks

[Yudkowsky][15:17]

I admit to being curious as to what you thought was said that was important or new, but that’s a question that can be left open to be answered at your leisure, earlier in your day.

[Ngo][15:17]

I admit to being curious as to what you thought was said that was important or new, but that’s a question that can be left open to be answered at your leisure, earlier in your day.

You mean, what I thought was worth summarising?

[Yudkowsky][15:17]

Yeah.

[Ngo][15:18]

Hmm, no particular opinion. I wasn’t going to go out of my way to do so, but since I’m chatting to [some people — redacted for privacy reasons] regularly anyway, it seemed low-cost to fill them in.

At your leisure, I’d be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

[Yudkowsky][15:19]

I don’t know if it’s going to help, but trying it currently seems better than to go on saying nothing.

[Ngo][15:20]

(personally, in addition to feeling like less of an expert on geopolitics, it also seems more sensitive for me to make claims about in public, which is another reason I haven’t been digging into that area as much)

[Soares][15:21]

(personally, in addition to feeling like less of an expert on geopolitics, it also seems more sensitive for me to make claims about in public, which is another reason I haven’t been digging into that area as much)

(seems reasonable! note, though, that i’d be quite happy to have sensitive sections stricken from the record, insofar as that lets us get more convergence than we otherwise would, while we’re already in the area)

[Ngo: 👍]

(tho ofc it is less valuable to spend conversational effort in private discussions, etc.)

[Ngo: 👍]

[Ngo][15:22]

At your leisure, I’d be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

(this question aimed at you too Nate)

Also, thanks Nate for the moderation! I found your interventions well-timed and useful.

[Soares: ❤️]

[Soares][15:23]

(this question aimed at you too Nate)

(noted, thanks, I’ll probably write something up after you’ve had the opportunity to depart for sleep.)

On that note, I declare us adjourned, with intent to reconvene at the same time on Monday.

Thanks again, both.

[Ngo][15:23]

Thanks both 🙂

Oh, actually, one quick point

Would one hour earlier suit, for Monday?

I’ve realised that I’ll be moving to a one-hour-later time zone, and starting at 9pm is slightly suboptimal (but still possible if necessary)

[Soares][15:24]

One hour earlier would work fine for me.

[Yudkowsky][15:25]

Doesn’t work as fine for me because I’ve been trying to avoid any food until 12:30p my time, but on that particular day I may be more caloried than usual from the previous day, and could possibly get away with it. (That whole day could also potentially fail if a minor medical procedure turns out to take more recovery than it did the last time I had it.)

[Ngo][15:26]

Hmm, is this something where you’d have more information on the day? (For the calories thing)

[Yudkowsky][15:27]

(seems reasonable! note, though, that i’d be quite happy to have sensitive sections stricken from the record, insofar as that lets us get more convergence than we otherwise would, while we’re already in the area)

I’m a touch reluctant to have discussions that we intend to delete, because then the larger debate will make less sense once those sections are deleted. Let’s dance around things if we can.

[Ngo: 👍][Soares: 👍]

I mean, I can that day at 10am my time say how I am doing and whether I’m in shape for that day.

[Ngo][15:28]

great. and if at that point it seems net positive to postpone to 11am your time (at the cost of me being a bit less coherent later on) then feel free to say so at the time

on that note, I’m off

[Yudkowsky][15:29]

Good night, heroic debater!

[Soares][16:11]

At your leisure, I’d be curious to know how well the directions of discussion are meeting your goals for what you want to convey when this is published, and whether there are topics you want to focus on more.

The discussions so far are meeting my goals quite well so far! (Slightly better than my expectations, hooray.) Some quick rough notes:

  • I have been enjoying EY explicating his models around consequentialism.

    • The objections Richard has been making are ones I think have been floating around for some time, and I’m quite happy to see explicit discussion of them.

    • Also, I’ve been appreciating the conversational virtue with which the two of you have been exploring it. (Assumption of good intent, charity, curiosity, etc.)

  • I’m excited to dig into Richard’s sense that EY was off about recursive self improvement, and is now off about consequentialism, in a similar way.

    • This also seems to me like a critique that’s been floating around for some time, and I’m looking forward to getting more clarity on it.

  • I’m a bit torn between driving towards clarity on the latter point, and shoring up some of the progress on the former point.

    • One artifact I’d really enjoy having is some sort of “before and after” take, from Richard, contrasting his model of EY’s views before, to his model now.

    • I also have a vague sense that there are some points Eliezer was trying to make, that didn’t quite feel like they were driven home; and dually, some pushback by Richard that didn’t feel quite frontally answered.

      • One thing I may do over the next few days is make a list of those places, and see if I can do any distilling on my own. (No promises, though.)

      • If that goes well, I might enjoy some side-channel back-and-forth with Richard about it, eg during some more convenient-for-Richard hour (or, eg, as a thing to do on Monday if EY’s not in commission at 10a pacific.)

[Ngo][5:40] (next day, Sep. 9)

The discussions so far are [...]

What do you mean by “latter point” and “former point”? (In your 6th bullet point)

[Soares][7:09] (next day, Sep. 9)

What do you mean by “latter point” and “former point”? (In your 6th bullet point)

former = shoring up the consequentialism stuff, latter = digging into your critique re: recursive self improvement etc. (The nesting of the bullets was supposed to help make that clear, but didn’t come out well in this format, oops.)

4. Follow-ups

4.1. Richard Ngo’s summary

[Ngo] (Sep. 10 Google Doc)

2nd discussion

(Mostly summaries not quotations; also hasn’t yet been evaluated by Eliezer)

Eliezer, summarized by Richard: “A core concept which people have trouble grasping is consequentialism. People try to reason about how AIs will solve problems, and ways in which they might or might not be dangerous. But they don’t realise that the ability to solve a wide range of difficult problems implies that an agent must be doing a powerful search over possible solutions, which is a core skill required to take actions which greatly affect the world. Making this type of AI safe is like trying to build an AI that drives red cars very well, but can’t drive blue cars: there’s no way you get this by default, because the skills involved are so similar. And because the search process is by default so general, I don’t currently see how to constrain it into any particular region.”

[Yudkowsky][10:48] (Sep. 10 comment)

The

A concept, which some people have had trouble grasping. There seems to be an endless list. I didn’t have to spend much time contemplating consequentialism to derive the consequences. I didn’t spend a lot of time talking about it until people started arguing.

[Yudkowsky][10:50] (Sep. 10 comment)

the

a

[Yudkowsky][10:52] (Sep. 10 comment)

[the search process] is [so general]

“is by default”. The reason I keep emphasizing that things are only true by default is that the work of surviving may look like doing hard nondefault things. I don’t take fatalistic “will happen” stances, I assess difficulties of getting nondefault results.

[Yudkowsky][10:52] (Sep. 10 comment)

it’ll be very hard to

“I don’t currently see how to”

[Ngo] (Sep. 10 Google Doc)

Eliezer, summarized by Richard (continued): “In biological organisms, evolution is the ultimate source of consequentialism. A secondary outcome of evolution is reinforcement learning. For an animal like a cat, upon catching a mouse (or failing to do so) many parts of its brain get slightly updated, in a loop that makes it more likely to catch the mouse next time. (Note, however, that this process isn’t powerful enough to make the cat a pure consequentialist; rather, it has many individual traits that, when we view them from this lens, point in the same direction.) Another outcome of evolution, which helps make humans in particular more consequentialist, is planning, especially when we’re aware of concepts like utility functions.”

[Yudkowsky][10:53] (Sep. 10 comment)

one

the ultimate

[Yudkowsky][10:53] (Sep. 10 comment)

second

secondary outcome of evolution

[Yudkowsky][10:55] (Sep. 10 comment)

especially when we’re aware of concepts like utility functions

Very slight effect on human effectiveness in almost all cases because humans have very poor reflectivity.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: “Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. I’d argue that the former is doing consequentialist reasoning without itself being a consequentialist, while the latter is actually a consequentialist. Or more succinctly: consequentialism = problem-solving skills + using those skills to choose actions which achieve goals.”

Eliezer, summarized by Richard: “The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could just be one line of code: if we give the former AI our current scenario as its input, then it becomes the latter. For purposes of understanding alignment difficulty, you want to be thinking on the level of abstraction where you see that in some sense it is the search itself that is dangerous when it’s a strong enough search, rather than the danger seeming to come from details of the planning process. One particularly helpful thought experiment is to think of advanced AI as an ‘outcome pump’ which selects from futures in which a certain outcome occurred, and takes whatever action leads to them.”

[Yudkowsky][10:59] (Sep. 10 comment)

particularly helpful

“attempted explanatory”. I don’t think most readers got it.

I’m a little puzzled by how often you write my viewpoint as thinking that whatever I happened to say a sentence about is the Key Thing. It seems to rhyme with a deeper failure of many EAs to pass the MIRI ITT.

To be a bit blunt and impolite in hopes that long-languishing social processes ever get anywhere, two obvious uncharitable explanations for why some folks may systematically misconstrue MIRI/Eliezer as believing much more than in reality that various concepts an argument wanders over are Big Ideas to us, when some conversation forces us to go to that place:

(A) It paints a comfortably unflattering picture of MIRI-the-Other as weirdly obsessed with these concepts that seem not so persuasive, or more generally paints the Other as a bunch of weirdos who stumbled across some concept like “consequentialism” and got obsessed with it. In general, to depict the Other as thinking a great deal of some idea (or explanatory thought experiment) is to tie and stake their status to the listener’s view of how much status that idea deserves. So if you say that the Other thinks a great deal of some idea that isn’t obviously high-status, that lowers the Other’s status, which can be a comfortable thing to do.

(cont.)

(B) It paints a more comfortably self-flattering picture of a continuing or persistent disagreement, as a disagreement with somebody who thinks that some random concept is much higher-status than it really is, in which case there isn’t more to be done or understood except to duly politely let the other person try to persuade you the concept deserves its high status. As opposed to, “huh, maybe there is a noncentral point that the other person sees themselves as being stopped on and forced to explain to me”, which is a much less self-flattering viewpoint on why the conversation is staying within a place. And correspondingly more of a viewpoint that somebody else is likely to have of us, because it is a comfortable view to them, than a viewpoint that it is comfortable to us to imagine them having.

Taking the viewpoint that somebody else is getting hung up on a relatively noncentral point can also be a flattering self-portrait to somebody who believes that, of course. It doesn’t mean they’re right. But it does mean that you should be aware of how the Other’s story, told from the Other’s viewpoint, is much more liable to be something that the Other finds sensible and perhaps comfortable, even if it implies an unflattering (and untrue-seeming and perhaps untrue) view of yourself, than something that makes the Other seem weird and silly and which it is easy and congruent for you yourself to imagine the Other thinking.

[Ngo][11:18] (Sep. 12 comment)

I’m a little puzzled by how often you write my viewpoint as thinking that whatever I happened to say a sentence about is the Key Thing.

In this case, I emphasised the outcome pump thought experiment because you said that the time-travelling scenario was a key moment for your understanding of optimisation, and the outcome pump seemed to be similar enough and easier to convey in the summary, since you’d already written about it.

I’m also emphasising consequentialism because it seemed like the core idea which kept coming up in our first debate, under the heading of “deep problem-solving patterns”. Although I take your earlier point that you tend to emphasise things that your interlocutor is more skeptical about, not necessarily the things which are most central to your view. But if consequentialism isn’t in fact a very central concept for you, I’d be interested to hear what role it plays.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: “There’s a component of ‘finding a plan which achieves a certain outcome’ which involves actually solving the object-level problem of how someone who is given the plan can achieve the outcome. And there’s another component which is figuring out how to manipulate that person into doing what you want. To me it seems like Eliezer’s argument is that there’s no training regime which leads an AI to spend 99% of its time thinking about the former, and 1% thinking about the latter.”

[Yudkowsky][11:20] (Sep. 10 comment)

no training regime

...that the training regimes we come up with first, in the 3 months or 2 years we have before somebody else destroys the world, will not have this property.

I don’t have any particularly complicated or amazingly insightful theories of why I keep getting depicted as a fatalist; but my world is full of counterfactual functions, not constants. And I am always aware that if we had access to a real Textbook from the Future explaining all of the methods that are actually robust in real life—the equivalent of telling us in advance about all the ReLUs that in real life were only invented and understood a few decades after sigmoids—we could go right ahead and build a superintelligence that thinks 2 + 2 = 5.

All of my assumptions about “I don’t see how to do X” are always labeled as ignorance on my part and a default because we won’t have enough time to actually figure out how to do X. I am constantly maintaining awareness of this because being wrong about it being difficult is a major place where hope potentially comes from, if there’s some idea like ReLUs that robustly vanquishes the difficulty, which I just didn’t think of. Which does not, alas, mean that I am wrong about any particular thing, nor that the infinite source of optimistic ideas that is the wider field of “AI alignment” is going to produce a good idea from the same process that generates all the previous naive optimism through not seeing where the original difficulty comes from or what other difficulties surround obvious naive attempts to solve it.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard (continued): “While this may be true in the limit of increasing intelligence, the most relevant systems are the earliest ones that are above human level. But humans deviate from the consequentialist abstraction you’re talking about in all sorts of ways—for example, being raised in different cultures can make people much more or less consequentialist. So it seems plausible that early AGIs can be superhuman while also deviating strongly from this abstraction—not necessarily in the same ways as humans, but in ways that we push them towards during training.”

Eliezer, summarized by Richard: “Even at the Einstein or von Neumann level these types of deviations start to subside. And the sort of pivotal acts which might realistically work require skills significantly above human level. I think even 1% of the cognition of an AI that can assemble advanced nanotech, thinking about how to kill humans, would doom us. Your other suggestions for pivotal acts (surveillance to restrict AGI proliferation; persuading world leaders to restrict AI development) are not politically feasible in real life, to the level required to prevent the world from ending; or else require alignment in the very dangerous domain of superhuman manipulation.”

Richard, summarized by Richard: “I think we probably also have significant disagreements about geopolitics which affect which acts we expect to be pivotal, but it seems like our comparative advantage is in discussing cognition, so let’s focus on that. We can build systems that outperform humans at quite a few tasks by now, without them needing search abilities that are general enough to even try to take over the world. Putting aside for a moment the question of which tasks are pivotal enough to save the world, which parts of your model draw the line between human-level chess players and human-level galaxy-colonisers, and say that we’ll be able to align ones that significantly outperform us on these tasks before they take over the world, but not on those tasks?”

Eliezer, summarized by Richard: “One aspect there is domain generality which in turn is achieved through novel domain learning. One can imagine asking the question: is there a superintelligent AGI that can quickly build nanotech the way that a beaver solves building dams, in virtue of having a bunch of specialized learning abilities without it ever having a cross-domain general learning ability? But there are many, many, many things that humans do which no other animal does, which you might think would contribute a lot to that animal’s fitness if there were animalistic ways to do it (e.g. mining and smelting iron). (Although comparisons to animals are not generally reliable arguments about what AIs can do; e.g. chess is much easier for chips than neurons.) So my answer is ‘Perhaps, but not by default, there’s a bunch of subproblems, I don’t actually know how to do it right now, it’s not the easiest way to get an AGI that can build nanotech.’”

[Yudkowsky][11:26] (Sep. 10 comment)

Can I explain how I know that? I’m really not sure I can.

In original text, this sentence was followed by a long attempt to explain anyways; if deleting that, which is plausibly the correct choice, this lead-in sentence should also be deleted, as otherwise it paints a false picture of how much I would try to explain anyways.

[Ngo][11:15] (Sep. 12 comment)

Makes sense; deleted.

[Ngo] (Sep. 10 Google Doc)

Richard, summarized by Richard: “Challenges which are trivial from a human-engineering perspective can be very challenging from an evolutionary perspective (e.g. spinning wheels). So the evolution of animals-with-a-little-bit-of-help-from-humans might end up in very different places from the evolution of animals-just-by-themselves. And analogously, the ability of humans to fill in the gaps to help less general AIs achieve more might be quite significant.

“On nanotech: what are the most relevant axes of difference between solving protein folding and designing nanotech that, say, self-assembles into a computer?”

Eliezer, summarized by Richard: “This question seemed potentially cruxy to me. I.e., if building protein factories that built nanofactories that built nanomachines that met a certain deep and lofty engineering goal, didn’t involve cognitive challenges different in kind from protein folding, we could maybe just safely go do that using AlphaFold 3, which would be just as safe as AlphaFold 2. I don’t think we can do that. But it is among the more plausible advance-specified miracles we could get. At this point our last hope is that in fact the future is often quite surprising.”

Richard, summarized by Richard: “It seems to me that you’re making the same mistake here as you did with regards to recursive self-improvement in the AI foom debate—namely, putting too much trust in one big abstraction.”

Eliezer, summarized by Richard: “I suppose that is what it could potentially feel like from the inside to not get an abstraction. Robin Hanson kept on asking why I was trusting my abstractions so much, when he was in the process of trusting his worse abstractions instead.”

4.2. Nate Soares’ summary

[Soares] (Sep. 12 Google Doc)

Consequentialism

Ok, here’s a handful of notes. I apologize for not getting them out until midday Sunday. My main intent here is to do some shoring up of the ground we’ve covered. I’m hoping for skims and maybe some light comment back-and-forth as seems appropriate (perhaps similar to Richard’s summary), but don’t think we should derail the main thread over it. If time is tight, I would not be offended for these notes to get little-to-no interaction.

---

My sense is that there’s a few points Eliezer was trying to transmit about consequentialism, that I’m not convinced have been received. I’m going to take a whack at it. I may well be wrong, both about whether Eliezer is in fact attempting to transmit these, and about whether Richard received them; I’m interested in both protests from Eliezer and paraphrases from Richard.

[Soares] (Sep. 12 Google Doc)

1. “The consequentialism is in the plan, not the cognition”.

I think Richard and Eliezer are coming at the concept “consequentialism” from very different angles, as evidenced eg by Richard saying (Nate’s crappy paraphrase:) “where do you think the consequentialism is in a cat?” and Eliezer responding (Nate’s crappy paraphrase:) “the cause of the apparent consequentialism of the cat’s behavior is distributed between its brain and its evolutionary history”.

In particular, I think there’s an argument here that goes something like:

  • Observe that, from our perspective, saving the world seems quite tricky, and seems likely to involve long sequences of clever actions that force the course of history into a narrow band (eg, because if we saw short sequences of dumb actions, we could just get started).

  • Suppose we were presented with a plan that allegedly describes a long sequence of clever actions that would, if executed, force the course of history into some narrow band.

    • For concreteness, suppose it is a plan that allegedly funnels history into the band where we have wealth and acclaim.

  • One plausible happenstance is that the plan is not in fact clever, and would not in fact have a forcing effect on history.

    • For example, perhaps the plan describes founding and managing some silicon valley startup, that would not work in practice.

  • Conditional on the plan having the history-funnelling property, there’s a sense in which it’s scary regardless of its source.

    • For instance, perhaps the plan describes founding and managing some silicon valley startup, and will succeed virtually every time it’s executed, by dint of having very generic descriptions of things like how to identify and respond to competition, including descriptions of methods for superhumanly-good analyses of how to psychoanalyze the competition and put pressure on their weakpoints.

    • In particular, note that one need not believe the plan was generated by some “agent-like” cognitive system that, in a self-contained way, made use of reasoning we’d characterize as “possessing objectives” and “pursuing them in the real world”.

    • More specifically, the scariness is a property of the plan itself. For instance, the fact that this plan accrues wealth and acclaim to the executor, in a wide variety of situations, regardless of what obstacles arise, implies that the plan contains course-correcting mechanisms that keep the plan on-target.

    • In other words, plans that manage to actually funnel history are (the argument goes) liable to have a wide variety of course-correction mechanisms that keep the plan oriented towards some target. And while this course-correcting property tends to be a property of history-funneling plans, the choice of target is of course free, hence the worry.

(Of course, in practice we perhaps shouldn’t be visualizing a single Plan handed to us from an AI or a time machine or whatever, but should instead imagine a system that is reacting to contingencies and replanning in realtime. At the least, this task is easier, as one can adjust only for the contingencies that are beginning to arise, rather than needing to predict them all in advance and/or describe general contingency-handling mechanisms. But, and feel free to take a moment to predict my response before reading the next sentence, “run this AI that replans autonomously on-the-fly” and “run this AI+human loop that replans+reevaluates on the fly”, are still in this sense “plans”, that still likely have the property of Eliezer!consequentialism, insofar as they work.)
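As an editorial illustration (not part of the original notes), the contrast between a plan with and without course-correcting mechanisms can be sketched with a toy one-dimensional world. Everything here is an assumption made for the example: the names `open_loop` and `closed_loop`, the list of disturbances, and the choice to close half of the remaining gap at each step. The sketch only shows the narrow point that a correcting plan stays near its target under disturbance, and that the same mechanism works for whichever target it is handed.

```python
def open_loop(start: float, moves: list[float], shocks: list[float]) -> float:
    """Execute a fixed list of moves, never looking at the current position."""
    position = start
    for move, shock in zip(moves, shocks):
        position += move + shock
    return position


def closed_loop(start: float, target: float, shocks: list[float]) -> float:
    """Each step, close half of the remaining gap to `target`, then get shocked."""
    position = start
    for shock in shocks:
        position += (target - position) / 2  # course correction
        position += shock                    # unplanned disturbance
    return position


# Hypothetical disturbances the plan's author did not foresee.
shocks = [3.0, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
fixed_moves = [10.0 / len(shocks)] * len(shocks)  # would reach 10 with no shocks

print(abs(open_loop(0.0, fixed_moves, shocks) - 10.0))  # miss distance, fixed plan
print(abs(closed_loop(0.0, 10.0, shocks) - 10.0))       # miss distance, correcting plan
print(abs(closed_loop(0.0, -10.0, shocks) + 10.0))      # same mechanism, other target
```

The fixed plan's miss is exactly the sum of the disturbances, while the correcting plan ends close to whichever target it was given. Nothing in the correction step depends on which target that is.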

[Soares] (Sep. 12 Google Doc)

There’s a part of this argument I have not yet driven home. Factoring it out into a separate bullet:

2. “If a plan is good enough to work, it’s pretty consequentialist in practice”.

In attempts to collect and distill a handful of scattered arguments of Eliezer’s:

If you ask GPT-3 to generate you a plan for saving the world, it will not manage to generate one that is very detailed. And if you tortured a big language model into giving you a detailed plan for saving the world, the resulting plan would not work. In particular, it would be full of errors like insensitivity to circumstance, suggesting impossible actions, and suggesting actions that run entirely at cross-purposes to one another.

A plan that is sensitive to circumstance, and that describes actions that synergize rather than conflict—like, in Eliezer’s analogy, photons in a laser—is much better able to funnel history into a narrow band.

But, on Eliezer’s view as I understand it, this “the plan is not constantly tripping over its own toes” property, goes hand-in-hand with what he calls “consequentialism”. As a particularly stark and formal instance of the connection, observe that one way a plan can trip over its own toes is if it says “then trade 5 oranges for 2 apples, then trade 2 apples for 4 oranges”. This is clearly an instance of the plan failing to “lase”—of some orange-needing part of the plan working at cross-purposes to some apple-needing part of the plan, or something like that. And this is also a case where it’s easy to see how if a plan is “lasing” with respect to apples and oranges, then it is behaving as if governed by some coherent preference.

And the point as I understand it isn’t “all toe-tripping looks superficially like an inconsistent preference”, but rather “insofar as a plan does manage to chain a bunch of synergistic actions together, it manages to do so precisely insofar as it is Eliezer!consequentialist”.
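The orange/apple example above can be checked mechanically. The following is an editorial sketch in Python, not something from the original notes; `Trade`, `run_plan`, and `is_dominated` are hypothetical names, and the starting holdings of five oranges are assumed so that the first trade is possible. It applies the two trades in order and then asks whether the plan ended with no more of anything and strictly less of something.

```python
from collections import Counter
from typing import NamedTuple


class Trade(NamedTuple):
    """Give up `give_qty` of `give`, receive `get_qty` of `get` (hypothetical helper)."""
    give: str
    give_qty: int
    get: str
    get_qty: int


def run_plan(holdings: dict[str, int], plan: list[Trade]) -> dict[str, int]:
    """Apply each trade in order and return the final holdings."""
    result = Counter(holdings)
    for t in plan:
        if result[t.give] < t.give_qty:
            raise ValueError(f"cannot give {t.give_qty} {t.give}")
        result[t.give] -= t.give_qty
        result[t.get] += t.get_qty
    return dict(result)


def is_dominated(before: dict[str, int], after: dict[str, int]) -> bool:
    """True if `after` has no more of anything and strictly less of something."""
    goods = set(before) | set(after)
    no_gain = all(after.get(g, 0) <= before.get(g, 0) for g in goods)
    some_loss = any(after.get(g, 0) < before.get(g, 0) for g in goods)
    return no_gain and some_loss


start = {"oranges": 5, "apples": 0}  # assumed starting holdings
plan = [
    Trade("oranges", 5, "apples", 2),
    Trade("apples", 2, "oranges", 4),
]
end = run_plan(start, plan)
print(end)                       # holdings after both trades
print(is_dominated(start, end))  # whether the plan only lost goods
```

Under the assumption that more of either fruit is never worse, a plan that ends up dominated in this sense has spent resources for nothing. This is only the stark, formal case, and the check is not meant to detect the subtler forms of toe-tripping the surrounding text describes.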

cf the analogy to information theory, where if you’re staring at a maze and you’re trying to build an accurate representation of that maze in your own head, you will succeed precisely insofar as your process is Bayesian / information-theoretic. And, like, this is supposed to feel like a fairly tautological claim: you (almost certainly) can’t get the image of a maze in your head to match the maze in the world by visualizing a maze at random, you have to add visualized-walls using some process that’s correlated with the presence of actual walls. Your maze-visualizing process will work precisely insofar as you have access to & correctly make use of, observations that correlate with the presence of actual walls. You might also visualize extra walls in locations where it’s politically expedient to believe that there’s a wall, and you might also avoid visualizing walls in a bunch of distant regions of the maze because it’s dark and you haven’t got all day, but the resulting visualization in your head is accurate precisely insofar as you’re managing to act kinda like a Bayesian.
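One way to make the maze point concrete, offered as an editorial sketch and not as anything from the original discussion: treat a single cell as either wall or open, and update a belief about it from one noisy observation using Bayes' rule. The function name `posterior_wall` and the `accuracy` parameter (the assumed probability that an observation reports the cell's true state) are inventions for this example.

```python
def posterior_wall(prior: float, accuracy: float, saw_wall: bool) -> float:
    """Posterior probability of a wall after one noisy observation.

    `accuracy` is the assumed probability that the observation reports the
    true state of the cell (wall or open).
    """
    # Likelihood of this observation if there is a wall / if there is not.
    like_wall = accuracy if saw_wall else 1.0 - accuracy
    like_open = 1.0 - accuracy if saw_wall else accuracy
    evidence = prior * like_wall + (1.0 - prior) * like_open
    return prior * like_wall / evidence


# Uncorrelated "observation" (accuracy 0.5): the belief does not move.
print(posterior_wall(prior=0.5, accuracy=0.5, saw_wall=True))

# Correlated observation (accuracy 0.9): the belief moves toward what was seen.
print(posterior_wall(prior=0.5, accuracy=0.9, saw_wall=True))
```

An observation that is right only half the time carries no information about the wall, so the belief stays at the prior. The belief can only come to match the maze to the extent that the observations are correlated with the actual walls.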

Similarly (the analogy goes), a plan works-in-concert and avoids-stepping-on-its-own-toes precisely insofar as it is consequentialist. These are two sides of the same coin, two ways of seeing the same thing.

And, I’m not so much attempting to argue the point here, as to make sure that the shape of the argument (as I understand it) has been understood by Richard. In particular, the shape of the argument I see Eliezer as making is that “clumsy” plans don’t work, and “laser-like plans” work insofar as they are managing to act kinda like a consequentialist.

Rephrasing again: we have a wide variety of mathematical theorems all spotlighting, from different angles, the fact that a plan lacking in clumsiness, is possessing of coherence.

(“And”, my model of Eliezer is quick to note, “this ofc does not mean that all sufficiently intelligent minds must generate very-coherent plans. If you really knew what you were doing, you could design a mind that emits plans that always “trip over themselves” along one particular axis, just as with sufficient mastery you could build a mind that believes 2+2=5 (for some reasonable cashing-out of that claim). But you don’t get this for free—and there’s a sort of “attractor” here, when building cognitive systems, where just as generic training will tend to cause it to have true beliefs, so will generic training tend to cause its plans to lase.”)

(And ofc much of the worry is that all the mathematical theorems that suggest “this plan manages to work precisely insofar as it’s lasing in some direction”, say nothing about which direction it must lase. Hence, if you show me a plan clever enough to force history into some narrow band, I can be fairly confident it’s doing a bunch of lasing, but not at all confident which direction it’s lasing in.)

[Soares] (Sep. 12 Google Doc)

One of my guesses is that Richard does in fact understand this argument (though I personally would benefit from a paraphrase, to test this hypothesis!), and perhaps even buys it, but that Richard gets off the train at a following step, namely that we need plans that “lase”, because ones that don’t aren’t strong enough to save us. (Where in particular, I suspect most of the disagreement is in how far one can get with plans that are more like language-model outputs and less like lasers, rather than in the question of which pivotal acts would put an end to the acute risk period.)

But setting that aside for a moment, I want to use the above terminology to restate another point I saw Eliezer as attempting to make: one big trouble with alignment, in the case where we need our plans to be like lasers, is that on the one hand we need our plans to be like lasers, but on the other hand we want them to fail to be like lasers along certain specific dimensions.

For instance, the plan presumably needs to involve all sorts of mechanisms for refocusing the laser in the case where the environment contains fog, and redirecting the laser in the case where the environment contains mirrors (...the analogy is getting a bit strained here, sorry, bear with me), so that it can in fact hit a narrow and distant target. Refocusing and redirecting to stay on target are part and parcel of plans that can hit narrow distant targets.

But the humans shutting the AI down is like scattering the laser, and the humans tweaking the AI so that it plans in a different direction is like them tossing up mirrors that redirect the laser; and we want the plan to fail to correct for those interferences.

As such, on the Eliezer view as I understand it, we can see ourselves as asking for a very unnatural sort of object: a path-through-the-future that is robust enough to funnel history into a narrow band in a very wide array of circumstances, but somehow insensitive to specific breeds of human-initiated attempts to switch which narrow band it’s pointed towards.

Ok. I meandered into trying to re-articulate the point over and over until I had a version distilled enough for my own satisfaction (which is much like arguing the point); apologies for the repetition.

I don’t think debating the claim is the right move at the moment (though I’m happy to hear rejoinders!). Things I would like, though, are: Eliezer saying whether the above is on-track from his perspective (and if not, then poking a few holes); and Richard attempting to paraphrase the above, such that I believe the arguments themselves have been communicated (saying nothing about whether Richard also buys them).

---

[Soares] (Sep. 12 Google Doc)

My Richard-model’s stance on the above points is something like “This all seems kinda plausible, but where Eliezer reads it as arguing that we had better figure out how to handle lasers, I read it as an argument that we’d better save the world without needing to resort to lasers. Perhaps if I thought the world could not be saved except by lasers, I would share many of your concerns, but I do not believe that, and in particular it looks to me like much of the recent progress in the field of AI—from AlphaGo to GPT to AlphaFold—is evidence in favor of the proposition that we’ll be able to save the world without lasers.”

And I recall actual-Eliezer saying the following (more-or-less in response, iiuc, though readers note that I might be misunderstanding and this might be out-of-context):

Definitely, “turns out it’s easier than you thought to use gradient descent’s memorization of zillions of shallow patterns that overlap and recombine into larger cognitive structures, to add up to a consequentialist nanoengineer that only does nanosystems and never does sufficiently general learning to apprehend the big picture containing humans, while still understanding the goal for that pivotal act you wanted to do” is among the more plausible advance-specified miracles we could get.

On my view, and I think on Eliezer’s, the “zillions of shallow patterns”-style AI that we see today is not going to be sufficient to save the world (nor to destroy it). There’s a bunch of reasons that GPT and AlphaZero aren’t destroying the world yet, and one of them is this “shallowness” property. And, yes, maybe we’ll be wrong! I myself have been surprised by how far the shallow pattern memorization has gone (and, for instance, was surprised by GPT), and acknowledge that perhaps I will continue to be surprised. But I continue to predict that the shallow stuff won’t be enough.

I have the sense that lots of folk in the community are, one way or another, saying “Why not consider the problems of aligning systems that memorize zillions of shallow patterns?”. And my answer is, “I still don’t expect those sorts of machines to either kill or save us, I’m still expecting that there’s a phase shift that won’t happen until AI systems start to be able to make plans that are sufficiently deep and laserlike to do scary stuff, and I’m still expecting that the real alignment challenges are in that regime.”

And this seems to me close to the heart of the disagreement: some people (like me!) have an intuition that it’s quite unlikely that figuring out how to get sufficient work out of shallow-memorizers is enough to save us, and I suspect others (perhaps even Richard!) have the sense that the aforementioned “phase shift” is the unlikely scenario, and that I’m focusing on a weird and unlucky corner of the space. (I’m curious whether you endorse this, Richard, or some nearby correction of it.)

In particular, Richard, I am curious whether you endorse something like the following:

  • I’m focusing ~all my efforts on the shallow-memorizers case, because I think shallow-memorizer-alignment will by and large be sufficient, and even if it is not then I expect it’s a good way to prepare ourselves for whatever we’ll turn out to need in practice. In particular I don’t put much stock in the idea that there’s a predictable phase-change that forces us to deal with laser-like planners, nor that predictable problems in that domain give large present reason to worry.

(I suspect not, at least not in precisely this form, and I’m eager for corrections.)

I suspect something in this vicinity constitutes a crux of the disagreement, and I would be thrilled if we could get it distilled down to something as concise as the above. And, for the record, I personally endorse the following counter to the above:

  • I am focusing ~none of my efforts on shallow-memorizer-alignment, as I expect it to be far from sufficient, as I do not expect a singularity until we have more laser-like systems, and I think that the laserlike-planning regime has a host of predictable alignment difficulties that Earth does not seem at all prepared to face (unlike, it seems to me, the shallow-memorizer alignment difficulties), and as such I have large and present worries.

---

[Soares] (Sep. 12 Google Doc)

Ok, and now a few less substantial points:

There’s a point Richard made here:

Oh, interesting. Actually one more question then: to what extent do you think that explicitly reasoning about utility functions and laws of rationality is what makes consequentialists have the properties you’ve been talking about?

that I suspect constituted a miscommunication, especially given that the following sentence appeared in Richard’s summary:

A third thing that makes humans in particular consequentialist is planning, especially when we’re aware of concepts like utility functions.

In particular, I suspect Richard’s model of Eliezer’s model places (or placed, before Richard read Eliezer’s comments on Richard’s summary) some particular emphasis on systems reflecting and thinking about their own strategies, as a method by which the consequentialism and/or effectiveness gets in. I suspect this is a misunderstanding, and am happy to say more on my model upon request, but am hopeful that the points I made a few pages above have cleared this up.

Finally, I observe that there are a few places where Eliezer keeps beeping when Richard attempts to summarize him, and I suspect it would be useful to do the dorky thing of Richard very explicitly naming Eliezer’s beeps as he understands them, for purposes of getting common knowledge of understanding. For instance, things I think it might be useful for Richard to say verbatim (assuming he believes them, which I suspect, and subject to Eliezer-corrections, b/c maybe I’m saying things that induce separate beeps):

1. Eliezer doesn’t believe it’s impossible to build AIs that have most any given property, including most any given safety property, including most any “non-consequentialist” or “deferential” property you might desire. Rather, Eliezer believes that many desirable safety properties don’t happen by default, and require mastery of minds that likely takes a worrying amount of time to acquire.

2. The points about consequentialism are not particularly central in Eliezer’s view; they seem to him more like obvious background facts; the reason conversation has lingered here in the EA-sphere is that this is a point that many folk in the local community disagree on.

For the record, I think it might also be worth Eliezer acknowledging that Richard probably understands point (1), and that glossing “you don’t get it for free by default and we aren’t on course to have the time to get it” as “you can’t” is quite reasonable when summarizing. (And it might be worth Richard counter-acknowledging that the distinction is actually quite important once you buy the surrounding arguments, as it constitutes the difference between describing the current playing field and lying down to die.) I don’t think any of these are high-priority, but they might be useful if easy :-)

---

Finally, stating the obvious-to-me, none of this is intended as criticism of either party, and all discussing parties have exhibited significant virtue-according-to-Nate throughout this process.

[Yudkowsky][21:27] (Sep. 12)

From Nate’s notes:

For instance, the plan presumably needs to involve all sorts of mechanisms for refocusing the laser in the case where the environment contains fog, and redirecting the laser in the case where the environment contains mirrors (...the analogy is getting a bit strained here, sorry, bear with me), so that it can in fact hit a narrow and distant target. Refocusing and redirecting to stay on target are part and parcel of plans that can hit narrow distant targets.

But the humans shutting the AI down is like scattering the laser, and the humans tweaking the AI so that it plans in a different direction is like them tossing up mirrors that redirect the laser; and we want the plan to fail to correct for those interferences.

--> GOOD ANALOGY.

...or at least it sure conveys to me why corrigibility is anticonvergent / anticoherent / actually moderately strongly contrary to and not just an orthogonal property of a powerful-plan generator.

But then, I already know why that’s true and how it generalized up to resisting our various attempts to solve small pieces of more important aspects of it—it’s not just true by weak default, it’s true by a stronger default where a roomful of people at a workshop spend several days trying to come up with increasingly complicated ways to describe a system that will let you shut it down (but not steer you through time into shutting it down), and all of those suggested ways get shot down. (And yes, people outside MIRI now and then publish papers saying they totally just solved this problem, but all of those “solutions” are things we considered and dismissed as trivially failing to scale to powerful agents—they didn’t understand what we considered to be the first-order problems in the first place—rather than these being evidence that MIRI just didn’t have smart-enough people at the workshop.)

[Yudkowsky][18:56] (Nov. 5 follow-up comment)

Eg, “Well, we took a system that only learned from reinforcement on situations it had previously been in, and couldn’t use imagination to plan for things it had never seen, and then we found that if we didn’t update it on shut-down situations it wasn’t reinforced to avoid shutdowns!”