I disagree with my characterization as thinking that problems can be solved on paper, and with the name “Poet”. I think the problems can’t be solved by twiddling systems weak enough to be passively safe, and hoping their behavior generalizes up to dangerous levels. I don’t think paper solutions will work either, and humanity needs to back off and augment intelligence before proceeding. I do not take the position that we need a global shutdown of this research field because I think that guessing stuff without trying it is easy, but because guessing it even with some safe weak lesser tries is still impossibly hard. My message to humanity is “back off and augment” not “back off and solve it with a clever theory”.
Thank you for the clarification.

How do you expect augmented humanity will solve the problem? Will it be something other than “guessing it with some safe weak lesser tries / clever theory”?
They can solve it however they like, once they’re past the point of expecting things to work that sometimes don’t work. I have guesses but any group that still needs my hints should wait and augment harder.
I think this is somewhat harmful to there being a field of (MIRI-style) Agent Foundations. It seems pretty bad to require that people attempting to start in the field have to work out the foundations themselves; I don’t think any scientific fields have worked this way in the past.
Maybe the view is that if people can’t work out the basics then they won’t be able to make progress, but this doesn’t seem at all clear to me. Many physicists in the 20th century were unable to derive the basics of quantum mechanics or general relativity, but once they were given the foundations they were able to do useful work. I think the skills needed to work out the foundations of a new field can be different from the skills needed to build on those foundations.
Also, maybe these “hints” aren’t that useful and so aren’t worth sharing. Or (more likely in my view) the hints are tied up with dangerous information such that sharing them increases risk, and you want more signal on someone’s ability to do good work before taking that risk.
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular. If folks still feel that they need low-confidence hints after that, then I think they will probably not benefit much from hearing Eliezer’s current guesses, and I suspect they may be trying to guess the teacher’s passwords rather than solve the problem.
Is there actually anyone who has read all those words and thus understands how Eliezer would tackle the problem? (Do you, for example?)
This is not a rhetorical question, nor a trivial one; for example, I notice that @Vanessa Kosoy apparently misunderstood some major part of it (as evidenced by this comment thread), and she’s been a MIRI researcher for 8 years (right? or so LinkedIn says, anyhow). So… who does get it?
I think this is a reasonable response, but also, independently of whether Eliezer successfully got the generators of his thought-process across, the volume of words still seems like substantial evidence that it’s reasonable for Eliezer not to think that writing marginally more will drastically change things from his perspective.
Sure. My comment was not intended to bear on the question of whether it’s useful for Eliezer to write more words or not—I was only responding directly to Ben.
EDIT: Although, of course, if the already-written words haven’t been effective, then that is also evidence that writing more words won’t help. So, either way, I agree with your view.

Scientific breakthroughs live on the margins, so if he has guesses on how to achieve alignment, sharing them could make a huge difference.
I am a bit unsure of the standard here for “understands how Eliezer would tackle the problem”. Is the standard “the underlying philosophy was successfully communicated to lots of people”? If so I’ll note as background that it is pretty standard that such underlying philosophies of solving problems are hard to communicate — someone who reads Shannon or Pearl’s writing does not become equal to them, someone who read Faraday’s notes couldn’t pick up his powerful insights, until Maxwell formalized them neatly[1], and (for picking up philosophies broadly) someone who reads Aurelius or Hume or Mill cannot produce the sorts of works those people would have produced had they been alive for longer or in the present day.
Is the standard “people who read Eliezer’s writing go on to do research that Eliezer judges to be remotely on the right track”? Then I think a couple of folks have done stuff here that Eliezer would view as at all on the right track, including lots of MIRI researchers, Wei Dai, John Wentworth, the mesa-optimizers paper, etc. It would take a bit of effort for me to give an overview of all the folks who took parts of Eliezer’s writing and did useful things as a direct result of the specific ideas, and then to assess how successful that counts as.
My current guess is that Eliezer’s research dreams have not been that successfully picked up by others relative to what I would have predicted 10 years ago. I am not confident why this is — perhaps other institutional support has been lacking for people to work on it, perhaps there have been active forces in the direction of “not particularly trying to understand AI” due to the massive success of engineering approaches over scientific approaches, perhaps Eliezer’s particular approach has flaws that make it not very productive.
I think my primary guess for why there has been less progress on Eliezer’s research dreams is that the subject domain is itself very annoying to get in contact with, due to us not having a variety of superintelligences to play with, and my anticipation that when we do get to that stage our lives will soon be over, so it’s much harder to make any progress on these problems than it is in the vast majority of other domains humans have been successful in.
Nate Soares has a blogpost that discusses this point that I found insightful, here’s a quote.
Faraday discovered a wide array of electromagnetic phenomena, guided by an intuition that he wasn’t able to formalize or transmit except through hundreds of pages of detailed laboratory notes and diagrams; Maxwell later invented the language to describe electromagnetism formally by reading Faraday’s work, and expressed those hundreds of pages of intuitions in three lines.
I am a bit unsure of the standard here for “understands how Eliezer would tackle the problem”.
Huh? Isn’t this a question for you, not for me? You wrote this:
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular.
So, the standard is… whatever you had in mind there?
Oh. Here’s a compressed gloss of this conversation as I understand it.
Eliezer: People should give up on producing alignment theory directly and instead should build augmented humans to produce aligned agents.
Vanessa: And how do you think the augmented humans will go about doing so?
Eliezer: I mean, the point here is that’s a problem for them to solve. Insofar as they need my help they have not augmented enough. (Ben reads into Eliezer’s comment: And insofar as you wish to elicit my approach, I have already written over a million words on this, a few marginal guesses ain’t gonna do much.)
Peter: It seems wrong to not share your guesses here, people should not have to build the foundations themselves without help from Eliezer.
Ben: Eliezer has spent many years writing up and helping build the foundations, this comment seems very confusing given that context.
Said: But have Eliezer’s writings worked?
Ben: (I’m not quite sure what the implied relationship is of this question to the question of whether Eliezer has tried very hard to help build a foundation for alignment research, but I will answer the question directly.) To the incredibly high standard that ~nobody ever succeeds at, it has not succeeded. To the lower standard that scientists sometimes succeed at, it has worked a little but has not been sufficient.
My first guess was that you’re implicitly asking because, if it has not been successful, then you think Eliezer still ought to answer questions like Vanessa’s. I am not confident in this guess.
If you’re simply asking for clarification on whether Eliezer’s writing works to convey his philosophy and approach to the alignment problem, I have now told you two standards to which I would evaluate this question and how I think it does on those standards. Do you understand why I wrote what I wrote in my reply to Peter?
Uh… I think you’re somewhat overcomplicating things. Again, you wrote (emphasis mine):
If folks want to understand how Eliezer would tackle the problem they can read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular.
In other words, you wrote that if people want X, they can do Y. (Implying that doing Y will, with non-trivial probability, cause people to gain X.[1]) My question was simply whether there exists anyone who has done Y and now, as a consequence, has X.
There are basically two possible answers to this:
“Yes, persons A, B, C have all done Y, and now, as a consequence, have X.”
Or
“No, there exists no such person who has done Y and now, as a consequence, has X.”
(Which may be because nobody has done Y, or it may be because some people have done Y, but have not gained X.)
There’s not any need to bring in questions of standards or trying hard or… anything like that.
So, to substitute the values back in for the variables, I could ask:
Would you say that you “understand how Eliezer would tackle the problem”?
and:
Have you “read the over-one-million thoughtfully written words he has published across LessWrong and Arbital and in MIRI papers about how to solve hard problems in general and how to think about AI in particular”?
Likewise, the same question about Vanessa: does she “understand … etc.”, and has she “read the … etc.”?
Again, what I mean by all of these words is just whatever you meant when you wrote them.
“Implying” in the Gricean sense, of course, not in the logical sense. Strictly speaking, if we drop all Gricean assumptions, we could read your original claim as being analogous to, e.g., “if folks want to grow ten feet tall, they can paint their toenails blue”—which is entirely true, and yet does not, strictly speaking, entail any causal claims. I assumed that this kind of thing is not what you meant, because… that would make your comment basically pointless.
Ah, but I think you have assumed slightly too much about what I meant. I simply meant to say that if someone wants Eliezer’s help in their goal to “work out the foundations” of “a field of (MIRI-style) Agent Foundations”, the better way to understand Eliezer’s perspective on these difficult questions is to read the high-effort and thoughtful blog posts and papers he produced that are intended to communicate his perspective on these questions, rather than to ask him for some quick guesses on how one could in principle solve the problems. I did not mean to imply that either way would necessarily work, as the goal itself is hard to achieve; I simply meant that one approach is clearly much more likely than the other to achieve that goal.
(That said, again to answer your question, my current guess is that Nate Soares is an example of a person who read those works and then came to share a lot of Eliezer’s approach to solving the problems. Though I’m honestly not sure how much he would put down to the factors of (a) reading the writing (b) working directly with Eliezer (c) trying to solve the problem himself and coming up with similar approaches. I also think that Wei Dai at least understood enough to make substantial progress on an open research question in decision theory, and similar things can be said of Scott Garrabrant re: logical uncertainty and others at MIRI.)
Fwiw this doesn’t feel like a super helpful comment to me. I think there might be a nearby one that’s more useful, but this felt kinda coy for the sake of being coy.
Yeah, I feel this is quite similar to OpenAI’s plan to defer alignment to future AI researchers, except worse. If we grant that the proposed plan would actually make the augmented humans stably aligned with our values, then it would be far easier to do scalable oversight instead, because we have a bunch of advantages around controlling AIs: for example, it would be socially acceptable to control AI in ways that wouldn’t be socially acceptable if they involved humans, the incentives to control AI are much stronger than the incentives to control humans, etc.

I truly feel like Eliezer has reinvented a plan that OpenAI/Anthropic are already pursuing, except worse, which is deferring alignment work to future intelligences, and Eliezer doesn’t seem to realize this, so the comments treat it as though it’s something new rather than a plan already underway, just with AI swapped out for humans.

It’s not just coy; it’s reinventing an idea that’s already there, except worse, and he doesn’t tell you that if you swap the humans for AI, it’s already being done.
Link for why AI is easier to control than humans below:
https://optimists.ai/2023/11/28/ai-is-easy-to-control/

fwiw, this seems false to me and not particularly related to what I was saying.

Even a small probability of solving alignment should have big expected utility, modulo exfohazard. So why not share your guesses?
I don’t know whether augmentation is the right step after backing off or not, but I do know that the simpler “back off” is a much better message to send to humanity than that. More digestible, more likely to be heard, more likely to be understood, doesn’t cause people to peg you as a rational tech bro, doesn’t at all sound like the beginning of a sci-fi apocalypse plot line. I could go on.
I feel like this “back off and augment” is downstream of an implicit theory of intelligence that is specifically unsuited to dealing with how existing examples of intelligence seem to work. Epistemic status: the idea used to make sense to me and apparently no longer does, in a way that seems related to the ways I’ve updated my theories of cognition over the past years.
Very roughly: networking cognitive agents stacks up into cognitive agency at the next level up more easily than expected, and life has evolved to exploit this dynamic from very early on and across scales. It’s a gestalt observation and apparently very difficult to articulate into a rational argument. I could point to memory in gene regulatory networks, Michael Levin’s work on non-neural cognition, the trainability of computational ecological models (they can apparently be trained to solve sudoku), long-term trends in cultural-cognitive evolution, and theoretical difficulties with traditional models of biological evolution—but I don’t know how to make the constellation of data points easily distinguishable from pareidolia.
My message to humanity is “back off and augment” not “back off and solve it with a clever theory”.
Is there a reason why post-‘augmented’ individuals would even pay attention to the existing writings/opinions/desires/etc… of anyone, or anything, up to now?
Or is this literally suggesting to leave everything in their future hands?
Yep, this is basically OpenAI’s alignment plan, but worse. IMO I’m pretty bullish on that plan, but yes this is pretty clearly already done, and I’m rather surprised by Eliezer’s comment here.
Augmenting humans to do better alignment research seems like a pretty different proposal to building artificial alignment researchers.
The former is about making (presumed-aligned) humans more intelligent, which is a biology problem, while the latter is about making (presumed-intelligent) AIs aligned, which is a computer science problem.
I think my crux is that if we assume that humans are scalable in intelligence without the assumption that they become misaligned, then it becomes much easier to argue that we’d be able to align AI without having to go through the process, for the reason sketched out by jdp here:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/?commentId=7iBb7aF4ctfjLH6AC
I think the crux is an epistemological question that goes something like: “How much can we trust complex systems that can’t be statically analyzed in a reductionistic way?” The answer you give in this post is “way less than what’s necessary to trust a superintelligence”. Before we get into any object level about whether that’s right or not, it should be noted that this same answer would apply to actual biological intelligence enhancement and uploading in actual practice. There is no way you would be comfortable with 300+ IQ humans walking around with normal status drives and animal instincts if you’re shivering cold at the idea of machines smarter than people.
I think you have a wrong model of the process, which comes from conflating outcome-alignment and intent-alignment.
Current LLMs are outcome-aligned, i.e., they produce “good” outputs. But, on the pessimist model, the internal mechanisms by which an LLM produces “good outputs” have nothing in common with “being nice” or “caring about humans”, and are more like “producing weird text patterns”; if we make LLMs sufficiently smarter, they turn the world into text patterns or do something else unpredictable. I.e., it’s not that the control structures of LLMs are nice right now and stop being nice when we make the LLM smarter; they simply aren’t about “being nice” in the first place.

On the other hand, humans are at least somewhat intent-aligned, and if we don’t use really radical rearrangements of brain matter, we can expect them to stay intent-aligned.
The ‘message’ surprised me since it seems to run counter to the whole point of LW.
Namely, that non-super-geniuses, mostly just moderately-above-average folks, can participate and have some chance of producing genuinely novel insights that future people will actually care to remember, based on the principle of the supposed wisdom of the userbase ‘masses’ rubbing their ideas together enough times.
Plus a few just-merely-geniuses shepherding them.
But if this method can’t produce any meaningful results in the long term...
OpenAI never advocated for the aforementioned, so it isn’t as surprising if they adopt the “everything hinges on the future ubermensch” plan.
Maybe. But it wouldn’t make sense to judge an approach to a technical problem, alignment, based on what philosophy it was produced with. If we tried that philosophy and it didn’t work, that’s a reasonable thing to say and advocate for.
I don’t think Eliezer’s reasoning for that conclusion is nearly adequate, and we still have almost no idea how hard alignment is, because the conversation has broken down.
I disagree with my characterization as thinking that problems can be solved on paper
Would you say the point of MIRI was/is to create theory that would later lead to safe experiments (but that it hasn’t happened yet)? Sort of like how the Manhattan project discovered enough physics to not nuke themselves, and then started experimenting? 🤔
I think two issues here should be discussed separately:
technical feasibility
whether this or that route to intelligence augmentation should be actually undertaken
I suspect that intelligence augmentation is much more feasible in the short term than people usually assume. Namely, I think that enabling people to tightly couple themselves with specialized electronic devices via high-end non-invasive BCI is likely to do a lot in this sense.

This route should be much safer, much quicker, and much cheaper than a Neuralink-like approach, and I think it can still do a lot.
Even though the approach with non-invasive BCI is much safer than Neuralink, the risks on the personal level are nevertheless formidable.
On the social level, we don’t really know if the resulting augmented humans/hybrid human-electronic entities will be “Friendly”.
So, should we try it? My personal AI timelines are rather short, and existing risks are formidable...
So, I would advocate organizing an exploratory project of this kind to see if this is indeed technically feasible on a short-term time scale (my expectation is that a small group can obtain measurable progress here within months, not years), and to ponder the various safety issues more deeply before scaling it or before sharing the obtained technological advances in a more public fashion...
enabling people to tightly couple themselves with specialized electronic devices via high-end non-invasive BCI
Have you written more about why you think this is (to quote you) much more feasible in the short term than people usually assume / can you point me to writeups by others in this regard?
Yes, I think there are four components here:
how good is non-invasive reading from the brain
how good is non-invasive writing into the brain or brain modulation, especially when assisted by feedback from the reading
what are the risks, and how manageable they are
what are the possible set-ups to use this (ranging from relatively softcore set-ups like electronic versions of nootropics, stimulants, and psychedelics, to more hardcore setups like tightly integrated information processing by a biological brain and an electronic device together)
Starting with the question of possible set-ups: I have been thinking about this on and off since the peak of my “rave and psytrance days”, which were long ago, and I wrote a possible design spec about this 10-12 years ago. Of course, this is one of many possible designs, and I am sure other designs of this kind exist, but it is one possible example of how this can be done: https://github.com/anhinga/2021-notes/tree/main/mind-games
The most crucial question is how good non-invasive reading from the brain is. We have been seeing rapid progress here in the last few years. I noticed the first promising report from 2019 and made a note of it at the bottom of mind-games/post-2-measuring-conscious-state.md, but these days we are inundated by reports of progress and successes in non-invasive reading from the brain via various channels, so the question now is more “yes, we can do a lot even with something as simple as a high-end EEG, but can we do enough with a super-convenient low-end consumer-grade headband EEG or in-the-ear EEG, so that it’s not just non-invasive, but actually non-interfering with convenience?”
So, in the sense of non-invasive reading, there are reasons for optimism.
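As a toy illustration of the kind of signal processing even a simple EEG pipeline rests on, here is a minimal sketch of estimating spectral band power from one window of samples. This is my own illustrative example, not something from the notes linked above; it assumes Python with numpy, and the band limits and sampling rate are arbitrary.

```python
import numpy as np

def band_power(samples: np.ndarray, sample_rate: float,
               low_hz: float, high_hz: float) -> float:
    """Estimate signal power within [low_hz, high_hz] for one window of samples."""
    windowed = samples * np.hanning(len(samples))        # taper to reduce spectral leakage
    spectrum = np.fft.rfft(windowed)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs <= high_hz)
    return float(np.sum(np.abs(spectrum[mask]) ** 2))

# Example: power in the 8-12 Hz ("alpha") band of two seconds of fake data
# sampled at 256 Hz. A real device adds channels, artifacts, and preprocessing.
fake_window = np.random.randn(2 * 256)
print(band_power(fake_window, sample_rate=256.0, low_hz=8.0, high_hz=12.0))
```

Real devices differ a lot in channel count, artifact handling, and preprocessing; this is only meant to make “reading” concrete at the simplest end.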
With writing and modulation, audio-visual channels are not just information-carrying but strongly psychoactive. For example, following MIT reports on the curative properties of 40 Hz audio-visual stimulation, I self-experimented with 40 Hz sound (mostly in the form of a 40 Hz sine-wave test tone) and found it strongly stimulating and also acoustically priming.
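For concreteness, here is a minimal sketch of generating such a test tone, using only the Python standard library. This is my own illustration; the filename, duration, and amplitude are arbitrary choices, and the cautions about risk discussed below apply to any actual use.

```python
# Generate a 40 Hz sine-wave test tone as a WAV file (illustrative sketch only;
# amplitude is deliberately kept low, and volume should be kept low in practice).
import math
import struct
import wave

SAMPLE_RATE = 44100   # samples per second
FREQ_HZ = 40.0        # the 40 Hz tone discussed above
DURATION_S = 10       # ten seconds of audio
AMPLITUDE = 0.2       # fraction of the full 16-bit range

with wave.open("tone_40hz.wav", "w") as f:
    f.setnchannels(1)          # mono
    f.setsampwidth(2)          # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    frames = bytearray()
    for n in range(SAMPLE_RATE * DURATION_S):
        sample = AMPLITUDE * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE)
        frames += struct.pack("<h", int(sample * 32767))
    f.writeframes(bytes(frames))
```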
In this sense, the information is still going into the brain in the ordinary fashion via the audio-visual channels, but there is also strong neuromodulation. And then, if one simultaneously reads from the brain, one has real-time feedback and can tune the impact a lot (but this is associated with potentially increased risks). I wrote more about this in mind-games/post-4-closing-the-loop.md
People have also been exploring transcranial magnetic stimulation, transcranial direct current stimulation, and especially transcranial ultrasound in recent years. Speaking of transcranial ultrasound, there is a long series of posts between Oct 6 and Dec 11 on this substack, and it covers both the potential of this and evaluates the risks:
https://sarahconstantin.substack.com/p/ultrasound-neuromodulation (Oct 6)
a lot of posts on this topic in between
https://sarahconstantin.substack.com/p/risks-of-ultrasound-neuromodulation (Dec 11)

Now, it’s a good point to move to risks.
It’s really easy to cause full-blown seizures just by being a bit too aggressive with strobe lights. (I did it to myself once, many years ago, with a light-and-sound machine (inexpensive eyeglasses with flashing lights) by disobeying the instructions to keep my eyes closed, because it was so boring to keep them closed, and the visuals were so pretty when the eyes were open, and getting prettier every few seconds, and even prettier, and then … you know.)
When dealing with a closed feedback loop the risks are very formidable (even if there is no “AI” on the electronic side of things, and there might be one). I start to discuss the appropriate safety protocols in mind-games/post-4-closing-the-loop.md
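To make the shape of such a loop (and of the kind of safety protocol it needs) concrete, here is a purely schematic sketch. `read_eeg_band_power` and `set_tone_amplitude` are hypothetical placeholders rather than any real device API, and the hard caps stand in for the sort of protocol discussed in that post.

```python
# Purely schematic closed-loop sketch: read a neural signal, nudge the stimulus,
# and clamp everything against hard limits enforced outside the adaptive logic.
import time

MAX_AMPLITUDE = 0.2    # hard cap on stimulus intensity, never exceeded
MAX_SESSION_S = 600    # hard cap on session length, in seconds
STEP = 0.01            # small adjustment per iteration; abrupt changes are the risk

def read_eeg_band_power() -> float:
    """Placeholder: return the current band power reported by the EEG device."""
    raise NotImplementedError

def set_tone_amplitude(amplitude: float) -> None:
    """Placeholder: set the amplitude of the audio stimulus."""
    raise NotImplementedError

def run_session(target_power: float) -> None:
    amplitude = 0.05
    start = time.monotonic()
    while time.monotonic() - start < MAX_SESSION_S:
        power = read_eeg_band_power()
        # Nudge the stimulus gently toward the target, clamped to the caps.
        if power < target_power:
            amplitude = min(amplitude + STEP, MAX_AMPLITUDE)
        else:
            amplitude = max(amplitude - STEP, 0.0)
        set_tone_amplitude(amplitude)
        time.sleep(1.0)  # slow update rate on purpose
```

The point of the structure is that the limits live outside whatever adaptive (or “AI”) logic sits inside the loop, which is the kind of safety property one would want to pin down before experimenting.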
The post on risks of transcranial ultrasound by Sarah Constantin does make me quite apprehensive (e.g. there is a company named Prophetic AI, which hopes to have an EEG/transcranial ultrasound headband stabilizing lucid dreams, and I am sure it’s doable, but am I comfortable with the risks here? It’s a strange situation where the risks are “officially low”, but it’s less clear whether they are all that low in reality).
So, yes, if we can navigate the risks, I think that our capabilities to read from the brain are now very powerful (and we can read from the body, polygraph-style too, but particularly minding the risk of the feedback situation here), and we can achieve at minimum pretty effective cognitive modulation via audio-visual channels, and probably much more...
Of course, people will try to have narrowly crafted AIs on the electronic side (these days the thought is rather obvious), so if one pushes harder one can really optimize a joint cognitive process by a human and a narrow AI interacting with each other, but can this be done in a safe and beneficial manner?
Basically, the non-invasiveness of the interfaces should not lull us into a false sense of safety in these kinds of experiments, and I think one needs to keep a laser-sharp focus on risk management; but other than that, from a purely technical viewpoint, the pieces seem to be ready.