Eliezer Yudkowsky comments on A simple case for extreme inner misalignment

Eliezer Yudkowsky Jul 16, 2024, 2:11 PM
LW: 27 AF: 10
19
The part where squiggles are small and simple is unimportant. They could be bigger and more complicated, like building giant mechanical clocks. The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.
- Eliezer Yudkowsky Jul 16, 2024, 7:59 PM
  LW: 27 AF: 12
  15
  Actually, to slightly amend that: The part where squiggles are small is a more than randomly likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument. Most of the time we don’t needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to do the fun without using vastly more resources than are necessary for that amount of fun, because then we’ll have needlessly used up all our resources and not get to have more fun. We buy cookies that cost a dollar instead of a hundred thousand dollars. A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess. “Great giant squiggles of nickel the size of a solar system would be no more valuable, even from a very embracing and cosmopolitan perspective on value” is the load-bearing part.
  - Richard_Ngo Aug 9, 2024, 1:22 AM
    LW: 3 AF: 3
    0
    > The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.
    I agree that the particular type of misaligned goal is not crucial. I’m thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it’s very clear that they’re not valuable. If you told me that molecular squiggles weren’t a central example of a goal that you think a misaligned superintelligence might have, then I’d update, but it sounds like your statements are consistent with this.
    > A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess.
    If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value “small molecular squiggles” versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?
    > They could be bigger and more complicated, like building giant mechanical clocks.
    Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human’s preferences about how human civilization is structured?
    - Eliezer Yudkowsky Aug 9, 2024, 6:14 PM
      LW: 21 AF: 12
      8
      > If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value “small molecular squiggles” versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?
      Value them primarily? Uhhh… maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn’t sound off by, like, more than 1-2 orders of magnitude in either direction.
      > Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human’s preferences about how human civilization is structured?
      It wouldn’t shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection. (Where (b) strikes me as way less probable than (a), but not wholly forbidden.) The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.
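
A note on the odds arithmetic in the last reply: "1:3 against" and "1 in 4" are the same estimate, since odds of 1 in favor to 3 against correspond to a probability of 1/(1+3). The sketch below checks that conversion and shows what "off by 1-2 orders of magnitude in either direction" could look like. It assumes the orders of magnitude are applied to the odds ratio rather than to the probability; the thread does not say which, and that reading is chosen here only because it keeps the result between 0 and 1. The function names are illustrative.

```python
from fractions import Fraction


def odds_to_probability(in_favor, against):
    """Convert odds of in_favor:against into a probability."""
    return Fraction(in_favor, in_favor + against)


def shift_odds(probability, orders_of_magnitude):
    """Scale the odds ratio by 10**orders_of_magnitude, then convert back.

    Assumption: the uncertainty is applied to the odds, not the probability.
    """
    odds = probability / (1 - probability)
    shifted = odds * Fraction(10) ** orders_of_magnitude
    return shifted / (1 + shifted)


p = odds_to_probability(1, 3)
print("1:3 odds as a probability:", p)

# Range implied by being off by 1 or 2 orders of magnitude either way.
for k in (-2, -1, 1, 2):
    shifted = shift_odds(p, k)
    print(f"shift of {k:+d} orders: {shifted} (about {float(shifted):.3f})")
```

Under that assumption the estimate spans roughly 1 in 301 to 100 in 103, which is a very wide range and consistent with the reply's stated uncertainty.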