Rob Bensinger comments on Some abstract, non-technical reasons to be non-maximally-pessimistic about AI alignment

Rob Bensinger 12 Dec 2021 23:51 UTC
18 points
0
I’d argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
I think this is straightforwardly true in two different ways:
- Prior to the deep learning revolution, Eliezer didn’t predict that ANNs would be a big deal — he expected other, neither-GOFAI-nor-connectionist approaches to AI to be the ones that hit milestones like ‘solve Go’.
- MIRI thinks the current DL paradigm isn’t alignable, so we made a bet on trying to come up with more alignable AI approaches (which we thought probably wouldn’t succeed, but considered high-enough-EV to be worth the attempt).
I don’t think this has anything to do with the OP, but I’m happy to talk about it in its own right. The most relevant thing would be if we lost a bet like ‘we predict deep learning will be too opaque to align’, but we are still just as pessimistic about humanity’s ability to align deep nets as ever, so if you think we’ve hugely underestimated the tractability of aligning deep nets, I’d need to hear more about why. What’s the path to achieving astronomically good outcomes, on the assumption that the first AGI systems are produced by 2021-style ML methods?
- jacob_cannell 13 Dec 2021 1:47 UTC
  6 points
  0
  Parent
  Thanks, strong upvote, this is especially clarifying.
  Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.
  The weakly alignable baseline should be “marginally better than humans”. Achieving that baseline as an MVP should be an emergency-level, high-priority civilization project, even if the risk of doom from DL AGI is only 1% (and to be clear, I’m quite uncertain, but it’s probably considerably higher). Ideally we should always have an MVP alignment solution in place.
  My thoughts on your last question are probably best expressed in a short post rather than a comment thread, but in summary:
  DL methods are based on simple universal learning architectures (e.g. transformers, but AGI will probably be built on something even more powerful). The important properties of the resulting agents are thus much more a function of the data / training environment than of the architecture. You can rather easily limit an AGI’s power by constraining its environment. For example, we have nothing to fear from AGIs trained solely in Atari. We have much more to fear from agents trained by eating the internet. Boxing is stupid, but sim sandboxing is key.
  As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans—ie the approximate alignment solution that evolution found. We can then improve and iterate on that in simulations. I’m somewhat optimistic that it’s no more complex than other major brain systems we’ve already mostly reverse engineered.
  The danger of course is that testing and iterating could use enormous resources, past the point where you already have a dangerous architecture that could be extracted. Nonetheless, I think this approach is much better than nothing, and amenable to (potentially amplified) iterative refinement.
  - Rob Bensinger 13 Dec 2021 22:57 UTC
    8 points
    0
    Parent
    Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.
    I don’t know what “strongly alignable”, “robust, high certainty paradigm”, or “approximately/weakly alignable” mean here. As I said in another comment:
    There are two problems here:
    Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
    Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.
    Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it’s something humanity can solve at its leisure once we’ve solved #1 and therefore aren’t at immediate risk of destroying ourselves.
    If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not ‘what’s possible in principle, given arbitrarily large amounts of time?’, but rather ‘what can we do in practice to actually end the acute risk period / ensure we don’t blow ourselves up in the immediate future?’.
    (Where I’m imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)
    The weakly alignable baseline should be “marginally better than humans”.
    I don’t understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)
    I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I’m thinking in terms of ‘what’s the least difficult-to-align act humanity could attempt with an AGI?’.
    Maybe you mean something different by “marginally better than humans”?
    As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans—ie the approximate alignment solution that evolution found.
    I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’), not a Problem #1 research direction (‘we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well’).
    - Steven Byrnes 18 Dec 2021 22:17 UTC
      3 points
      0
      Parent
      For what it’s worth, I’m cautiously optimistic that “reverse-engineering the circuits underlying empathy/love/altruism/etc.” is a realistic thing to do in years, not decades, and can mostly be done in our current state of knowledge (i.e. before we have AGI-capable learning algorithms to play with; basically, I think of AGI capabilities as largely involving learning algorithm development, and of empathy/whatnot as largely involving supervisory signals such as reward functions). I can share more details if you’re interested.
    - jacob_cannell 14 Dec 2021 17:53 UTC
      2 points
      0
      Parent
      Maybe you mean something different by “marginally better than humans”?
      No, I meant “merely as aligned as a human”, which is why I used “approximately/weakly” aligned, as the system which mostly aligns humans to humans is imperfect and not what I would have assumed you meant as a full Problem #2 type solution.
      I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred.
      I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’),
      Alright so now I’m guessing the crux is that you believe the DL based reverse engineered human empathy/altruism type solution I was alluding to—let’s just call that DLA—may take subjective centuries, which thus suggests that you believe:
      That DLA is significantly more difficult than DL AGI in general
      That uploading is likewise significantly more difficult
      or perhaps
      DLA isn’t necessarily super hard, but is irrelevant because non-DL AGI (for which DLA isn’t effective) comes first
      Is any of that right?
      - Rob Bensinger 16 Dec 2021 7:02 UTC
        2 points
        0
        Parent
        Sounds right, yeah!