I think that skews it somewhat but not very much. We only have to “win” once in the sense that we only need to build an aligned Sovereign that ends the acute risk period once, similarly to how we only have to “lose” once in the sense that we only need to build a misaligned superintelligence that kills everyone once.
(I like the discussion on similar points in the strategy-stealing assumption.)
Is building an aligned sovereign to end the acute risk period different to a pivotal act in your view?
Depends what the aligned sovereign does! Also depends what you mean by a pivotal act!
In practice, during the period when biological humans are still doing a meaningful part of alignment work, I don’t expect us to build an aligned sovereign, nor do I expect us to build a single misaligned AI that takes over: I instead expect there to be a large number of AI systems that could together obtain a decisive strategic advantage, but could not do so individually.
So, if I’m understanding you correctly:
- if it’s possible to build a single AI system that executes a catastrophic takeover (via self-bootstrap or whatever), it’s also probably possible to build a single aligned sovereign, and so in this situation winning once is sufficient
- if it is not possible to build a single aligned sovereign, then it’s probably also not possible to build a single system that executes a catastrophic takeover, and so the proposition that the model only has to win once is not true in any straightforward way
- in this case, we might be able to think of “composite AI systems” that can catastrophically take over or end the acute risk period, and for similar reasons as in the first scenario, winning once with a composite system is sufficient, but such systems are not built from single acts
and you think the second scenario is more likely than the first.
Yes, that’s right, though I’d say “probable” not “possible” (most things are “possible”).