In the following, an event is “catastrophic” if it endangers several human lives; it need not be an existential catastrophe.
Edit: I meant to say “deceptive alignment”, but the meaning should be clear either way.
“Catastrophic” is normally used in the term “global catastrophic risk” and means something like “kills 100,000s of people”, so I do think “doesn’t necessarily kill but could’ve killed a couple of people” is a fairly different meaning. In retrospect I realize that I put my answer to the second question far too high: if it just means “a deceptively aligned system nearly gives a few people in hospital a fatal dosage, but it’s stopped and we don’t know why the system messed up”, then it’s quite plausible that nothing this substantial will happen as a result.
Agreed. In retrospect, I might have opted for “pre-AGI nearly-deadly accident caused by deceptive alignment.”
I intended the situation to be more like “we catch the AI pretending to be aligned but actually lying, and it kills, or nearly kills, at least a few people as a result.”
With #1, I’m trying to have people predict the “deception is robustly instrumental behavior, but AIs will be bad at it at first and so we’ll catch them” scenario. #2 is trying to operationalize whether this would be viewed as a fire alarm.
Some ways you might think scenario #1 won’t happen:
You don’t think deception will be incentivized
Fast takeoff means the AI is never both smart enough to deceive and dumb enough to get caught
Our transparency tools won’t be good enough for many people to believe it was actually deceptively aligned
Also: we solve alignment really well on paper, and that’s why deception doesn’t arise. (I assign non-trivial probability to this.)