In the following, an event is “catastrophic” if it endangers several human lives; it need not be an existential catastrophe.
Edit: I meant to say “deceptive alignment”, but the meaning should be clear either way.
“Catastrophic” is normally used in the term “global catastrophic risk” and means something like “kills 100,000s of people”, so I do think “doesn’t necessarily kill but could’ve killed a couple of people” is a fairly different meaning. In retrospect I realize that I put my answer to the second question far too high: if it just means “a deceptively aligned system nearly gives a few people in hospital a fatal dosage, but it’s stopped and we don’t know why the system messed up”, then it’s quite plausible that nothing this substantial will happen as a result.
Agreed. In retrospect, I might have opted for “pre-AGI nearly-deadly accident caused by deceptive alignment.”
I intended the situation to be more like “we catch the AI pretending to be aligned but actually lying, and it kills, or nearly kills, at least a few people as a result.”
With #1, I’m trying to have people predict the “deception is robustly instrumental behavior, but AIs will be bad at it at first and so we’ll catch them” scenario. #2 is trying to operationalize whether this would be viewed as a fire alarm.
Some ways you might think scenario #1 won’t happen:
You don’t think deception will be incentivized
Fast takeoff means the AI is never both smart enough to deceive and dumb enough to get caught
Our transparency tools won’t be good enough for many people to believe it was actually deceptively aligned
Also: we solve alignment really well on paper, and that’s why deception doesn’t arise. (I assign non-trivial probability to this.)