This is pretty misleading: that 30% was way more specific than “AI from debate” and really depended on the specific conditional. (Also, right now I’m feeling like it should be higher, maybe 50%.) I had told a very specific story about some code you could write and was then talking about doom from that particular approach. In particular, the scenario ignores two major sources of optimism:
As your model gets more capable, you can notice problems that your alignment scheme (debate in this case) failed to prevent, and fix them
You can do better by having the debaters look at each other’s activations, using cross-examination, providing mechanistic interpretability tools to help the judge, training the judges, etc. (in the story I was putting a probability on, there is just one neural net that produces text that is interpreted as two sides of a conversation)
If you instead ask me to condition on building AGI with debate, I think “oh good, that means we didn’t fail to use alignment techniques entirely” and “oh no, that suggests we couldn’t find anything better to do, probably because timelines were short”, and then my number might be something like 10%.
EDIT: Also relevant is that all of these numbers are misalignment risk only, rather than unconditional doom estimates.
Thank you for pointing this out, Rohin; that’s my mistake. I’ll update the post and the P(doom) table and add an addendum noting that these are misalignment-risk estimates.