First problem: a lot of future gains may come from RL-style self-play (i.e., letting the AI play around solving open-ended problems). That's not safe in the way you outline above.
Still, offline learning is very useful, and so long as you do enough offline learning, you don't run into problems in the online learning phase.
Next, jailbreaking. I'll admit this isn't something I initially covered. But if we grant that alignment is achievable, and the only remaining question is whether alignment is stable, then in my model we've already won almost all the value, since my threat model is closer to "We want good, capable AGI, but we can't get it because aligning it is very difficult."
So I think alignment was the load-bearing part of my model, and thus we have a much lower p(Doom), more like a 0.1-10% probability.