I agree that’s a possible way things could be. However, I don’t see how it’s compatible with accepting the arguments that say we should assume alignment is a hard problem. I mean, absent such arguments, why expect that you have to do anything special beyond normal training to solve alignment?
As I see the argumentative landscape, the high x-risk estimates depend on arguments that claim to give us reason to believe that alignment is just a generally hard problem. I don’t see anything in those arguments that distinguishes between these two cases.
In other words, our arguments for alignment difficulty don’t depend on any specific assumptions about the capability of the intelligence involved, so we should currently assign the same probability to an AI being unable to solve its alignment problem as we do to us being unable to solve ours.
This is best understood in terms of models. Prov doesn’t assert that something is provable in our intuitive sense; it asserts that there is a number encoding a proof. So what “PA |- Prov(phi) → phi” really asserts is that any structure satisfying the axioms of PA must believe that phi is true whenever it believes there is a number coding a proof of phi.
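One standard way to make that precise, via the soundness and completeness theorems for first-order logic (with ⌜phi⌝ standing for the numeral coding phi):

$$
PA \vdash \mathrm{Prov}(\ulcorner \varphi \urcorner) \to \varphi
\quad\Longleftrightarrow\quad
\mathcal{M} \models \mathrm{Prov}(\ulcorner \varphi \urcorner) \to \varphi \ \text{ for every model } \mathcal{M} \models PA
$$

The right-hand side is exactly the model-theoretic reading: every structure satisfying PA that believes some number codes a proof of phi must also make phi true.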
But there are lots of structures other than the usual non-negative integers that satisfy PA, including ones with infinite ‘numbers.’ In fact, any sentence consistent with the axioms of PA is going to be true in some model of PA, and if that sentence is false of the ordinary natural numbers, the model has to be one of these non-standard ones.
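If it helps, here is the usual compactness sketch of why such structures exist; the constant c below is a fresh symbol added only for the argument:

$$
T \;=\; PA \;\cup\; \{\, c > \underline{0},\ \ c > \underline{1},\ \ c > \underline{2},\ \dots \,\}
$$

Every finite subset of T holds in the ordinary natural numbers (just interpret c as a large enough number), so by compactness T has a model, and in that model c names an element above every standard numeral, i.e. an infinite ‘number.’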
So what this is really saying is that if phi isn’t a consequence of PA, then there is some structure satisfying PA which contains a fake ‘infinite proof’ of phi but in which phi isn’t true. No paradoxes at all.
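Spelled out, using Löb’s theorem (PA proves Prov(phi) → phi only when it already proves phi) together with the completeness theorem:

$$
PA \nvdash \varphi
\;\Longrightarrow\;
PA \nvdash \mathrm{Prov}(\ulcorner \varphi \urcorner) \to \varphi
\;\Longrightarrow\;
PA \cup \{\mathrm{Prov}(\ulcorner \varphi \urcorner),\ \neg\varphi\} \ \text{has a model}
$$

And since PA refutes “n codes a proof of phi” for each ordinary number n (there is no actual proof of phi), the witness to Prov in that model has to be a non-standard number: the fake ‘infinite proof.’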