I don’t mean alignment with human concerns. I mean that the AI itself is engaged in the same project we are: building a system smarter than itself. So if it’s hard for us to control the alignment of such a system, it should be hard for the AI as well. (In theory you could imagine it’s only hard at our specific level of intelligence, but in fact all the arguments that AI alignment is hard seem to apply equally well to the AI building an improved AI as to us building an AI.)
See my reply above. The AI x-risk arguments require the assumption that superintelligence necessarily entails the agent trying to optimize some simple utility function (this is different from orthogonality, which says that increasing intelligence doesn’t cause convergence to any particular utility function). So the “doesn’t care” option is off the table, since (by orthogonality) it’s very unlikely you get the one utility function that says just maximize intelligence locally (even a global maximizer isn’t enough, because some child AI with different goals could interfere).
I agree that’s a possible way things could be. However, I don’t see how it’s compatible with accepting the arguments that say we should assume alignment is a hard problem. Absent such arguments, why expect you’d have to do anything special beyond normal training to solve alignment?
As I see the argumentative landscape, the high x-risk estimates depend on arguments that claim to give us reason to believe alignment is just a generally hard problem. I don’t see anything in those arguments that distinguishes between these two cases.
In other words, our arguments for alignment difficulty don’t depend on any specific assumptions about the capability of the intelligence involved, so we should currently assign the same probability to an AI being unable to solve its alignment problem as we do to us being unable to solve ours.