Consider someone who consistently gives each new AI release the instruction “become superintelligent and then destroy humanity”. This is not the control problem, but surely doing this would manifest x-risk behaviour at least somewhat earlier than giving it innocuous instructions?
I think this failure mode would arise extremely close in time to ordinary AI risk; I don’t think that solving this particular failure mode, while keeping everything else the same, buys you significantly more time to solve the control problem.