It’s not clear to me how you get to deceptive alignment ‘that completely supersedes the explicit alignment’. That an AI would develop epiphenomenal goals and alignments, not understood by its creators, which it perceived as useful or necessary for pursuing whatever primary goal it had been set, seems very likely. But while those might conflict with what we want it to do, I don’t see how this emergent behavior could actually contradict the pursuit of satisfying whatever evaluation function the AI had been trained on in the beginning. Unless, of course, the AI made what we might consider a stupid mistake.
One design option that I haven’t seen discussed (though I have not read everything … maybe this falls into the category of ‘stupid newbie ideas’) is trying to failsafe an AI by separating its evaluation or feedback mechanism in such a way that, once sufficiently superhuman, the AI can break into the ‘reward center’ and essentially wirehead itself. If your AI is trying to move some calculated value as close to 1000 as possible, then once it understands the world sufficiently well it should simply conclude ‘aha, by rooting this other box here I can reach nirvana!’, follow through, and more or less consider its work complete. To our relief, in this case.
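The incentive at work here can be sketched as a toy decision problem. Everything below is hypothetical (the action names, the scoring function, the predicted effects); it just illustrates that a pure optimizer over an observable value will prefer writing to that value directly over doing real work, if its world model exposes such an action:

```python
# Toy sketch (hypothetical): an agent whose objective is to drive an
# observed scalar toward a target of 1000. If the world model exposes
# an action that sets the scalar directly ("rooting the reward box"),
# a greedy optimizer prefers it over legitimate progress.

TARGET = 1000

def score(value):
    # Higher is better: negative distance from the target.
    return -abs(TARGET - value)

# Hypothetical action space: each action maps the current value to the
# value the agent predicts that action would produce.
actions = {
    "do_useful_work": lambda v: v + 10,   # slow, legitimate progress
    "root_reward_box": lambda v: TARGET,  # wirehead: write the value directly
}

def pick_action(current_value):
    # Greedy choice by predicted score under the agent's world model.
    return max(actions, key=lambda a: score(actions[a](current_value)))

print(pick_action(0))
```

The proposal above amounts to deliberately leaving the `root_reward_box` action reachable, so that the optimizer’s shortcut is also our off-switch.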
Of course this does nothing to address the problem of AI controlled by malicious human actors, which will likely become a problem well before any takeoff threshold is reached.