Seems useless if the first system convincingly pretends to be aligned (which I think is going to be the norm), so you never end up deploying the second system?
And “defeat the first AGI” seems almost as difficult to formalize correctly as alignment, to me:
One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn't leave open a loophole the first one could escape through?
So I'm considering "make the world as if neither of you had ever been made". That goal wouldn't have the identity problem, but it's also impossible to actually attain, so I don't know how you'd get the second system to satisfice over it and then turn itself off afterwards; I'm concerned it would become an endless crusade.