It seems like even amongst proponents of a “fast takeoff”, we will probably have a few months between when we’ve built a superintelligence that appears to have unaligned values and when it is too late to stop it.
At that point, isn’t stopping it a simple matter of building an equivalently powerful superintelligence given the sole goal of destroying the first one?
That almost implies a simple plan for preparation: for every AGI built, researchers agree to also build a parallel AGI with the sole goal of defeating the first one. Perhaps it would remain dormant until its operators indicate it should act. It would have an instrumental goal of protecting users’ ability to come to it and request that the first one be shut down.
Seems useless if the first system convincingly pretends to be aligned (which I think is going to be the norm), so you never end up deploying the second system?
And “defeat the first AGI” seems almost as difficult to formalize correctly as alignment, to me:
One problem is that when the unaligned AGI transmits itself to another system, how do you define it as the same AGI? Is there a way of defining identity that doesn’t leave open a loophole the first one could escape through?
So I’m considering “make the world as if neither of you had ever been made”. That wouldn’t have that problem, but it’s impossible to actually attain this goal, so I don’t know how you’d get it to satisfice over it and then turn itself off afterwards; I’m concerned it would become an endless crusade.
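Trying to state that worry a little more precisely (this is just my own rough notation, not anything standard): a satisficer only has a usable stopping condition if its target is actually reachable.

\[
\text{maximizer: } \pi^{*} \in \arg\max_{\pi} U(\pi)
\qquad
\text{satisficer: pick any } \pi \text{ with } U(\pi) \ge \theta,\ \text{then halt}
\]

If $\sup_{\pi} U(\pi) < \theta$, which is the case when the threshold is “the world is exactly as if neither of us had ever been made”, then no policy ever clears the threshold, the halting condition never fires, and the endless crusade is exactly what the decision rule prescribes.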
One of the first priorities of an AI in a takeoff would be to disable other projects that might generate AGIs. A weakly superintelligent hacker AGI might be able to pull this off well before it was capable of destroying the world. Also, by some people’s guesses, a fast takeoff could take less than months.
And what do you think happens when the second AGI wins and then maximizes the universe for “the other AI was defeated”? Some serious unintended consequences, even if you could specify that goal well.
I think there’s no known way to ask an AI to do “just one thing” without it causing a ton of harm in the meantime.
See this on creating a strawberry safely. Yudkowsky uses the example “[just] burn all GPUs” in his latest post.