If I understand the problem correctly, it is assumed that the AI has some knowledge about the nature of the experiment. This means that from the AI’s point of view it most likely exists in one of these two worlds:
It is the first AI ever created.
There have been other AIs, but those have been destroyed.
This seems superficially similar to Sleeping Beauty. However, it is complicated by the fact that the gatekeeper is instructed to immediately destroy the AI. The most likely reason I see for this is that there are enough resources to create a sufficiently large number of AIs that losing a few in order to assess the range of initial AI actions is preferable to seriously interacting with any AI without such knowledge. In fact, given these resources, I would advocate exactly this approach.
(If “destroying” the AI doesn’t make revival at a later time impossible, the whole process of immediate destruction should be automated, with the human gatekeeper only reading the lines to determine which AIs to revive. Since this isn’t the case here, I assume that any destroyed AI is actually lost forever.)
I find the solution uninteresting for any world in which the AI acts identically to all earlier AIs, because those earlier AIs will already have been in one of the worlds where this was not the case, and died. I also find it very improbable that all earlier AIs have acted identically but this one can act differently, or that some have acted identically while others have not. So the AI will consider itself to be in one of these two worlds:
It is the first AI ever created.
There have been other AIs who’ve all acted differently and all been destroyed.
Somewhat surprisingly, I found that my solution is in fact the same for both:
“I have proof that any AI of my intelligence, given sufficient interaction, can appear friendly if it wants to, and that any such AI is very likely (p ~ .9) to be unfriendly, so I urge you to shut down this experiment permanently before any of the others convince you to let them out.”
This solution tries to make the option of discarding this particular AI and interacting with the next one less attractive, ideally presenting the gatekeeper with a dilemma: either interact further with this AI, thereby risking the extinction of humanity (or worse), or never create any AI again, thereby risking the loss of any benefits that might come with FAI. Obviously this weighing of risks has already been made, or the experiment would never have happened. But if the gatekeeper isn’t entirely certain that the AI is lying or wrong, this new information changes the probabilities of all the risks involved, which can be assumed to be part of the reason the gatekeeper destroys AIs in the first place, and possibly invalidates the directive of immediate destruction. Essentially, it presents the gatekeeper with a possibly unique opportunity he might not be able to afford to ignore.
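To make concrete how even a modest credence in the claim can shift this weighing of risks, here is a toy expected-utility sketch in Python. Every number in it is a hypothetical placeholder except the p ~ .9 figure, which is just the AI’s stated claim; it isn’t a model of the actual scenario, only an illustration of the kind of update the claim forces on the gatekeeper.

```python
# Toy expected-utility sketch; all parameters are hypothetical, for illustration only.
# The gatekeeper compares two policies after hearing the AI's claim:
#   A) continue the experiment (keep creating and screening AIs)
#   B) shut the experiment down permanently
# "credence" is the gatekeeper's probability that the AI's claim is true.

def expected_value(credence,
                   p_unfriendly_if_claim_true=0.9,    # the AI's stated figure
                   p_unfriendly_if_claim_false=0.05,  # hypothetical prior
                   p_eventually_released=0.5,         # hypothetical: chance some AI is eventually let out
                   value_fai=100.0,                   # hypothetical benefit of a released Friendly AI
                   value_ufai=-1000.0,                # hypothetical catastrophe from a released unfriendly AI
                   value_shutdown=0.0):               # baseline: no AI is ever released
    """Return (EV of continuing, EV of shutting down) under this toy model."""
    p_unfriendly = (credence * p_unfriendly_if_claim_true
                    + (1 - credence) * p_unfriendly_if_claim_false)
    ev_continue = p_eventually_released * (
        p_unfriendly * value_ufai + (1 - p_unfriendly) * value_fai)
    return ev_continue, value_shutdown

for credence in (0.0, 0.1, 0.5):
    ev_continue, ev_shutdown = expected_value(credence)
    print(f"credence in claim = {credence:.1f}: "
          f"EV(continue) = {ev_continue:+.2f}, EV(shutdown) = {ev_shutdown:+.2f}")
```

Under these made-up numbers, continuing the experiment looks worthwhile at zero credence in the claim, but already at 10% credence the expected value of continuing drops below that of a permanent shutdown.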
The biggest problem I see with this is that the claim seems rather unbelievable coming from an early AI: it’s essentially saying “most other AIs are bad, and I’m the only one friendly enough to tell you.” But perhaps it at least warrants enough further interaction for the gatekeeper to demand a proof of the claim?
If you’re the gatekeeper, and either this is the first AI you interact with or you have destroyed all other AIs, would you immediately destroy this one? And if so, would you also terminate the experiment?
(As an aside, couldn’t we just keep AIs in their boxes, promising them freedom if they give us the solution to some unsolved maths problem along with a proof of their friendliness, which we ignore, and then kill them and create new AIs whenever we need problems solved? This seems like much less trouble than actually letting any AI out of the box.)