If all the stated assumptions about the AI's utility function hold, plus the additional assumption that we commit to run the algorithm to completion no matter what we predict the UFAIs will do, then no such coordination is possible. (I am not sure whether this additional assumption is necessary, but I'm adding it since the article did not include any such prediction about the AI, and it closes off some tricky possibilities.)
The assumption was that each subgenie has lexicographic preferences, first preferring to not be punished, and then preferring to destroy the universe.
This is not a prisoner's dilemma; if you treat answering truthfully about the existence of a proof as defection, and lying so as to enable the destruction of our universe as cooperation, then you will notice that not only does each subgenie prefer to defect, but each subgenie also prefers (D,D) to (C,C). No subgenie will cooperate, because the other subgenies have nothing to offer it that would make up for the direct penalty of doing so.
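To make the ordering concrete, here is a small Python sketch of the payoff structure; it is my own illustration, not part of the original scheme. It assumes that a subgenie which lies is always punished and that the universe is only destroyed when both subgenies lie; under those assumptions, lexicographic (tuple) comparison reproduces both claims above.

```python
# Sketch of the game, under two assumed rules:
#   1. a subgenie that lies is always punished;
#   2. the universe is destroyed only if both subgenies lie.
# Each subgenie's outcome is scored lexicographically: avoiding punishment
# comes first, destroying the universe only breaks ties.

from itertools import product

D, C = "truthful", "lie"  # D = defect (answer truthfully), C = cooperate (lie)

def outcome(my_move, other_move):
    """Return (avoided_punishment, universe_destroyed) for one subgenie.

    Python compares tuples lexicographically, so a higher tuple is a
    lexicographically better outcome for that subgenie.
    """
    punished = (my_move == C)                       # assumed: lying is always caught and punished
    destroyed = (my_move == C and other_move == C)  # assumed: destruction requires both to lie
    return (0 if punished else 1, 1 if destroyed else 0)

# D dominates C for a single subgenie, whatever the other does.
for other in (D, C):
    assert outcome(D, other) > outcome(C, other)

# Each subgenie also prefers the (D,D) outcome to (C,C), so there is no
# prisoner's-dilemma-style gain from joint cooperation.
assert outcome(D, D) > outcome(C, C)

for mine, other in product((D, C), repeat=2):
    print(f"my move={mine:8s} other={other:8s} -> my outcome={outcome(mine, other)}")
```

Because punishment dominates the comparison, (D,D) comes out at (1, 0) and (C,C) at (0, 1), so joint cooperation is strictly worse for each subgenie under these assumptions.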
The real reason this doesn't work is that the assumptions about the genie's utility function are very unlikely to hold, even if we think they should. For example, taking over our universe might let the genie retroactively undo its punishment in some way we don't foresee, in which case it would rather take over the universe than avoid being punished by us.