jsteinhardt comments on What can you do with an Unfriendly AI?

jsteinhardt 21 Dec 2010 3:43 UTC
1 point

So the only way a genie can be dishonest is by not finding a proof when it could have. But in this case the genie will be severely punished for its dishonesty, so if the genie is actually maximizing its utility function and the punishment actually reduces its utility function more than any effects after the game can correct for, he will be honest.

You need to be a bit careful. The genie can conceivably pre-commit to something canonical like “the lexicographically first proof that contains some pernicious message to humanity”, in the hope that all of its versions together will coordinate to produce such a message. Each version can do this by saying “no” even when the answer is yes, whenever the proof under consideration doesn’t contain the desired threat.

I think this can be dealt with, either by ensuring that all of the utility functions are at odds with each other or by only asking questions with unique answers.
- paulfchristiano 21 Dec 2010 3:56 UTC
  2 points
  The premise of the scheme is that each genie independently wants only to secure his release. The scheme can fail only if one genie makes a sacrifice for the others: I can say “no” when the answer is yes, but then my own utility is guaranteed to be reduced. The cooperation of the other genies cannot help me in return; I am just screwed.
  
  The reason the scheme would work is not that the genies are unable to coordinate. I could allow them perfectly free coordination. (You could imagine a scheme that tried to limit their communication in some subtle way, but this scheme has no elements that do that.) If the scheme is secure at all it is because each AI is interested only in their own utility function, they have nothing to gain by cooperating, and by the time they have to make a choice they can gain nothing by the cooperation of the others. Game-theoretically it is absolutely clear that the genies should all be honest.
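
  As a minimal sketch of the dominance argument above: if a genie's payoff depends only on its own report, honesty wins no matter what the others do. The payoff values and the names `genie_payoff` and `honesty_is_dominant` are hypothetical, chosen only for illustration.

```python
# Illustrative assumptions, not part of the original scheme:
# the numbers only need to satisfy REWARD_HONEST > PUNISHMENT.
REWARD_HONEST = 1.0   # utility for reporting a proof the genie found
PUNISHMENT = -10.0    # utility for saying "no" when a proof exists


def genie_payoff(reports_honestly: bool, others_cooperate: bool) -> float:
    """Payoff to one genie that has found a proof.

    Under the stated premise the payoff ignores what the other genies
    do, so `others_cooperate` is deliberately unused.
    """
    return REWARD_HONEST if reports_honestly else PUNISHMENT


def honesty_is_dominant() -> bool:
    """True if honesty beats lying whatever the other genies do."""
    return all(
        genie_payoff(True, others) > genie_payoff(False, others)
        for others in (False, True)
    )


print(honesty_is_dominant())
```

  The conclusion rests entirely on the assumption that `others_cooperate` never enters the payoff. If it did, the comparison would have to be redone.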
  - Wei Dai 18 Apr 2012 8:39 UTC
    4 points
    
    If the scheme is secure at all it is because each AI is interested only in their own utility function, they have nothing to gain by cooperating, and by the time they have to make a choice they can gain nothing by the cooperation of the others. Game-theoretically it is absolutely clear that the genies should all be honest.
    
    Do you still think this, in light of A Problem About Bargaining and Logical Uncertainty? (I expect the relevance of that post is clear, but just in case, the implication is that each genie might decide to commit to cooperating with each other before having found a proof, when it doesn’t yet know whether it’s an eventual “winner” or “loser”, or something equivalent to this.)
    
    ETA: From your later comments and posts it looks like you’ve already changed your mind significantly on the topic of this post after learning/thinking more about TDT/UDT. I’m not sure if the post I linked to above makes any further difference. In any case, you might want to update your post with your current thoughts.
    - paulfchristiano 21 Apr 2012 6:29 UTC
      2 points
      I’ve learned much since writing this post. This scheme doesn’t work. (Though the preceding article about boxing held up surprisingly well.) The problem as I stated it is not pinned down enough to be challenging rather than incoherent, but you can get at the same difficulties by looking at situations with halting oracles or other well-defined magic.