Vladimir_Nesov comments on What can you do with an Unfriendly AI?

Vladimir_Nesov 20 Dec 2010 21:37 UTC
4 points
Then just get the answer straight away, or build a system specifically intended to create a reliable incentive structure for getting a probably honest answer. I object to the particular method of security by obscurity.
- jimrandomh 20 Dec 2010 21:44 UTC
  3 points
  Parent
  The point is to prevent it from optimizing its answer for secondary characteristics besides correctness, when multiple answers are possible. If you “just get the answer straight away”, it chooses among all the possible answers one that is correct, honest, and maximally poisonous to you.
  - Vladimir_Nesov 20 Dec 2010 21:50 UTC
    2 points
    Parent
    So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result.
    
    There are two conflicting problems in this particular case. If the measurements are supposed to be independent, so that multiple data points actually reinforce each other and many questions are better than one question, then you can’t expect the different genies to be talking about the same answer, and so can’t elicit it in the manner described in the post. Conversely, if you can in fact rely on all those genies talking about the same answer, then you can also rely on all your measurements not being independent, and your answer will be picked according to exactly the same criteria as if it had been asked for straight away.
    - jimrandomh 20 Dec 2010 22:05 UTC
      0 points
      Parent
      
      So basically, the supposed law behind the method is in getting multiple correctness measurements, which together are expected to strengthen the estimation of correctness of the result.
      
      What? No, that’s not paulfchristiano’s proposal at all. The reason for splitting up the measurement is to incentivize the genies to give you the proof which is lexicographically first. A perfect proof checker is assumed, so you know that if a proof is given, the result is correct; and the incentives are such that if the genie can find a proof, it will give it.
      - Vladimir_Nesov 20 Dec 2010 22:25 UTC
        1 point
        Parent
        What reason acts to prevent the genie from giving a proof that is not the lexicographically smallest it can find?
        jimrandomh 20 Dec 2010 22:31 UTC
        0 points
        Parent
        If it did that, then there would be at least one subgenie that didn’t optimize its utility function—either one reporting no such proof of length n and getting punished when it could have provided a matching proof, or one reporting no such proof of length n and prefix p when it could have provided one and, again, getting punished. Remember, while each invocation (subgenie) has the same utility function, by assumption that function refers to something that can be done to the particular invocation which optimizes it.
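Editorial sketch: the query structure described in this comment (one subgenie per question of the form "is there a proof of length n with prefix p?") can be illustrated roughly as follows. This is only a toy under stated assumptions, not the scheme as specified in the post: proofs are modeled as bitstrings, `toy_check` stands in for the perfect proof checker, and `honest_genie` is a hypothetical brute-force stand-in for a subgenie that always reports a proof when one exists. All names are made up for illustration.

```python
from itertools import product
from typing import Callable, Optional

Checker = Callable[[str], bool]
Genie = Callable[[Checker, int, str], Optional[str]]


def toy_check(proof: str) -> bool:
    """Hypothetical stand-in for the perfect proof checker."""
    return len(proof) >= 3 and proof.count("1") == 2 and proof.endswith("1")


def honest_genie(check: Checker, n: int, prefix: str) -> Optional[str]:
    """Toy subgenie: return a valid proof of length n with this prefix, or None."""
    for tail in product("01", repeat=n - len(prefix)):
        candidate = prefix + "".join(tail)
        if check(candidate):
            return candidate
    return None


def ask(genie: Genie, check: Checker, n: int, prefix: str) -> bool:
    """A 'yes' only counts if it comes with a matching proof the checker accepts."""
    proof = genie(check, n, prefix)
    return (
        proof is not None
        and len(proof) == n
        and proof.startswith(prefix)
        and check(proof)
    )


def first_proof(genie: Genie, check: Checker, max_len: int) -> Optional[str]:
    """Recover the shortest, lexicographically first proof from yes/no queries."""
    for n in range(1, max_len + 1):
        if not ask(genie, check, n, ""):
            continue  # no verified proof of this length
        prefix = ""
        while len(prefix) < n:
            # Each question goes to a separate invocation (subgenie).
            prefix += "0" if ask(genie, check, n, prefix + "0") else "1"
        # Re-check the assembled string, in case some "no" answer was a lie.
        return prefix if check(prefix) else None
    return None


result = first_proof(honest_genie, toy_check, 5)
```

With honest answers, each bit of the result is pinned down by a yes/no question, which is the sense in which no single invocation gets to choose among several valid proofs. Whether real genies would answer each question independently is exactly what the rest of this thread disputes.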
        Vladimir_Nesov 20 Dec 2010 22:34 UTC
        6 points
        Parent
        
        If it did that, then there would be at least one subgenie that didn’t optimize its utility function
        
        If there is any genie-value to obtaining control over the real world, instances of genies will coordinate their decisions to get it.
        
        You can’t fight superintelligent genies with explicit dependence bias. They are not independent, they can coordinate, and so they will, even if you can frame the problem statement in a way that suggests that they can’t communicate. Worse, as I pointed out, in this particular game they must coordinate to get anything done.
        jimrandomh 20 Dec 2010 22:39 UTC
        2 points
        Parent
        You are arguing that the stated assumptions about the genie’s utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.
        Vladimir_Nesov 20 Dec 2010 22:56 UTC
        0 points
        Parent
        
        You are arguing that the stated assumptions about the genie’s utility function are unrealistic (which may be true), but presenting it as though you had found a flaw in the proof that follows from those assumptions.
        
        It seems like the assumptions about utility, even if they hold, don’t actually deliver the behavior you expect, because the genies can coordinate. Unless the incentive structure makes sure they won’t try to take over the world in any case, it doesn’t make sure that they won’t try to take over the world if you only ask each of them to answer one binary question either.
        
        Think of the genies as making a single decision that results in all the individual actions of all the individual genies. For the genies, having multiple actors just means raising the stakes by threatening to not free more genies, which you could’ve done for a single-question wish as easily instead of creating an elaborate questioning scheme. You could even let them explicitly discuss what to answer to your question, threatening to terminate them all if examination of their answer reveals any incorrectness!
        
        Edit: See also Eliezer’s comment.
- paulfchristiano 20 Dec 2010 21:38 UTC
  0 points
  Parent
  There is absolutely no sense in which this scheme is security by obscurity. My claim is that the genie will respect my wishes even though he knows exactly what I am doing, because he values my generosity right now more than the promise of taking over the world later.
  - Vladimir_Nesov 20 Dec 2010 21:44 UTC
    4 points
    Parent
    Again, if your genie is already incentivized to be honest, in what sense is your scheme with all its bells and whistles better than asking for the shortest answer the genie can find, in plain English?
    - paulfchristiano 20 Dec 2010 22:27 UTC
      2 points
      Parent
      It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.
      - Vladimir_Nesov 20 Dec 2010 22:31 UTC
        0 points
        Parent
        
        It is not magically incentivized to be honest. It is incentivized to be honest because each query is constructed precisely such that an honest answer is the rational thing to do, under relatively weak assumptions about its utility function. If you ask in plain English, you would actually need magic to produce the right incentives.
        
        My question is about the difference. Why exactly is the plain question different from your scheme?
        
        (Clearly your position is that your scheme works, and therefore “doesn’t assume any magic”, while the absence of your scheme doesn’t, and so “requires magic in order to work”. You haven’t told me anything I don’t already know, so it doesn’t help.)
        paulfchristiano 21 Dec 2010 1:10 UTC
        0 points
        Parent
        Here is the argument in the post more concisely. Hopefully this helps:
        
        It is impossible to lie and say “I was able to find a proof,” by the construction of the verifier: if you claim you were able to find a proof, the verifier needs to see the proof to believe you. So the only way you can lie is by saying “I was not able to find a proof” when you could have found one if you had really tried. So incentivizing the AI to be honest is precisely the same as incentivizing it to avoid admitting “I was not able to find a proof.” Providing such an incentive is not trivial, but it is basically the easiest possible incentive to provide.
        
        I know of no way to incentivize someone to answer the plain question easily just based on your ability to punish them or reward them when you choose to. Being able to punish them for lying involves being able to tell when they are lying.
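Editorial sketch: the asymmetry argued for in this comment can be written down as a toy payoff rule. The payoff values and the palindrome "checker" below are assumptions chosen purely for illustration; the comment itself only says that a claimed proof must be shown to the verifier and that "no proof found" is the answer to be discouraged.

```python
from typing import Callable, Optional


def toy_check(proof: str) -> bool:
    """Hypothetical stand-in for the verifier: accepts non-empty palindromes."""
    return len(proof) > 0 and proof == proof[::-1]


def settle(check: Callable[[str], bool], claimed_proof: Optional[str]) -> int:
    """Return an illustrative payoff for a single report."""
    if claimed_proof is None:
        # "I was not able to find a proof": cannot be verified, so it earns nothing.
        return 0
    if check(claimed_proof):
        # A proof the verifier accepts is the only report that is rewarded.
        return 1
    # A bogus proof fails the checker, so a false "yes" gains nothing either.
    return 0


verified = settle(toy_check, "0110")
declined = settle(toy_check, None)
bogus = settle(toy_check, "01")
```

Under these assumed payoffs, a false "yes" gains nothing, and an invocation that could produce a proof does strictly better by producing it than by saying "no". That is the claimed incentive toward honesty, and it says nothing about whether the invocations coordinate.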
        Vladimir_Nesov 21 Dec 2010 2:04 UTC
        −2 points
        Parent
        See this comment.
    - [deleted] 20 Dec 2010 22:41 UTC
      0 points
      Parent
      Are you saying you think Christiano’s scheme is overkill? Presumably we don’t have to sacrifice a virgin in order to summon a new genie, so it doesn’t look expensive enough to matter.
      - Vladimir_Nesov 20 Dec 2010 22:51 UTC
        2 points
        Parent
        I’m saying that it’s not clear why his scheme is supposed to add security, and it looks like it doesn’t. If it does, we should understand why, and optimize that property instead of using the scheme straight away, and if it doesn’t, there is no reason to use it. Either way, there is at least one more step to be made. (In this manner, it could work as raw material for new lines of thought where we run out of ideas, for example.)
    - Perplexed 20 Dec 2010 22:11 UTC
      0 points
      Parent
      As I understand it, the genie is not incentivized to be honest. It is incentivized to not get caught being dishonest. And the reason for the roundabout way of asking the question is to make the answer-channel bandwidth as narrow as possible.
      - paulfchristiano 20 Dec 2010 22:30 UTC
        2 points
        Parent
        It is impossible to be dishonest by saying “yes,” by construction. The genie is incentivized to say “yes” whenever possible, so it is disincentivized to be dishonest by saying “no.” So the genie is incentivized to be honest, not just to avoid being called out for dishonesty.
      - Vladimir_Nesov 20 Dec 2010 22:27 UTC
        0 points
        Parent
        
        As I understand it, the genie is not incentivized to be honest. It is incentivized to not get caught being dishonest.
        
        Since we care about the genie actually being honest, the technique can be thought of as a way of making it more likely that the genie is honest, with the threat of punishing dishonesty as a component of that technique.