if your source code for X were released in advance
But everyone’s source code is communicated at each round to everyone. What do you mean, “in advance”? What for?
The problem is that we have an exploitable vulnerability here: if your source code for X were released in advance, then other programmers could write a Gödel-numbering for X’s inference module into their own programs in such a way that X has to take a circuitous route to the “genuine” counterfactuals (or so that the genuine ones come back “Unknown”), and in the meantime X will stumble upon exactly the spurious counterfactuals that help Y.
This is very speculative. While an agent can make its predictions about itself as long as it wishes, thereby lengthening the natural counterfactual inference the other agent has to perform, doing so lengthens the spurious counterfactuals as well. Wei asks in another thread:
Having f(L)=L+1 would be sufficient to protect against any “moral arguments” that involve first proving agent()!=a for some a. Do we have any examples of “spurious” proofs that are not of this form?
Unless an agent sabotages itself by deciding arbitrarily, and then observes spurious inferences from that decision, I don’t see how to push spurious inferences in front of the natural ones.
I think this part of the post should be rephrased so that it says we don’t know how to prove that even such a bizarre exploitation strategy is impossible, instead of the current assertion (unless you have background knowledge or a new result suggesting that this conjecture is reasonable).
Edit: Turns out this is indeed possible, if X can be tricked into checking a specifically contrived proof out of order. Here’s a post describing the setup.
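To make the form Wei mentions concrete (a restatement in the notation of Wei’s quote, with PA standing in for whatever proof system the agent uses): a moral argument of that shape first refutes the antecedent of a counterfactual, after which the counterfactual is provable with any consequent at all,

    \mathrm{PA} \vdash \mathrm{agent}() \neq a
    \;\Longrightarrow\;
    \mathrm{PA} \vdash \bigl(\mathrm{agent}() = a \rightarrow U() = u\bigr) \quad \text{for every value } u,

since an implication with a provably false antecedent is provable whatever its consequent. Wei’s f(L)=L+1 condition is meant to rule out exactly arguments of this shape; his question is whether every spurious proof has to go through it.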
But everyone’s source code is communicated at each round to everyone. What do you mean, “in advance”? What for?
It’s easier for me to imagine a programmer writing a program specifically to exploit mine than it is to imagine a program that’s able to recognize an opportunity for such exploitation in general. Just as a Nash equilibrium strategy can’t be exploited even if it’s publicly known in advance, we’d like our decision theories to be secure from exploitation even if their source code were publicly known.
This is very speculative.
I agree; I just thought of it as I was writing the post, and in retrospect I should have asked the decision-theory mailing list before putting it in. But my intuition is that this sort of malice could be possible, using the equivalent of a Henkin sentence. I’ll edit to make it clearer that this is speculation, though.
This sounds like a way of making a natural proof, not a spurious one… Could you explain in more detail?

I had the intuition that one could both “push X away from genuine counterfactuals” using diagonalization and “pull X toward particular spurious counterfactuals” using specially constructed explicit Henkin sentences (which the inference module will find). But I’m more confident that the first half makes sense than that the second half does.

Right, but sentences where? Why does our agent believe them? Can you give a sketch of an analogous example?
The inference module of X will at some point start trying to prove sub-problems about the code of Y if it doesn’t find an easily proved counterfactual. In the source code of Y, in a place where X will apply its inference module, the programmer has included an explicit Henkin sentence (i.e. one that contains the ingredients for its own proof) such that when X applies its inference module, it quickly proves “if X!=a then U=-1000”, which happens to be the sort of thing the NDT agent was looking for. This section of code doesn’t affect Y’s decision except when Y is playing a game of this sort against X.
Again, highly speculative.

In the source code of Y, in a place where X will apply its inference module, the programmer has included an explicit Henkin sentence (i.e. one that contains the ingredients for its own proof) such that when X applies its inference module, it quickly proves “if X!=a then U=-1000”
I still don’t understand what you’re saying. What’s the nature of this “Henkin sentence”; that is, in what way is it “included” in Y’s source code? Is it written in the comments or something? What does it mean for X to “apply its inference module”? My only guess is that you are talking about X enumerating proofs in an order other than by their length, checking the proofs that happen to be lying around in Y’s code just in case, so that including something in Y’s code might push X to consider a proof that Y wants it to consider earlier than some other proof. Is that what you mean?
How would “if X!=a then U=-1000” follow from a Henkin sentence (it obviously can’t make anything it likes true)?
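For reference, the standard definitions being invoked here (nothing specific to this thread): a Henkin sentence is one that asserts its own provability, and Löb’s theorem settles what it does and doesn’t give you,

    H \;\leftrightarrow\; \Box H \qquad \text{(a Henkin sentence: “I am provable”)}

    \text{Löb’s theorem: if } \vdash \Box S \rightarrow S \text{ then } \vdash S; \quad \text{applied to } H \text{, this yields } \vdash H.

So a Henkin sentence is indeed provable, but by itself that only yields H, not an arbitrary statement such as “if X!=a then U=-1000”; making the latter provable requires arranging, as in the construction discussed further down, that its own provability implies its truth.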
The mental image of an inference module that I’m trying to trick is something like a chess program, performing a heuristically biased depth-first search on the space of proofs, based on the source code of the round. If the first path it takes is a garden path to an easy but spurious counterfactual proof, and if the genuine counterfactual is unreachable at the same level (by some diagonalization of X’s proof method), then X should be fooled. The question is whether, knowing X’s source code, there’s a Y which leads X’s heuristics down that garden path.
Since in the end we’ll have X=a, the statement “if X!=a then U=-1000” is true (and provable, since its antecedent is refutable), so NDT should be able to prove it if it finds the right path, and there should be some way of helping NDT find that proof more quickly.
I don’t have a strong intuition for how to fool the “enumerate and check all proofs up to length N” inference module.
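For concreteness, here is a rough sketch of the “enumerate and check all proofs up to length N” module as described above. The helper provable_within, the action list, and the candidate utilities are illustrative stand-ins rather than anything from the post, and a real module would enumerate proofs instead of querying an oracle:

    # Sketch of a naive-decision-theory agent X: enumerate proofs up to a
    # length bound N and act on the best counterfactual "X()=a -> U()=u"
    # that the search turns up.  provable_within is a placeholder for a
    # real proof enumerator over PA plus the source code of X, Y and U.

    ACTIONS = ["a", "b"]
    CANDIDATE_UTILITIES = [-1000, 0, 1, 10]
    MAX_PROOF_LENGTH = 10**6  # the bound N

    def provable_within(statement: str, max_length: int) -> bool:
        """Placeholder: search every proof of length <= max_length for `statement`."""
        return False  # stub so the sketch runs as written

    def ndt_decide() -> str:
        best_action, best_utility = ACTIONS[0], float("-inf")
        for action in ACTIONS:
            for utility in CANDIDATE_UTILITIES:
                counterfactual = f"X()={action} -> U()={utility}"
                if provable_within(counterfactual, MAX_PROOF_LENGTH):
                    if utility > best_utility:
                        best_action, best_utility = action, utility
                    break  # take the first utility proven for this action
        return best_action

    print(ndt_decide())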
Okay. Let’s say we write an X that starts by checking any proof given to it (and accepting its conclusion). How can we then construct a proof of X=a that X can read? It looks like the more X reads, the longer the proof must be, and so there doesn’t exist a proof that X can also read. I don’t see how to construct a counterexample to this property without corrupting X’s inference system, though I imagine some quining trick might work...
Right. That’s why I acknowledge that this is speculative.
If it turns out there’s really no need to worry about spurious counterfactuals for a reasonable inference module, then hooray! But mathematical logic is perverse enough that I expect there’s room for malice if you leave the front door unlocked.
...and Slepnev posted a proof (on the list) that, in my formulation, X can be successfully deceived. It’s not so much a Henkin sentence as a program that enumerates all proofs, looking for a particular spurious counterfactual, and doesn’t give up until it finds it. If the spurious counterfactual is provable, the program will find it, so the agent will be tricked by it, and then the spurious counterfactual will be true. We have an implication from the provability of the spurious argument to its truth, so by Löb’s theorem it’s true, and X is misled. So you were right!
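A rough reconstruction of that construction, with the unbounded proof search stubbed out (find_proof, the action name, and the exact statement S are illustrative placeholders; this is only the shape of the argument, not Slepnev’s exact formulation). The effect of the trick is that X checks one particular statement first, with no length bound, and acts on it if found, which is what makes “provable implies believed implies true” formalizable, so Löb’s theorem applies:

    # Sketch of the trick described above.  X is shown with the contrived
    # check already built in: before its normal bounded search it looks,
    # with no length bound, for a proof of one particular spurious
    # counterfactual S and acts on it if found.  find_proof is a
    # placeholder for an unbounded proof enumerator.

    TARGET_ACTION = "a"                # the action the adversary wants X to take
    S = "X() != a -> U() = -1000"      # the spurious counterfactual

    def find_proof(statement: str):
        """Placeholder: enumerate all proofs, returning a proof of `statement`
        if one exists and never giving up otherwise."""
        return None  # stub so the sketch runs as written

    def X() -> str:
        if find_proof(S) is not None:
            # Acting on S: every action other than "a" appears to yield
            # U = -1000, so X plays "a".  But then "X() != a" is false and
            # S is vacuously true.  Formalizing this branch gives
            # PA |- Provable(S) -> S, so by Löb's theorem PA |- S: the
            # search really does succeed, and X really is steered to "a".
            return TARGET_ACTION
        # Otherwise fall back to the ordinary bounded search
        # (as in the earlier sketch).
        return "whatever the normal bounded search returns"

    print(X())

All the adversary (or a manipulated proof-checking order) has to ensure is that this particular check happens before X reaches the genuine counterfactuals.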
I just wrote up the proof Nesov is talking about, here.