Stuart_Armstrong comments on Troll Bridge

Stuart_Armstrong Oct 1, 2019, 3:05 PM
LW: 2 AF: 1
AF
Interesting.

I have two issues with the reasoning as presented; the second one is more important.

First of all, I’m unsure about “Rather, the point is that the agent’s “counterfactual” reasoning looks crazy.” I think we don’t know the agent’s counterfactual reasoning. We know, by Löb’s theorem, that “there exists a proof that (proof of L implies L)” implies “there exists a proof of L”. It doesn’t tell us what structure this proof of L has to take, right? Who knows what counterfactuals are being considered to make that proof? (I may be misunderstanding this).

Second of all, it seems that if we change the last line of the agent to [else, “cross”], the argument fails. Same if we insert [else if A()=”cross” ⊢ U=-10, then output “cross”; else if A()=”not cross” ⊢ U=-10, then output “not cross”] above the last line. In both cases, this is because U=-10 is now possible, given crossing. I’m suspicious when the argument seems to depend so much on the structure of the agent.

To develop that a bit, it seems the agent’s algorithm as written implies “If I cross the bridge, I am consistent” (because U=-10 is not an option). If we modify the algorithm as I just suggested, then that’s no longer the case; it can consider counterfactuals where it crosses the bridge and is inconsistent (or, at least, of unknown consistency). So, given that, the agent’s counterfactual reasoning no longer seems so crazy, even if it’s as claimed. That’s because the agent’s reasoning needs to deduce something from “If I cross the bridge, I am consistent” that it can’t deduce without that. Given that statement, then being Löbian or similar seems quite natural, as those are some of the few ways of dealing with statements of that type.
What links here?
- Troll Bridge by abramdemski (Aug 23, 2019, 6:36 PM; 86 points)
- abramdemski Oct 2, 2019, 9:04 PM
  LW: 2 AF: 1
  AF Parent
  First of all, I’m unsure about “Rather, the point is that the agent’s “counterfactual” reasoning looks crazy.” I think we don’t know the agent’s counterfactual reasoning. We know, by Löb’s theorem, that “there exists a proof that (proof of L implies L)” implies “there exists a proof of L”. It doesn’t tell us what structure this proof of L has to take, right? Who knows what counterfactuals are being considered to make that proof? (I may be misunderstanding this).
  The agent as described is using provable consequences of actions to make decisions. So, it is using provable consequences as counterfactuals. At least, that’s the sense in which I mean it—forgive my terminology if this doesn’t make sense to you. I could have said “the agent’s conditional reasoning” or “the agent’s consequentialist reasoning” etc.
  Second of all, it seems that if we change the last line of the agent to [else, “cross”], the argument fails. Same if we insert [else if A()=”cross” ⊢ U=-10, then output “cross”; else if A()=”not cross” ⊢ U=-10, then output “not cross”] above the last line. In both cases, this is because U=-10 is now possible, given crossing. I’m suspicious when the argument seems to depend so much on the structure of the agent.
  I think the benefit of the doubt does not go to the agent, here. If the agent’s reasoning is sensible only under certain settings of the default action clause, then one would want a story about how to set the default action clause ahead of time so that we can ensure that the agent’s reasoning is always sensible. But this seems impossible.
  However, I furthermore think the example can be modified to make the agent’s reasoning look silly in any case.
  If the last line of the agent reads “cross” rather than “not cross”, I think we can recover the argument by changing what the troll is doing. The general pattern is supposed to be: the troll blows up the bridge if we cross “for a dumb reason”—where “dumb” is targeted at agents who do anything analogous to epsilon exploration or the chicken rule.
  So, we modify the troll to blow up the bridge if PA is inconsistent OR if the agent reaches its “else” clause.
  The agent can no longer be accused of reasoning as if it controlled PA. However, it still sees the bridge blowing up as a definite consequence of its crossing the bridge. This still doesn’t seem sensible, because its very reason for believing this involves a circular supposition that it’s provable—much like my chess example where an agent concludes that an action is poor due to a probabilistic belief that it would be a poor player if it took that sort of action.
  [Note that I have not checked the proof for the proposed variant in detail, so, could be wrong.]
  - Stuart_Armstrong Oct 3, 2019, 1:32 PM
    LW: 2 AF: 1
    AF Parent
    If the agent’s reasoning is sensible only under certain settings of the default action clause
    
    That was my first rewriting; the second is an example of a more general algorithm that would go something like this. If we assume that both probabilities and utilities are discrete, all of the form q/n for some q, and bounded above and below by N, then something like this (for EU the expected utility, and Actions the set of actions, and b some default action):
    
    for q integer in N*n^2 to -N*n^2 (ordered from highest to lowest): for a in Actions: if A()=a ⊢ EU=q/n^2 then output a else output b
    
    Then the Löbian proof fails. The agent will fail to prove any of those “if” implications, until it proves “A()=”not cross” ⊢ EU=0″. Then it outputs “not cross”; the default action b is not relevant. Also not relevant, here, is the order in which a is sampled from “Actions”.
    - abramdemski Oct 4, 2019, 11:25 PM
      LW: 3 AF: 2
      AF Parent
      I don’t see why the proof fails here; it seems to go essentially as usual.
      
      Reasoning in PA:
      
      Suppose a=cross->u=-10 were provable.
      
      Further suppose a=cross.
      
      Note that we can see there’s a proof that not crossing gets 0, so it must be that a better (or equal) value was found for crossing, which must have been +10 unless PA is inconsistent, since crossing implies that u is either +10 or −10. Since we already assumed crossing gets −10, this leads to trouble in the usual way, and the proof proceeds from there.
      
      (Actually, I guess everything is a little messier since you haven’t stipulated the search order for actions, so we have to examine more cases. Carrying out some more detailed reasoning: So (under our assumptions) we know PA must have proved (a=cross → u=10). But we already supposed that it proves (a=cross → u=-10). So PA must prove not(a=cross). But then it must prove prove(not(a=cross)), since PA has self-knowledge of proofs. Which means PA can’t prove that it’ll cross, if PA is to be consistent. But it knows that proving it doesn’t take an action makes that action very appealing; so it knows it would cross, unless not crossing was equally appealing. But it can prove that not crossing nets zero. So the only way for not crossing to be equally appealing is for PA to also prove not crossing nets 10. For this to be consistent, PA has to prove that the agent doesn’t not-cross. But now we have PA proving that the agent doesn’t cross and also that it doesn’t not cross! So PA must be inconsistent under our assumptions. The rest of the proof continues as usual.)
      - Stuart_Armstrong Oct 13, 2019, 2:48 PM
        LW: 3 AF: 2
        AF Parent
        You are entirely correct; I don’t know why I was confused.
        
        However, looking at the proof again, it seems there might be a potential hole. You use Löb’s theorem within an assumption sub-loop. This seems to assume that from “ $A \to (□ B \to B)$ ”, we can deduce “ $A \to B$ ”.
        
        But this cannot be true in general! To see this, set $A = □ B \to B$ . Then $A \to (□ B \to B)$ , trivially; if, from that, we could deduce $A \to B$ , we would have $(□ B \to B) \to B$ for any $B$ . But this statement, though it looks like Löb’s theorem, is one that we cannot deduce in general (see Eliezer’s “medium-hard problem” here).
        
        Can this hole be patched?
        
        (note that if $A \to (□_{A} B \to B)$ , where $□_{A}$ is a PA proof that adds A as an extra axiom, then we can deduce $A \to B$ ).