Suppose it expects to face a TDT agent in the future. Whether that agent will play C or D against it is independent of what it decides now.
Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you. In that case, “modify to play C only if I expect that other agent simulating me to play C iff I modify to play C” is a superior strategy to “just D”: an agent who simulates you making the former choice (and which expects to be correctly simulated itself) will play C against you, while if it simulates you making the latter choice it will play D against you.
If it does self-modify into TDT, then it might play C against the other TDT where it otherwise would have played D, and since the payoff for C is lower than for D, holding the other player’s choice constant, it will decide not to self-modify into TDT.
The whole point is that the other player’s choice is not constant. Otherwise there is no reason ever for anyone to play C in a one-shot true PD! Simulation introduces logical dependencies—that’s the whole point and to the extent it is not true even TDT agents will play D.
“Holding the other player’s choice constant” here is the equivalent of “holding the contents of the boxes constant” in Newcomb’s Problem. It presumes the answer.
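Here is a minimal sketch of the dependence in question, written purely for illustration: an explicit bounded simulator stands in for “knows or can guess your source code”, and the depth cutoff that optimistically assumes C at the recursion floor is just a toy device to end the regress of mutual simulation, not part of any real proposal.

```python
# Toy one-shot Prisoner's Dilemma. A player is a function (opponent, depth) -> 'C' or 'D'.
PAYOFFS = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
           ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def cdt(opponent, depth):
    """Unconditional defector: plays D no matter who it faces."""
    return 'D'

def tdt_like(opponent, depth):
    """Play C iff simulating the opponent against this very strategy predicts C.
    At the recursion floor we optimistically assume C, purely to end the regress."""
    if depth == 0:
        return 'C'
    return 'C' if opponent(tdt_like, depth - 1) == 'C' else 'D'

def play(p1, p2, depth=3):
    return PAYOFFS[(p1(p2, depth), p2(p1, depth))]

print(play(tdt_like, tdt_like))  # (3, 3): each simulates the other simulating it, and C holds up
print(play(tdt_like, cdt))       # (1, 1): against an unconditional defector the simulator plays D
```

The simulator cooperates with its twin only because both moves are outputs of the same computation; make the opponent’s move genuinely independent of that simulation and it plays D, which is the point above.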
Unless that agent already knows or can guess your source code, in which case it is simulating you or something highly correlated to you
I think you’re invoking TDT-style reasoning here, before the agent has self-modified into TDT.
Besides, I’m assuming a world where agents can’t know or guess each other’s source code. I thought I made that clear. If this assumption doesn’t make sense to you, consider this: what evidence can one AI use to infer the source code of another AI or its creator? What if any such evidence can be faked near-perfectly by the other AI? What about two AIs of different planetary origins meeting in space?
I know you’d like to assume a world where guessing each other’s source code is possible, since that makes everything work out nicely and everyone can “live happily ever after”. But why shouldn’t we consider both possibilities, instead of ignoring the less convenient one?
ETA: I think it may be possible to show that a CDT agent won’t self-modify into a TDT agent as long as it believes there is a non-zero probability that it lives in a world where it will encounter at least one agent that won’t know or guess its current or future source code, but that in the limit as that probability goes to zero, the DT it self-modifies into converges to TDT.
I think you’re invoking TDT-style reasoning here, before the agent has self-modified into TDT.
I already said that agents which start out as pure CDT won’t modify into pure TDTs—they’ll only cooperate if someone gets a peek at their source code after they self-modified. However, humans, at least, are not pure CDT agents—they feel at least the impulse to one-box on Newcomb’s Problem if you raise the stakes high enough.
This has nothing to do with evolutionary contexts of honor and cooperation and defection and temptation, and everything to do with our evolved instincts governing abstract logic and causality, which is what governs what sort of source code you think has what sort of effect. Even unreasonably pure CDT agents recognize that if they modify their source code at 7am, they should modify to play TDT against any agent that has looked at their source code after 7am. To humans, who are not pure CDT agents, the idea that you should play essentially the same way if Omega glimpsed your source code at exactly 6:59am seems like common sense, given the intuitions we have about logic and causality and elegance and winning. If you’re going to all the trouble to invent TDT anyway, it seems like a waste of effort to two-box against Omega if he perfectly saw your source code 5 seconds before you self-modified. (These being the kind of ineffable meta-decision considerations that we both agree are important, but which are hard to formalize.)
Besides, I’m assuming a world where agents can’t know or guess each other’s source code.
You are guessing their source code every time you argue that they’ll choose D. If I can’t make you see this as an instance of “guessing the other agent’s source code” then indeed you will not see the large correlations at the start point, and if the agents start out highly uncorrelated then the rare TDT agents will choose the correct maximizing action, D. They will be rare because, by assumption in this case, most agents end up choosing to cooperate or defect for all sorts of different reasons, rather than by following highly regular lines of logic in nearly all cases—let alone the same line of logic that kept on predictably ending up at D.
There’s a wide variety of cases where philosophers go astray by failing to recognize an everyday occurrence as an instance of an abstract concept. For example, they say in one breath that “God is unfalsifiable”, and in the next breath talk about how God spoke to them in their heart, because they don’t recognize “God spoke to me in my heart” as an instance of “God allegedly made something observable happen”. Philosophers talk about qualia being epiphenomenal in one breath, and then in the next speak of how they know themselves to be conscious, because they don’t recognize this self-observation as an instance of “something making something else happen” aka “cause and effect”. The only things recognized as matching the formal-sounding phrase “cause and effect” are big formal things officially labeled “causal”, not just stuff that makes other stuff happen.
In the same sense, you have this idea about modeling other agents as this big official affair that requires poring over their source code with a magnifying glass and then furthermore verifying that they can’t change it while you aren’t looking.
You need to recognize the very thought processes you are carrying out right now, in arguing that just about anyone will choose D, as an instance of guessing the outputs of the other agents’ source codes, and moreover of guessing that most such codes and outputs are massively logically correlated.
This is witnessed by the fact that if we did get to see some interstellar transactions, and you saw that the first three transactions were (C, C), you would say, “Wow, guess Eliezer was right” and expect the next one to be (C, C) as well. (And of course if I witnessed three cases of (D, D) I would say “Guess I was wrong.”) Even though the initial conditions are not physically correlated, we expect a correlation. What is this correlation, then? It is a logical correlation. We expect different species to end up following similar lines of reasoning, that is, performing similar computations, like factorizing 123,456 in spacelike separated galaxies.
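As a toy version of that last example: two factoring routines written independently, sharing no physical cause beyond implementing the same abstract computation, still agree that 123,456 = 2^6 × 3 × 643. A small Python illustration (both routines are, of course, made up for this sketch):

```python
def factor_trial_division(n):
    """Factor n by straightforward trial division."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def factor_recursive(n, d=2):
    """A differently written routine computing the same abstract function."""
    if n == 1:
        return []
    if d * d > n:
        return [n]
    if n % d == 0:
        return [d] + factor_recursive(n // d, d)
    return factor_recursive(n, d + 1)

# Two "galaxies", no causal contact, same answer: 123456 = 2**6 * 3 * 643.
assert factor_trial_division(123456) == factor_recursive(123456) == [2, 2, 2, 2, 2, 2, 3, 643]
```

The agreement between their outputs is what “logical correlation” refers to here; nothing travels between them.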
It occurs to me that the problem is important enough that even if we can reach intuitive agreement, we should still do the math. But it doesn’t help to solve the wrong problem, so do you think the following is the right formalization of the problem?
Assume a “no physical proof of source code” universe.
Assume three types of intelligent life can arise in this universe.
In a Type A species, Eliezer’s intuition is obvious to everyone, so they build AIs running TDT without further consideration.
In a Type B species, my intuition is obvious to everyone, so they build AIs running XDT, or AIs running CDT which immediately self-modify into XDT. Assume (or prove) that XDT behaves like TDT except that it unconditionally plays D in PD.
In a Type C species, different people have different intuitions, and some (Type D individuals) don’t have strong intuitions or prefer to use a formal method to make this meta-decision. We human beings obviously belong to this type of species, and let’s say we at LessWrong belong to this last subgroup (Type D).
Does this make sense so far?
Let me say where my intuition expects this to lead, so you don’t think I’m setting a trap for you to walk into. Whatever meta-decision we make, it can be logically correlated only with AIs running TDT and other Type D individuals in the universe. If the proportion of Type D individuals in the universe is low, then it’s obviously better for us to implement XDT instead of TDT. That’s because whether we use TDT or XDT will have little effect on how often other TDTs play C. (They can predict what Type D individuals will decide, but since there are few of us and they can’t tell which AIs were created by Type D individuals, it won’t affect their decisions much.)
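To make that concrete, here is a crude expected-value sketch. Every number in it is an assumption for illustration (standard PD payoffs, a baseline cooperation rate c0 for TDT opponents facing an AI they cannot read, and a knob delta for how much our choice shifts that rate via the correlated Type D share), not something established in this thread:

```python
# All numbers are illustrative assumptions, not claims from the discussion.
T, R, P, S = 5, 3, 1, 0   # one-shot PD payoffs: temptation, reward, punishment, sucker

def ev_build_tdt(c0, delta):
    """We build TDT: our AI plays C against TDT opponents, and by assumption our
    choice raises their cooperation rate toward it from c0 to c0 + delta, via the
    small share of builders whose decision is correlated with ours."""
    c = c0 + delta
    return R * c + S * (1 - c)

def ev_build_xdt(c0):
    """We build XDT: our AI plays D, and the opponents' cooperation rate stays at c0."""
    return T * c0 + P * (1 - c0)

c0 = 0.4  # assumed baseline rate at which TDT opponents cooperate with an unreadable AI
for delta in (0.01, 0.1, 0.5):
    print(f"delta={delta}: build-TDT EV {ev_build_tdt(c0, delta):.2f}, "
          f"build-XDT EV {ev_build_xdt(c0):.2f}")
# Building TDT only wins when R*(c0 + delta) > T*c0 + P*(1 - c0),
# i.e. here when delta > (c0 + 1)/3, so a small Type D share favors XDT.
```

How large delta really is, that is, how much of the universe’s reasoning is logically correlated with ours, is of course exactly what is in dispute above.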
Unfortunately we don’t know the proportions of different types of species/individuals. So we should program an AI to estimate them, and have it make the decision of what to self-modify into.
ETA: Just realized that the decisions of Type D individuals can also correlate with the intuitions of others, since intuitions come from unconscious mental computations and they may be of a similar nature to our explicit decisions. But this correlation will be imperfect, so the above reasoning still applies, at least to some extent.
ETA2: This logical correlation stuff is hard to think about. Can we make any sense of these types of problems before having a good formal theory of logical correlation?
ETA3: The thing that’s weird here is that, assuming everyone’s intuitions/decisions aren’t perfectly correlated, some will build TDTs and some will build XDTs. And it will be the ones who end up deciding to build XDTs (which defect) who will win. How to make sense of this, if that’s the wrong decision?
ETA4: I’ll be visiting Mt. Rainier for the rest of the day, so that’s it. :) Sorry for the over-editing.
Maybe cousin_it is right and we really have to settle this by formal math. But I’m lazy and will give words one more try. If we don’t reach agreement after this I’m going to the math.
So, right now we have different intuitions. Let’s say you have the correct intuition and convince everyone of it, and I have the incorrect one but I’m too stupid to realize it. So you and your followers go on to create a bunch of AIs with TDT. I go on to create an AI which is like TDT except that it plays defect in PD. Let’s say I pretended to be your follower and we never had this conversation, so there is no historical evidence that I would create such an AI. When my AI is born, it modifies my brain so that I start to believe I created an AI with TDT, thus erasing the last shred of evidence. My AI will then go on to win against every other AI.
Given the above, why should I change my mind now, and not win?
ETA: Ok, I realize this is pretty much the same scenario as the brain lesion one, except it’s not just possible, it’s likely. Someone is bound to have my intuition and be resistant to your persuasion. If you say that smart agents win, then he must be the smart one, right?
This, I think, connects one more terminological distinction. When you talked earlier about something like “reasoning about the output of platonic computation” as a key insight that started your version of TDT, you meant basically the same thing I meant when talking about how even knowledge of the other agent’s existence, the little things that let you talk about it at all, is already a logical dependence between you and the other agent that could in some cases be used to stage cooperation.