Wei Dai comments on Ingredients of Timeless Decision Theory

Wei Dai Aug 19, 2009, 12:33 PM
2 points
Yes, I think Eliezer made a similar point:

What if the TDTs that you’re playing against, decide to defect unconditionally if you submit a CDT player, in order to give you an incentive to submit a TDT player?

So if you run TDT, then there are at least two equilibria in this game, only one of which involves you submitting a CDT. Can you think of a way to select between these two equilibria?

If not, I can fix this by changing the game a bit. Omega will now create his TDT AIs after you design yours, and hard code the source code of your AI into it as givens. His AIs won’t even know about you, the real player.
- Eliezer Yudkowsky Aug 19, 2009, 2:47 PM
  8 points
  Parent
  
  Omega will now create his TDT AIs after you design yours, and hard code the source code of your AI into it as givens. His AIs won’t even know about you, the real player.
  
  They might simply infer you, the real player. You might as well tell the TDT AIs that they’re up against a hardcoded Defect move as the “other player”, but they won’t know if that player has been selected. In fact, that pretty much is what you’re telling them, if you show them a CDT player. The CDT player is a red herring—the decision to defect was made by you, in the moment of submitting a CDT player. There is no law against TDT players realizing this after Omega codes them.
  
  I should note that in matters such as these, the phrase “hard code” should act as a warning sign that you’re trying to fix something that, at least in your own mind, doesn’t want to be fixed. (E.g. “hard code obedience into AIs, build it into the very circuitry!”) Where you are tempted to say “hard code” you may just need to accept whatever complex burden you were trying to get rid of by saying “fix it in place with codes of iron!”
  What links here?
  - Vladimir_Nesov's comment on Another attempt to explain UDT by cousin_it (Nov 14, 2010, 10:55 PM; 7 points)
  - Wei Dai Aug 19, 2009, 9:03 PM
    2 points
    Parent
    By hard code, I meant code it into the TDT’s probability distribution. (Even TDT isn’t meta enough to say “My prior is wrong!”) But that does make the example less convincing, so let me try something else.
    
    Have Omega’s AIs physically go first and you play for yourself. They get a copy of your source code, then make their moves in the 3-choose-2 PD game first. You learn their move, then make your choice. Now, if you follow CDT, you’ll reason that your decision has no causal effect on the TDT’s decisions, and therefore choose D. The TDTs, knowing this, will play C.
    
    And I think I can still show that if you run TDT, you will decide to self-modify into CDT before starting this game. First, if Omega’s AIs know that you run TDT at the beginning, then they can use that “play D if you self-modify” strategy to deter you from self-modifying. But you can also use “I’ll self-modify anyway” to deter them from doing that. So who wins this game? (If someone moves first logically, then he wins, but what if everyone moves simultaneously in the logical sense, which seems to be the case in this game?)
    
    Suppose it’s common knowledge that Omega mostly chooses CDT agents to participate in this game, then “play D if you self-modify” isn’t very “credible”. That’s because they only see your source code after you self-modify so they’d have to play D if they predict that a TDT agent would self-modify, even if the actual player started with CDT. Given that, your “I’ll self-modify anyway” would be highly credible.
    
    I’m not sure how to formalize this notion of “credibility” among TDTs, but it seems to make intuitive sense.
    - Eliezer Yudkowsky Aug 19, 2009, 9:37 PM
      6 points
      Parent
      
      And I think I can still show that if you run TDT, you will decide to self-modify into CDT before starting this game
      
      Well that should never happen. Anything that would make a TDT want to self-modify into CDT should make it just want to play D, no need for self-modification. It should give the same answer at different times, that’s what makes it a timeless decision theory. If you can break that without direct explicit dependence on the algorithm apart from its decisions, then I am in trouble! But it seems to me that I can substitute “play D” for “self-modify” in all cases above.
      
      First, if Omega’s AIs know that you run TDT at the beginning, then they can use that “play D if you self-modify” strategy to deter you from self-modifying.
      
      E.g., “play D if you play D to deter you from playing D” seems like the same idea, the self-modification doesn’t add anything.
      
      So who wins this game? (If someone moves first logically, then he wins, but what if everyone moves simultaneously in the logical sense, which seems to be the case in this game?)
      
      Well… it partially seems to me that, in assuming certain decisions are made without logical consequences—because you move logically first, or because the TDT agents have fixed wrong priors, etc. - you are trying to reduce the game to a Prisoner’s Dilemma in which you have a certain chance of playing against a piece of cardboard with “D” written on it. Even a uniform population of TDTs may go on playing C in this case, of course, if the probability of facing cardboard is low enough. But by the same token, the fact that the cardboard sometimes “wins” does not make it smarter or more rational than the TDT agents.
      
      Now, I want to be very careful about how I use this argument, because indeed a piece of cardboard with “only take box B” written on it, is smarter than CDT agents on Newcomb’s Problem. But who writes that piece of cardboard, rather than a different one?
      
      An authorless piece of cardboard genuinely does go logically first, but at the expense of being a piece of cardboard, which makes it unable to adapt to more complex situations. A true CDT agent goes logically first, but at the expense of losing on Newcomb’s Problem. And your choice to put forth a piece of cardboard marked “D” relies on you expecting the TDT agents to make a certain response, which makes the claim that it’s really just a piece of cardboard and therefore gets to go logically first, somewhat questionable.
      
      Roughly, what I’m trying to reply is that you’re reasoning about the response of the TDT agents to your choosing the CDT algorithm, which makes you TDT, but you’re also trying to force your choice of the CDT algorithm to go logically first, but this is begging the question.
      
      I would, perhaps, go so far as to agree that in an extension of TDT to cases in which certain agents magically get to go logically first, then if those agents are part of a small group uncorrelated with yet observationally indistinguishable from a large group, the small group might make a correlated decision to defect “no matter what” the large group does, knowing that the large group will decide to cooperate anyway given the payoff matrix. But the key assumption here is the ability to go logically first.
      
      It seems to me that the incompleteness of my present theory when it comes to logical ordering is the real key issue here.
      - Wei Dai Aug 19, 2009, 10:01 PM
        1 point
        Parent
        
        Well that should never happen. Anything that would make a TDT want to self-modify into CDT should make it just want to play D, no need for self-modification. It should give the same answer at different times, that’s what makes it a timeless decision theory. If you can break that without direct explicit dependence on the algorithm apart from its decisions, then I am in trouble! But it seems to me that I can substitute “play D” for “self-modify” in all cases above.
        
        The reason to self-modify is to make yourself indistinguishable from players who started as CDT agents, so that Omega’s AIs can’t condition their moves on the player’s type. Remember that Omega’s AIs get a copy of your source code.
        
        A true CDT agent goes logically first, but at the expense of losing on Newcomb’s Problem.
        
        But a CDT agent would self-modify into something not losing on Newcomb’s problem if it expects to face that. On the other hand, if TDT doesn’t self-modify into something that wins my game, isn’t that worse? (Is it better to be reflectively consistent, or winning, if you had to choose one?)
        
        It seems to me that the incompleteness of my present theory when it comes to logical ordering is the real key issue here.
        
        Yes, I agree that’s a big piece of the puzzle, but I’m guessing the solution to that won’t fully solve the “stupid winner” problem.
        
        ETA: And for TDT agents that move simultaneously, there remains the problem of “bargaining” to use Nesov’s term. Lots of unsolved problems… I wish you started us working on this stuff earlier!
        Vladimir_Nesov Aug 19, 2009, 10:19 PM
        3 points
        Parent
        
        The reason to self-modify is to make yourself indistinguishable from players who started as CDT agents, so that Omega’s AIs can’t condition their moves on the player’s type.
        
        Being (or performing an action) indistinguishable from X doesn’t protect you from the inference that X probably resulted from such a plot. That you can decide to camouflage like this may even reduce X’s own credibility (and so a lot of platonic/possible agents doing that will make the configuration unattractive). Thus, the agents need to decide among themselves what to look like: first-mover configurations is a limited resource.
        
        (This seems like a step towards solving bargaining.)
        Wei Dai Aug 19, 2009, 10:25 PM
        0 points
        Parent
        Yes, I see that your comment does seem like a step towards solving bargaining among TDT agents. But I’m still trying to argue that if we’re not TDT agents yet, maybe we don’t want to become them. My comment was made in that context.
        Vladimir_Nesov Aug 19, 2009, 10:47 PM
        1 point
        Parent
        Let’s pick up Eliezer’s suggestion and distinguish now-much-less-mysterious TDT from the different idea of “updateless decision theory”, UDT, that describes choice of a whole strategy (function from states of knowledge to actions) rather than choice of actions in each given state of knowledge, of which latter class TDT is an example. TDT isn’t a UDT, and UDT is a rather vacuous statement, as it only achieves reflective consistency pretty much by definition, but doesn’t tell much about the structure of preference and how to choose the strategy.
        
        I don’t want to become a TDT agent, as in UDT sense, TDT agents aren’t reflectively consistent. They could self-modify towards more UDT-ish look, but this is the same argument as with CDT self-modifying into a TDT.
        Eliezer Yudkowsky Aug 19, 2009, 10:59 PM
        0 points
        Parent
        Dai’s version of this is a genuine, reflectively consistent updateless decision theory, though. It makes the correct decision locally, rather than needing to choose a strategy once and for all time from a privileged vantage point.
        
        That’s why I referred to it as “Dai’s decision theory” at first, but both you and Dai seem to think your idea was the important one, so I compromised and referred to it as Nesov-Dai decision theory.
        Vladimir_Nesov Aug 19, 2009, 11:12 PM
        2 points
        Parent
        Well, as I see UDT, it also makes decisions locally, with understanding that this local computation is meant to find the best global solution given other such locally computed decisions. That is, each local computation can make a mistake, making the best global solution impossible, which may make it very important for the other local computations to predict (or at least notice) this mistake and find the local decisions that together with this mistake constitute the best remaining global solution, and so on. The structure of states of knowledge produced by the local computations for the adjacent local computations is meant to optimize the algorithm of local decision-making in those states, giving most of the answer explicitly, leaving the local computations to only move the goalpost a little bit.
        
        The nontrivial form of the decision-making comes from the loop that makes local decisions maximize preference given the other local decisions, and those other local decisions do the same. Thus, the local decisions have to coordinate with each other, and they can do that only through the common algorithm and logical dependencies between different states of knowledge.
        
        At which point the fact that these local decisions are part of the same agent seems to become irrelevant, so that a more general problem needs to be solved, one of cooperation of any kinds of agents, or even more generally processes that aren’t exactly “agents”.
        What links here?
        Vladimir_Nesov's comment on Outlawing Anthropics: An Updateless Dilemma by Eliezer Yudkowsky (Sep 20, 2009, 2:33 PM; 2 points)
        Wei Dai Aug 19, 2009, 11:38 PM
        4 points
        Parent
        One thing I don’t understand is that both you and Eliezer talk confidently about how agents would make use of logical dependencies/correlations. You guys don’t seem to think this is a really hard problem.
        
        But we don’t even know how to assign a probability (or whether it even makes sense to do so) to a simple mathematical statement like P=NP. How do we calculate and/or represent the correlation between one agent and another agent (except in simple cases like where they’re identical or easily proven to be equivalent)? I’m impressed by how far you’ve managed to push the idea of updatelessness, but it’s hard for me to process what you say, when the basic concept of logical uncertainty is still really fuzzy.
        Expand this thread
        Eliezer Yudkowsky Aug 19, 2009, 11:53 PM
        3 points
        Parent
        I can argue pretty forcefully that (1) a causal graph in which uncertainty has been factored into uncorrelated sources, must have nodes or some kind of elements corresponding to logical uncertainty; (2) that in presenting Newcomblike problems, the dilemma-presenters are in fact talking of such uncertainties and correlations; (3) that human beings use logical uncertainty all the time in an intuitive sense, to what seems like good effect.
        
        Of course none of that is actually having a good formal theory of logical uncertainty—I just drew a boundary rope around a few simple logical inferences and grafted them onto causal graphs. Two-way implications get represented by the same node, that sort of thing.
        
        I would be drastically interested in a formal theory of logical uncertainty for non-logically-omniscient Bayesians.
        
        Meanwhile—you’re carrying out logical reasoning about whole other civilizations starting from a vague prior over their origins, every time you reason that most superintelligences (if any) that you encounter in faraway galaxies, will have been built in such a way as to maximize a utility function rather than say choosing the first option in alphabetical order, on the likes of true PDs.
        Wei Dai Aug 20, 2009, 10:54 AM
        2 points
        Parent
        I want to try to understand the nature of logical correlations between agents a bit better.
        
        Consider two agents who are both TDT-like but not perfectly correlated. They play a one-shot PD but in turn. First one player moves, then the other sees the move and makes its move.
        
        In normal Bayesian reasoning, once the second player sees the first player’s move, all correlation between them disappears. (Does this happen in your TDT?) But in UDT, the second player doesn’t update, so the correlation is preserved. So far so good.
        
        Now consider what happens if the second player has more computing power than the first, so that it can perfectly simulate the first player and compute its move. After it finishes that computation and knows the first player’s move, the logical correlation between them disappears, because no uncertainty implies no correlation. So, given there’s no logical correlation, it ought to play D. The first player would have expected that, and also played D.
        
        Looking at my formulation of UDT, this may or may not happen, depending on what the “mathematical intuition subroutine” does when computing the logical consequences of a choice. If it tries to be maximally correct, then it would do a full simulation of the opponent when it can, which removes logical correlation, which causes the above outcome. Maybe the second player could get a better outcome if it doesn’t try to be maximally correct, but the way my theory is formulated, what strategy the “mathematical intuition subroutine” uses is not part of what’s being optimized.
        
        So, I’m not sure what to do about this, except to add it to the pile of unsolved problems.
        What links here?
        “UDT2” and “against UD+ASSA” by Wei Dai (May 12, 2019, 4:18 AM; 50 points)
        Wei Dai's comment on Ingredients of Timeless Decision Theory by Eliezer Yudkowsky (Aug 23, 2009, 10:12 AM; 2 points)
        ESRogs Apr 26, 2014, 9:26 AM
        2 points
        Parent
        Coming to this a bit late :), but I’ve got a basic question (which I think is similar to Nesov’s, but I’m still confused after reading the ensuing exchange). Why would it be that,
        
        The first player would have expected that, and also played D.
        
        If the second player has more computer power (so that the first player cannot simulate it), how can the first player predict what the second player will do? Can the first player reason that since the second player could simulate it, the second player will decide that they’re uncorrelated and play D no matter what?
        
        That dependence on computing power seems very odd, though maybe I’m sneaking in expectations from my (very rough) understanding of UDT.
        Vladimir_Nesov Aug 20, 2009, 11:51 AM
        0 points
        Parent
        
        Now consider what happens if the second player has more computing power than the first, so that it can perfectly simulate the first player and compute its move. After it finishes that computation and knows the first player’s move, the logical correlation between them disappears, because no uncertainty implies no correlation. So, given there’s no logical correlation, it ought to play D. The first player would have expected that, and also played D.
        
        The first player’s move could depend on the second player’s, in which case the second player won’t get the answer is a closed form, the answer must be a function of the second player’s move...
        What links here?
        ESRogs's comment on Ingredients of Timeless Decision Theory by Eliezer Yudkowsky (Apr 26, 2014, 9:26 AM; 2 points)
        Wei Dai Aug 20, 2009, 11:57 AM
        0 points
        Parent
        But if the second player has more computational power, it can just keep simulating the first player until the first player runs out of clock cycles and has to output something.
        Vladimir_Nesov Aug 20, 2009, 12:23 PM
        0 points
        Parent
        I don’t understand your reply: exact simulation is brute force that isn’t a good idea. You can prove general statements about the behavior of programs on runs of unlimited or infinite length in finite time. But anyway, why would the second player provoke mutual defection?
        Wei Dai Aug 20, 2009, 12:39 PM
        0 points
        Parent
        
        But anyway, why would the second player provoke mutual defection?
        
        In my formulation, it doesn’t have a choice. Whether or not it does exact simulation of the first player is determined by its “mathematical intuition subroutine”, which I treated as a black box. If that module does an exact simulation, then mutual defection is the result. So this also ties in with my lack of understanding regarding logical uncertainty. If we don’t treat the thing that reasons about logical uncertainty as a black box, what should we do?
        
        ETA: Sometimes exact simulation clearly is appropriate, for example in rock-paper-scissors.
        Vladimir_Nesov Aug 19, 2009, 11:49 PM
        0 points
        Parent
        Conceptually, I treat logical uncertainty as I do prior+utility, a representation of preference, in this more general case over mathematical structures. The problems of representing this preference compactly and extracting human preference don’t hinder these particular explorations.
        Wei Dai Aug 20, 2009, 12:02 AM
        0 points
        Parent
        I don’t understand this yet. Can you explain in more detail what is a general (noncompact) way to representing logical uncertainty?
        What links here?
        Wei Dai's comment on Agree, Retort, or Ignore? A Post From the Future by Wei Dai (Nov 25, 2009, 12:20 AM; 12 points)
        Vladimir_Nesov Aug 19, 2009, 10:26 PM
        0 points
        Parent
        If you are a CDT agent, you can’t (or simply won’t) become a normal TDT agent. If you are a human, who knows what that means.
  - MichaelVassar Aug 19, 2009, 5:08 PM
    0 points
    Parent
    After all, for anything you can hard code, the AI can build a new AI that lacks your hard coding and sacrifice its resources to that new AI.