Can Nesov’s AI correctly guess what AI Eliezer would probably have built and vice versa?
No, I’m assuming that the AIs don’t have enough information or computational power to predict the human players’ choices. Suppose a human-created AI were to meet a paperclipper that was designed by a long-lost alien race. Wouldn’t you program the human AI to play defect against the paperclipper, assuming that there is no way for the AIs to prove their source codes to each other? The two AIs ought to think that they are both using the same decision theory (assuming there is just one obviously correct theory that they would both converge to). But that theory can’t be TDT, because if it were TDT, then the human AI would play cooperate, which you would have overridden if you had known that was going to happen.

Let me know if that still doesn’t make sense.
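For concreteness, here is a minimal sketch of the game being described: a one-shot Prisoner’s Dilemma in which the human AI’s payoffs are denominated in lives, the paperclipper’s in paperclips, and neither side can verify the other’s source code. The payoff numbers are made up for illustration; the point is only that if each agent treats the other’s move as causally fixed, defection dominates, and the disagreement here is about whether a TDT agent should reason that way.

```python
# Illustrative one-shot game between the human AI and the paperclipper.
# The numbers are made up; the first entry of each pair is the human AI's
# payoff (lives), the second is the paperclipper's (paperclips).
PAYOFFS = {
    ("C", "C"): (2, 2),
    ("C", "D"): (0, 3),
    ("D", "C"): (3, 0),
    ("D", "D"): (1, 1),
}

def causal_best_response(opponent_move):
    """The human AI's best move if the paperclipper's move is held fixed,
    i.e. treated as causally independent of our own choice."""
    return max(("C", "D"), key=lambda m: PAYOFFS[(m, opponent_move)][0])

# Defection dominates under that assumption, whatever the opponent does:
assert causal_best_response("C") == "D"
assert causal_best_response("D") == "D"
```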
Wei, the whole point of TDT is that it’s not necessary for me to insert special cases into the code for situations like this. Under any situation in which I should program the AI to defect against the paperclipper, I can write a simple TDT agent and it will decide to defect against the paperclipper.
TDT has that much meta-power in it, at least. That’s the whole point of using it.
(Though there are other cases—like the timeless decision problems I posted about that I still don’t know how to handle—where I can’t make this statement about the TDT I have in hand; but this is because I can’t handle those problems in general.)
...How much power, exactly?

Given an arbitrary, non-symmetric, one-shot, two-player game with non-transferable utility (your payoffs are denominated in human lives, the other guy’s in paperclips), and given that it’s common knowledge to both agents that they’re using identical implementations of your “TDT”, how do we calculate which outcome gets played?
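To make the question concrete, here is a rough sketch with a made-up non-symmetric game. Since both agents run identical code on common-knowledge inputs, they compute the same joint outcome; the open part of the question is which selection rule that shared computation implements. Two candidate rules (Nash bargaining over gains, and an “equal gains” rule that implicitly compares lives with paperclips) already pick different outcomes here; nothing below is claimed to be what TDT actually does.

```python
# A made-up non-symmetric game.  Rows are the human AI's moves, columns the
# paperclipper's; payoffs are (lives, paperclips) and are not comparable
# across players.  Both agents run identical code, so whatever joint outcome
# the shared algorithm selects is the one that gets played; the question is
# which selection rule it uses.
GAME = {
    ("A", "X"): (5, 1),
    ("A", "Y"): (0, 0),
    ("B", "X"): (0, 0),
    ("B", "Y"): (2, 2),
}
DISAGREEMENT = (0, 0)  # what each side can guarantee unilaterally in this game

def nash_product(outcome):
    """Nash bargaining: maximize the product of gains over the disagreement
    point (invariant to rescaling either player's utility)."""
    h, c = GAME[outcome]
    return (h - DISAGREEMENT[0]) * (c - DISAGREEMENT[1])

def egalitarian(outcome):
    """'Equal gains' rule; not even well-defined here, since it implicitly
    compares lives with paperclips."""
    h, c = GAME[outcome]
    return min(h - DISAGREEMENT[0], c - DISAGREEMENT[1])

print(max(GAME, key=nash_product))  # ('A', 'X')
print(max(GAME, key=egalitarian))   # ('B', 'Y'), a different answer
```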
Under any situation in which I should program the AI to defect against the paperclipper, I can write a simple TDT agent and it will decide to defect against the paperclipper.
So, what is that simple TDT agent? You seem to have ignored my argument that it can’t exist, but if you can show me the actual agent (and convince me that it would defect against the paperclipper if that’s not obvious) then of course that would trump my arguments.

ETA: Never mind, I figured this out myself. See step 11 of http://lesswrong.com/lw/15m/towards_a_new_decision_theory/11lx
This problem statement oversimplifies the information available to each player about the other player. Depending on what the players know, either course of action could be preferable. The challenge of a good decision theory is to formally describe what those conditions are.
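One way to see the dependence on information, under a strong simplifying assumption: suppose everything the human AI knows about the paperclipper compresses into a single credence p that the paperclipper’s choice will match its own (because, say, it believes they share a decision procedure). With the made-up payoffs from the earlier sketch, the preferred move flips as p crosses 3/4, so what the players know really does determine which course of action is preferable.

```python
# Rough sketch: p is our credence that the paperclipper's choice matches ours.
# Payoffs are the human AI's, using the same made-up numbers as above.
CC, CD, DC, DD = 2, 0, 3, 1

def preferred_move(p):
    """Compare expected payoffs, conditioning on our own choice."""
    eu_cooperate = p * CC + (1 - p) * CD   # if we cooperate: match -> (C, C)
    eu_defect    = p * DD + (1 - p) * DC   # if we defect:    match -> (D, D)
    return "C" if eu_cooperate > eu_defect else "D"

# With these numbers the preferred move flips at p = 3/4:
assert preferred_move(0.9) == "C"
assert preferred_move(0.6) == "D"
```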
Wouldn’t you program the human AI to play defect against the paperclipper, assuming that there is no way for the AIs to prove their source codes to each other?
Whatever I decide on this point, I expect FAI programmers in general to decide the same; and I expect the paperclipper to know what FAI programmers in general do, through simulations or higher-level reasoning, and act accordingly. So, no.