Eliezer wants to cooperate against a cooperating opponent, as depicted in the beginning of “Three Worlds Collide”. What I “should” do is quite another matter.
You don’t cooperate against a paperclip maximizer if you know it’ll cooperate even if you defect. If you cooperate in this situation, it’s the murder of 1 billion people. I’m quite confident that if you disagree with this statement, you misunderstand the problem.
Oh. Time to change my opinion—now I finally see what you and Eliezer mean by “general theory”. It reduces to something like this:
My source code contains a rule M that overrides everything else and is detectable by other agents. It says: I will precommit to cooperating (playing the Pareto-optimal outcome) if I can verify that the opponent’s source code contains M. Like a self-printing program (quine), no infinite recursion in sight. And, funnily enough, this statement can persuade other agents to modify their source code to include M—there’s no downside. Funky!
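For concreteness, here is a toy sketch of M (the marker name and the agents are invented for illustration): “source inspection” is modeled as matching a visible signature string, so verification terminates immediately, with no simulation of the opponent and hence no regress, much like a quine avoids one.

```python
RULE_M_MARKER = "RULE_M_V1"  # hypothetical machine-detectable signature

def has_rule_m(source: str) -> bool:
    # "Source inspection" modeled as marker matching: no simulation
    # of the opponent is needed, so there is no infinite recursion.
    return RULE_M_MARKER in source

def decide(opponent_source: str) -> str:
    """Rule M: cooperate iff the opponent verifiably carries rule M."""
    return "C" if has_rule_m(opponent_source) else "D"

cooperator = "... decision procedure tagged RULE_M_V1 ..."
naive_bot = "... always plays C, carries no tag ..."

print(decide(cooperator))  # -> C  (mutual verification succeeds)
print(decide(naive_bot))   # -> D  (defect against an unconditional cooperator)
```

Note that the sketch also defects against the unconditional cooperator, matching the point above about the paperclip maximizer.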
But I still have no idea what Newcomb’s problem has to do with that. Maybe I should give myself time to think some more...
Or, more generally: “If, for whatever reason, there’s a sufficiently strong correlation between my cooperation and my opponent’s cooperation, then cooperation is the correct answer.”
You need causation, not correlation. Correlation considers the whole state space, whereas you need to look at correlation within each conditional region of state space, given one action (your cooperation) or the other (your defection), which in this case corresponds to causation. If you only look for unconditional correlation, you are inadvertently asking the same circular question: “what will I do?”. When you act, you determine which parts of the state space are annihilated, becoming not just counterfactual but impossible, and this is all you can (ever) do. Correlation depends on that, since it’s computed over what remains. So you can’t search for that information and use it as a basis for your decisions.
If you know the following fact: “The other guy will cooperate iff I cooperate”, even if you know nothing about the nature of the cause of the correlation, that’s still a good enough reason to cooperate.
You ask yourself “If I defect, what will the outcome be? If I cooperate, what will the outcome be?” Taking the correlation into account, you then determine which outcome you prefer. And there you go.
For example, imagine that, say, two AIs that were created with the same underlying architecture (though possibly with different preferences) meet up. They also know the fact of their similarity. Then they may reason something like “Hmm… the same underlying algorithms running in me are running in my opponent, so presumably they are reasoning the exact same way as I am at this very moment. So whichever way I happen to decide, cooperate or defect, they’ll probably decide the same way. So the only realistically possible outcomes are ‘both of us cooperate’ and ‘both of us defect’; I choose the former, since it has the better outcome for me. Therefore I cooperate.”
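The “same algorithm, same answer” step can be made concrete. A minimal sketch (the payoff numbers are illustrative, not from the discussion): two agents with different preferences run the identical decision procedure, so each can restrict attention to the matched outcomes (C, C) and (D, D) and simply pick whichever diagonal it prefers.

```python
def decide(my_payoffs):
    # my_payoffs[(my_move, their_move)] -> my utility.
    # Both agents run this exact deterministic function and know it,
    # so only the matched outcomes (C,C) and (D,D) are live options.
    return "C" if my_payoffs[("C", "C")] > my_payoffs[("D", "D")] else "D"

# Different preferences, same decision procedure (illustrative numbers):
alice = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
bob = {("C", "C"): 10, ("C", "D"): -2, ("D", "C"): 20, ("D", "D"): 2}

print(decide(alice), decide(bob))  # -> C C  (same algorithm, same conclusion)
```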
In other words, what I choose is also lawful. That is, physics underlies my brain: my decision is not just a thing that causes future things, but a thing that was caused by past things. If I know that the same past things influenced my opponent’s decision in the same way, then I may be able to infer “whatever sort of reasoning I’m doing, they’re also doing, so...”
Or did I completely fail to understand your objection?
My source code contains a rule M that overrides everything else and is detectable by other agents. It says: I will precommit to cooperating (playing the Pareto-optimal outcome) if I can verify that the opponent’s source code contains M. Like a self-printing program (quine), no infinite recursion in sight. And, funnily enough, this statement can persuade other agents to modify their source code to include M—there’s no downside. Funky!
Something like this. Referring to an earlier discussion, “Cooperator” is an agent that implements M. Practical difficulties are all in signaling that you implement M, while actually implementing it may be easy (but pointless if you can’t signal it and can’t detect M in other agents).
The relation to Newcomb’s problem is that there is no need to implant a special-purpose algorithm like the M you described above; you can guide all of your actions by a single decision theory that implements M as a special case (generalizes M, if you like) and also solves Newcomb’s problem.
One inaccuracy here is that there are many Pareto-optimal global strategies (in the PD there are many once you allow mixed strategies), with different payoffs to different agents, and so the agents must first agree on which one they’ll jointly implement. This creates a problem analogous to the Ultimatum game, or the problem of fairness.
you can guide all of your actions by a single decision theory that implements M as a special case (generalizes M if you like), and also solves Newcomb’s problem
Didn’t think about that. Now I’m curious: how does this decision theory work? And does it give incentive to other agents to adopt it wholesale, like M does?
That’s the idea. I more or less know how my version of this decision theory works, and I’m likely to write it up in the next few weeks. I wrote a little bit about it here (I changed my mind about causation; it’s easy enough to incorporate here, but I’ll have to read up on Pearl first). There is also Eliezer’s version, which started the discussion and was never explicitly described, even at a surface level.
Overall, there seem to be no magic tricks, only the requirement for a philosophically sane problem statement, with inevitable and long-known math following thereafter.
OK, I seem to vaguely understand how your decision theory works, but I don’t see how it implements M as a special case. You don’t mention source code inspection anywhere.
What matters is the decision (and its dependence on other facts). Source code inspection is only one possible procedure for obtaining information about the decision. The decision theory doesn’t need to refer to a specific means of getting that information. I talked about a related issue here.
Forgive me if I’m being dumb, but I still don’t understand. If two similar agents (not identical, to avoid the clones argument) play the PD using your decision theory, how do they arrive at (C, C)? Even if the agents’ algorithms are common knowledge, a naive attempt to simulate the other guy falls into bottomless recursion as usual. Is the answer somehow encoded in “the most general precommitment”? What do the agents precommit to? How does Pareto optimality enter the scene?
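The bottomless recursion this question points at is easy to reproduce. A toy sketch (everything here is invented for illustration): each agent decides by simulating the other’s decision procedure, which immediately asks the same question back, so the simulation has no base case.

```python
import sys

def naive_decide(me, opponent):
    # "What will they do?" -- answered by simulating them, and the
    # simulated opponent asks the same question about me. No base case.
    their_move = naive_decide(opponent, me)
    return "C" if their_move == "C" else "D"

sys.setrecursionlimit(200)  # fail fast instead of grinding through frames
try:
    naive_decide("agent_1", "agent_2")
except RecursionError:
    print("simulation never bottoms out")
```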