After kicking around ideas for a bit, I came up with a candidate for “fair decision theory problem class” and a reasonable-seeming notion of optimality (like Pareto optimality, not necessarily unique) which things like DefectBot and the various FairBots all fail. However, Masquerade fails it too, albeit for interesting reasons.
Let’s outsource all of our provability oracle needs: instead of receiving the opponent’s source code, let’s redefine the problem so that both agents have access to a hierarchy of halting oracles for PA+i, and have “my current opponent” as a callable variable. That is, they can ask if PA+i proves statements of the form “X(Y)=C” or “X(Y)=D”, where X and Y can be the agent’s self, their opponent, or any specified mask. But they can’t do things like check whether the source code is identical to their own. (Why am I doing this? As Eliezer puts it when talking about Newcomb’s Problem, only your decisions should matter to another agent, not your ritual of cognition.) Aside from those oracle calls, our agents are limited to using bounded loops.
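To make this agent class concrete, here is a toy sketch in Python. Real PA+i provability oracles are uncomputable, so the oracle is stubbed out by a hand-supplied table of facts; the names `make_oracle` and `proves` and the cutoff `N` are my own illustrative choices, not part of the original proposal.

```python
# Toy model of the proposed agent class. The PA+i provability oracles
# are uncomputable, so we stub them with a table mapping
# (X, Y, action) to the least i at which "X(Y)=action" becomes provable.

N = 3  # hypothetical cutoff for the oracle hierarchy

def make_oracle(facts):
    """Return a stub oracle: proves(i, x, y, a) answers the query
    'does PA+i prove X(Y)=a?' according to the facts table."""
    def proves(i, x, y, action):
        least = facts.get((x, y, action))
        return least is not None and least <= i
    return proves

def fairbot(proves, me, opp):
    """A FairBot in this class: nothing but oracle calls plus a bounded loop."""
    for i in range(1, N + 1):  # bounded loop over the hierarchy, as required
        if proves(i, opp, me, "C"):
            return "C"
    return "D"
```

Under this stub, an agent whose cooperation is provable at some level up to N gets cooperation back, and one with no such proof gets defection; note the agent never inspects source code, only asks the oracle about decisions.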
Let’s introduce the notion of “regret”, an analogue of counterfactual surgery on one’s decision. I imagine someone who’s written a FairBot, upon seeing that it’s playing a single game against a CooperateBot, regretting that they couldn’t have included the subroutine “if your opponent is this CooperateBot, then defect” in their program.
(Note that this subroutine is illegal under my definition of the class of agents; but it serves the purpose of analogizing “counterfactual surgery on this particular decision”.)
As I’ve been saying, there are some forms of this regret that I’m willing to accept. One of these is regret for third-party punishment; if I find I’m playing against an agent who cooperates with me iff I cooperate with DefectBot, then I might locally wish I had the appropriate subroutine when I played this punisher, but having it would surely harm me if I later played a DefectBot. So I’ll call it a regret only if I wish I could have coded a surgical subroutine that applies only to the current opponent, not to any third parties.
The other exception I want to make is regret for “running out of time”. If X is “cooperate if PA proves Opp(X)=C, else defect” and Y is “defect if PA proves Opp(Y)=D, then cooperate if PA+1 proves Opp(Y)=C, else defect”, then X and Y will mutually defect and either programmer could regret this… but instead of adding a surgical subroutine, they could fix the calamity by changing X’s code to look in PA+1 rather than PA. So I’ll say that a regret is an “out-of-time regret” if a better outcome could have been reached by strictly increasing indices of oracle calls in one or both agents. (Masquerade suffers from out-of-time regrets against lots of FairBots, since it takes until PA+3 to work out that mutual cooperation is the best option.)
So the kind of optimality that I want is an agent in this class, such that its only regrets against other agents of this class are out-of-time regrets.
DefectBot and FairBot clearly fail this. At first, I hoped that Masquerade would satisfy it, but then realized that an opponent could be exploitably stupid without Masquerade realizing it. To wit, let StupidBot cooperate with X if and only if PA+4 proves that X cooperates with (this post’s version of) FairBot. Then Masquerade mutually cooperates with StupidBot, but Masquerade could have a surgical subroutine to defect against StupidBot and it would get away with it.
I’m wondering, though, if adding in a bit of “playing chicken with your decision process” would not only patch this hole, but satisfy my optimality condition...
ETA: I’m not sure that anything will turn out to satisfy this criterion, but it seems plausible. However, if we pass from the Prisoner’s Dilemma to games with a bargaining component, I’d expect that nothing satisfies it: there’s just too much room for one-upsmanship there. Hopefully there will be some weak generalization (or some emendation) that’s meaningful.
@orthonormal: Thanks for pointing me to these posts during our dinner last Thursday.
I haven’t checked the proofs in detail, and I didn’t read all previous posts in the series. I’m a bit confused about a couple of things, and I’d be happy if someone could enlighten me.
1) What is the purpose of the i-loops?
In your definition of FairBot and others, you check for all i=1,...,N whether some sentence is provable in PA+i. However, if a sentence is provable in PA+i, then it is also provable in PA+(i+1). For this reason, I don’t understand why you loop over i=1,...,N.
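The monotonicity point can be made concrete: for any predicate that is upward monotone in i (as provability in PA+i is), looping over i=1,...,N returns the same answer as a single query at i=N. A minimal sketch, with `proves` standing in for the oracle:

```python
# If proves(i) is upward monotone in i (true at i implies true at i+1),
# then looping over i = 1..n is equivalent to a single query at i = n.

def loop_version(proves, n):
    return any(proves(i) for i in range(1, n + 1))

def single_version(proves, n):
    return proves(n)
```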
2) FairBot and the search procedure can be simplified
The following holds for all agents X and Y, and all j,k in {C,D}:
Search(X,Y) fails if and only if, for all j,k in {C,D}, the sentence “j = X(Y) and k=Y(X)” is not provable in PA+N.
If Search(X,Y) does not fail, then Search(X,Y) outputs ( X(Y) , Y(X) ).
This means you could define the search procedure simply as follows:
Search’(X,Y) := ( X(Y) , Y(X) )
This procedure Search’ has the same output as Search whenever Search does not fail, and Search’ is meaningful on strictly more inputs. In particular, Search’ halts and outputs the correct answer if and only if X halts on input Y and Y halts on input X.
Then you can redefine Search(X,Y) equivalently as follows:
If “X(Y) halts and Y(X) halts” is provable in PA+N then return ( X(Y) , Y(X) );
Otherwise, return False.
Formulated this way, it becomes apparent that you are trying to find proofs for the fact that certain Turing machines halt on certain inputs; the actual output is not so important and can be computed outside the proof system.
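A runnable sketch of this redefined Search, with the single PA+N halting query stubbed out as a caller-supplied predicate (`provably_halts` is my own stand-in name for the oracle call):

```python
def search(provably_halts, X, Y):
    """Redefined Search: one halting query to the (stubbed) PA+N oracle;
    the actual outputs are computed outside the proof system."""
    if provably_halts(X, Y):
        return (X(Y), Y(X))
    return False
```

For instance, with a CooperateBot and a DefectBot and an always-succeeding stub, the search returns the pair of their actual outputs.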
Similarly, FairBot(X) can be simplified as follows:
If “X(FairBot) halts” is provable in PA+N, then return X(FairBot).
Otherwise return D.
This nicely matches the intuition that we try to understand our opponent in interaction with us, and if we do, we can predict what they will do and match their action; if we fail to understand them, we are clueless and have to defect.
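The same stubbing trick gives a runnable sketch of this simplified FairBot (the name `simple_fairbot` and the `provably_halts` predicate standing in for the PA+N oracle are assumptions of this toy model):

```python
def simple_fairbot(provably_halts, X):
    # If the (stubbed) PA+N oracle proves X(FairBot) halts, mirror X's
    # action; otherwise we are clueless and defect.
    if provably_halts(X):
        return X(simple_fairbot)
    return "D"
```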
3) On a related note, what is the utility if we don’t halt?
Maybe your concern is that we should halt even if we play against the Turing machine that never halts? To me it seems that a prisoner who can’t decide on what to do effectively outputs “Cooperate”, since the guards will only wait a finite amount of time and don’t extract any information from a prisoner that does not halt to explicitly output D.
In the model in which not halting is the same as cooperating, the trivial bot below is optimally fair:
TrivialFairBot(X):
    return X(TrivialFairBot)    (i.e., simply simulate your opponent and match their action)
Based on the fact that you’re doing something more complicated, I assume that you want to forbid not halting; let’s say we model this such that every non-halting player gets a utility of -∞. Then it is a bit weird that we have to halt while our opponent is allowed to run into an infinite loop: it seems to give malicious opponents an unfair advantage if they’re allowed to threaten to go into an infinite loop.
4) Why allow these PA oracles?
I don’t understand why you forbid your agents to run into an infinite loop, but at the same time allow them to have access to uncomputable PA+N oracles. Or are you only looking at proofs of a certain bounded length? If so, what’s the motivation for that?
5) How do you define “regrets” formally?
I haven’t checked this, but it seems that, by diagonalization arguments, every possible agent will lose against infinitely many opponents. It further seems that, for any given agent X, you can always find an opponent Y so that adding a hard-wired rule “if your opponent is this particular agent Y, then do C/D” (whichever is better) improves the performance of X (note that this can’t be done for all Y, since Y could first check whether X has a hard-wired rule like that). In particular, I can imagine that there is an infinite sequence X_1, X_2, ... of agents such that X_{i+1} performs at least as well as X_i on all inputs, and performs better on at least one input. This would mean that there is no “Pareto-optimum”, since there’s always some input on which our performance can be improved.
6) What’s the motivation for studying a decision theory that is inherently inefficient, almost not computable, and easily exploitable by some opponents?
If we’re trying to model a real decision theory, the two agents arguably shouldn’t have access to a lot of time. Would the following resource-bounded version make sense? For some function t(n), the machines are forced to halt and produce their output within t(n) computation steps, where n=|X|+|Y| is the length of the binary encoding of the machines. In this post, you’re getting decision theories that are at least exponential in n. I’d be interested to see whether we can achieve something when t(n) is a polynomial.
The time-bounded setting has some advantages: I think it’d more realistically model the problem, and it prevents an arms race of power since no player is allowed to take more than t(n) time. On the other hand, things get more difficult (and more interesting) now since you can inspect your opponent’s source code, but you don’t have enough time to fully simulate your opponent.
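One way to model the t(n) budget concretely: give each agent a tick callback that charges one unit per simulated step and aborts the run when the budget is exhausted. All names here, and the choice to treat an aborted run as “did not halt in time”, are my own modeling assumptions, not from the comment above.

```python
class OutOfTime(Exception):
    """Raised when an agent exceeds its step budget t(n)."""
    pass

def run_with_budget(agent, opponent, budget):
    """Run agent(opponent, tick) under a step budget; each call to
    tick() costs one unit.  Exceeding the budget counts as not
    halting in time (a real model would count machine steps)."""
    left = [budget]  # mutable cell so the closure can decrement it
    def tick():
        left[0] -= 1
        if left[0] < 0:
            raise OutOfTime
    try:
        return agent(opponent, tick)
    except OutOfTime:
        return None  # did not halt within the budget
```

An agent that threatens an infinite loop then simply forfeits once its budget runs out, which removes the unfair advantage of non-halting threats discussed in point 3.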
Thanks for the reply, and sorry for the delay...

Masquerade correctly defects against TemptingCooperateBot, for the following reasons:
PA proves “FB(TCB)=TCB(FB)=C”
PA+5 proves “TCB(DB)=C” and of course PA proves “DB(TCB)=D”
PA+5 proves “TCB(Masquerade)=C” (though this is actually irrelevant, as Masquerade will defect at this point whether the sanity check succeeds or fails)
It’s not that there’s anything magical about PA+4 in Masquerade; my mental image is that Masquerade uses a large value for N, so that it has time to find mutual cooperation with all but the very loopiest of opponents. If you delay it until PA+4, then it’ll just start from there, more or less.
Unless I’ve misunderstood Masquerade, there seems to be a version of SlowCooperateBot that Masquerade also cooperates with:
If that’s correct, then Wei’s comment seems to be a good summary of the reason why your approach works. But maybe I’m still missing something.