Thank you for providing those resources. They weren’t quite what I was hoping for, but they did help me realize that I hadn’t correctly described what I was looking for.
Specifically, if we use the first paper’s definition that “adversarially robust” means “inexploitable—i.e. the agent will never cooperate with something that would defect against it, but may defect even if cooperating would lead to a C/C outcome and defecting would lead to D/D”, one example of “an adversarially robust decision theory which does not require infinite compute” is “DefectBot” (which, in the language of the third paper, is a special case of Defect-Unless-Proof-Of-Cooperation-bot (DUPOC(0))).
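To make that concrete, here is a minimal sketch (Python is my own choice; the papers don’t mandate any language, and the function signatures here are illustrative) of DefectBot viewed as DUPOC(0): with a proof budget of zero, no proof of the opponent’s cooperation is ever found, so the bot always defects and is trivially inexploitable.

```python
def search_for_cooperation_proof(opponent_source: str, budget: int):
    """Stand-in for a bounded proof search (hypothetical); with
    budget 0 there is nothing to search, so it finds nothing."""
    if budget == 0:
        return None
    raise NotImplementedError("a real bounded proof search would go here")

def dupoc(opponent_source: str, proof_budget: int) -> str:
    """Defect-Unless-Proof-Of-Cooperation: cooperate only if the
    bounded search proves the opponent cooperates with us."""
    proof = search_for_cooperation_proof(opponent_source, proof_budget)
    return "C" if proof is not None else "D"

def defect_bot(opponent_source: str) -> str:
    """DUPOC(0): zero proof budget, hence unconditional defection.
    Trivially inexploitable -- and trivially uncompetitive."""
    return dupoc(opponent_source, 0)

print(defect_bot("def opponent(...): return 'C'"))  # D
```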
What I actually want is an example of a concrete system that is:

- Inexploitable (or nearly so): the system will never (or only rarely) play C against something that will play D against it (see the empirical check sketched after this list).
- Competitive: there is no other strategy which can, in some environments, get better long-term outcomes than this strategy by sacrificing inexploitability-in-theory for performance-in-practice in its actual environment. (For example, in the prisoner’s dilemma tournament back in 2013, the actual winner was a RandomBot, despite some attempts to enter FairBot and friends, though a lot of the bots in that tournament also had Problems.)
- Computationally tractable.
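Here is what an empirical (not proof-based) check of the first property might look like, as a hedged sketch: the harness, the strategy signature, and the probe are all my own illustrative choices, not part of any existing tournament code.

```python
# Probe a candidate strategy against an opponent that always defects,
# and count how often the candidate plays C into a D.
from typing import Callable

# A strategy maps (my_history, their_history) -> "C" or "D".
Strategy = Callable[[list[str], list[str]], str]

def always_defect(my_hist: list[str], their_hist: list[str]) -> str:
    return "D"

def exploitability_rate(candidate: Strategy, probe: Strategy,
                        rounds: int = 100) -> float:
    """Fraction of rounds in which `candidate` played C while `probe` played D."""
    c_hist, p_hist, exploited = [], [], 0
    for _ in range(rounds):
        c_move = candidate(c_hist, p_hist)
        p_move = probe(p_hist, c_hist)
        if c_move == "C" and p_move == "D":
            exploited += 1
        c_hist.append(c_move)
        p_hist.append(p_move)
    return exploited / rounds

# Tit-for-tat gets exploited exactly once (the opening C), then never again,
# so it counts as "nearly inexploitable" against this particular probe.
def tit_for_tat(my_hist: list[str], their_hist: list[str]) -> str:
    return their_hist[-1] if their_hist else "C"

print(exploitability_rate(tit_for_tat, always_defect))  # 0.01
```

Of course, a single probe only bounds exploitability from below; “inexploitable” in the papers’ sense quantifies over all opponents, which is exactly where the proof-theoretic machinery (and its compute cost) comes in.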
Ideally, it would also be:

- Robust to the agents making different predictions about the effects of their actions. I honestly don’t know what a solution to that problem would look like, even in theory, but “able to operate effectively in a world where not all effects of your actions are known in advance” seems like an important property for a decision theory.
- Robust to the “trusting trust” problem (i.e. the issue of “how do you know that the source code you received is what the other agent is actually running?”). Though if you have a solution to this problem you might not even need solutions to a lot of the other problems, because a solution here implies an extremely powerful already-existing coordination mechanism (e.g. “all manufactured hardware has preloaded spyware from some trusted third party that lives in a secure enclave and can make a verifiable signed report of the exact contents of the memory and storage of that computer”). A toy sketch of what such an attestation report might look like follows this list.
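As promised, a toy sketch of the report that hypothetical preloaded spyware might produce. A real scheme would use asymmetric signatures (so verifiers never hold the enclave’s key); the HMAC here just keeps the sketch self-contained, and every name in it is made up.

```python
# Hypothetical trusted-third-party attestation: the enclave hashes the
# machine's memory/storage image and signs the digest, so a counterparty
# can check that the source code it received is what is actually loaded.
import hashlib
import hmac

# In reality this key lives only inside the secure enclave; a verifier
# would check an asymmetric signature instead of recomputing an HMAC.
ENCLAVE_KEY = b"secret key held only inside the secure enclave"

def attest(memory_image: bytes) -> tuple[bytes, bytes]:
    """Enclave side: return (digest, signature) over the machine state."""
    digest = hashlib.sha256(memory_image).digest()
    sig = hmac.new(ENCLAVE_KEY, digest, hashlib.sha256).digest()
    return digest, sig

def verify(claimed_source: bytes, digest: bytes, sig: bytes) -> bool:
    """Counterparty side: check the signature, then check that the attested
    state matches the source the other agent claims to be running."""
    expected = hmac.new(ENCLAVE_KEY, digest, hashlib.sha256).digest()
    return (hmac.compare_digest(expected, sig)
            and hashlib.sha256(claimed_source).digest() == digest)

image = b"...the other machine's full memory/storage image..."
digest, sig = attest(image)
print(verify(image, digest, sig))  # True
```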
In any case, it may be time to run another PD tournament, perhaps this time with strategies described in English and “evaluated” by an LLM, since “write a program that does the thing you want” seems to have been the blocking step for many would-be participants in previous tournaments.
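For what that might look like mechanically, a hedged sketch: `query_llm` is a placeholder for whatever completion API the organizer would actually use, and the prompt format is purely illustrative.

```python
def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError("plug in the tournament's actual LLM here")

def play_round(my_strategy_en: str, their_strategy_en: str,
               history: list[tuple[str, str]]) -> str:
    """Ask the LLM to execute an English-language strategy for one round."""
    prompt = (
        "You are executing a prisoner's dilemma strategy written in English.\n"
        f"Your strategy: {my_strategy_en}\n"
        f"The opponent's (visible) strategy: {their_strategy_en}\n"
        f"History of (your, their) moves so far: {history}\n"
        "Answer with exactly one letter: C or D."
    )
    answer = query_llm(prompt).strip().upper()
    # Treat malformed output as defection so a confused LLM can't be exploited.
    return answer if answer in ("C", "D") else "D"
```

Whether the LLM executes strategies faithfully (and consistently for both players) would itself be one of the interesting outputs of such a tournament.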
Edit: I would be very curious to hear from the person who strong-disagreed with this about what, specifically, their disagreement is. I presume the disagreement is not with my statement that I could have phrased my first comment better, but it could plausibly be any of “the set of desired characteristics is not a useful one”, “no, actually, we don’t need another PD tournament”, or “we should have another PD tournament, but having the strategies be written in English and executed by asking an LLM what the policy does is a terrible idea”.
> Robust to the “trusting trust” problem (i.e. the issue of “how do you know that the source code you received is what the other agent is actually running?”).
This is the crux, really, and I’m surprised that many LWers seem to believe the ‘robust cooperation’ research actually works sans a practical solution to ‘trusting trust’ (which I suspect doesn’t exist), but in that sense it’s in good company (diamondoid nanotech, rapid takeoff, etc.).