Prisoner’s dilemma tournament results
The prisoner’s dilemma tournament is over. There were a total of 21 entries. The winner is Margaret Sy, with a total of 39 points. 2nd and 3rd place go to rpglover64 and THE BLACK KNIGHT, with scores of 38 and 36 points respectively. There were some fairly intricate strategies in the tournament, but all three of these top scorers submitted programs that completely ignored the source code of the other player and acted randomly, with the winner having a bias towards defecting.
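For reference, a bot of that type needs only a line or two of Scheme. A minimal sketch of a defect-biased random bot (not the actual winning submission, just the general shape described above, using the 80% defection figure mentioned further down):

    (lambda (opponent)
      ;; ignore the opponent's source entirely; defect roughly 80% of the time
      (if (< (random 10) 8) 'D 'C))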
You can download a chart describing the outcomes here, and the source codes for the entries can be downloaded here.
I represented each submission with a single letter while running the tournament. Here is a directory of the entries, along with their scores. (Some people gave me a name for the player, while others gave me a name for the program; I went with whatever they gave me, and if they gave me both, I list the player first and then the program.)
A: rpglover64 (38)
B: Watson Ladd (27)
C: THE BLACK KNIGHT (36)
D: skepsci (24)
E: Devin Bayer (30)
F: Billy, Mimic-- (27)
G: itaibn (34)
H: CooperateBot (24)
I: Sean Nolan (28)
J: oaz (26)
K: selbram (34)
L: Alexei (25)
M: LEmma (25)
N: BloodyShrimp (34)
O: caa (32)
P: nshepperd (25)
Q: Margaret Sy (39)
R: So8res, NateBot (33)
S: Quinn (33)
T: HonoreDB (23)
U: SlappedTogetherAtTheLastMinuteBot (20)
Yay, I wasn’t last!
Still, I’m not surprised that laziness did not pay off. I wrote a simple bot, then noticed that it cooperated against defectbot and defected against itself. I thought to myself, “This is not a good sign.” Then I didn’t bother changing it.
Frankly, I find this hilarious.
Can I suggest running a second tournament, including all of these bots plus new bots written after examining these bots? After a few iterations, we might beat random. Or we might develop all sorts of signaling quirks, but that would also be interesting.
I like this plan. I’d be willing to run it, unless AlexMennen wants to.
When I first announced the tournament, people came up with many suggestions for modifications to the rules, most of which I did not take for the sake of simplicity. Maybe we should try to figure out what rules will make it easier to write more effective bots, even if that would disqualify many of the bots already submitted to this tournament.
Am I the only one who sees a problem in that we’re turning a non-zero-sum game into a winner-take-all tournament? Perhaps instead of awarding a limited resource like bitcoins to the “winner”, each player should be awarded an unlimited resource such as karma or funny cat pictures according to their strategy’s performance.
No, not the only one. The same complaint was well received in the initial thread.
Maybe we should have a prisoner’s dilemma meta-tournament in which we define a tournament-goodness metric and then everyone submits a tournament design that is run using the bot participants in the previous tournament, and then use the rankings of those designs.
Wait: does anyone know a good way to design the meta-tournament?
I’ll list some suggestions.
Conduct a statistically significant number of trials. With few trials, the highest-ranked agents will be the ones with high variance in possible score, rather than the agents with the highest average score.
Allow no libraries, except for a very small whitelist (e.g. random numbers).
No timing, no threads. They make analysis too hard. The payoff might have to be changed to make it never in your best interest to make your opponent diverge, e.g. (-1, −1).
The rule “This is about Prisoner’s dilemma, not the language you’re working in. If you violate the spirit of this rule, your agent will be disqualified.”
(Possible) Change language to a restricted dialect of Python. You can analyze your opponent bytecode, if you wish, or run your opponent as a black-box. Programs with side effects (such as writing to global variables) and programs that use Pythonic trickiness (such as inspecting the stack) are verboten.
Most of these suggestions are okay, but don’t move away from Lisp. A very restricted Scheme (e.g. one way to set variables, one looping construct (I really like foldl, as N shows), etc.) would be good; the lexical scoping and general immutability make it one of the best Lisps for our purposes.
+1 for restricted Scheme. I suggest restricted Python because it is more accessible and would allow more people to participate, but personally I like Scheme a bit more.
On the other hand, many many people were turned off by Scheme last time, and few of them will be reading this post. I’d like to give them a shot, if only to get a more diverse crowd.
On looping constructs: Would that buy us anything? What if people use recursion for looping? Should we restrict recursion?
Question for the audience: would there be a benefit to using a programming language that allows only primitive recursive functions?
I think a programming language that only allows primitive recursion is a bad idea. One common pattern (which I think we want to allow) was for bots to simulate their opponents, which entails the ability to simulate arbitrary valid code. This would not be possible in a language which restricts to primitive recursion.
I think that the bots already submitted should be considered for grandfathering if the rules change to prohibit them.
I also have a specific entry in mind, but lack the expertise to go from ‘flowchart’ to ‘code’.
I think there would more people interested in playing if strategies could be submitted in pseudocode, so that would be great.
For entries that only eval(opponent(x)), for simple x, that seems reasonable. Strategies that rely on deeper analysis would need to look for specific things about the code that I, as a non-programmer, cannot describe accurately.
I would agree that they should compete even if they’re not awarded prizes for winning. Because we should at least have a comparison with performance against old bots. More data is pretty much always better.
It should also be trivial to calculate the score both in this heat and against all previously submitted bots.
Bring it on! Just to elaborate, in an iterated 1v1, a defectbot will reliably beat a randombot (4 to 1 if the randombot is unweighted). Or rather, any bot which detects a randombot and then always defects.
However, it’s not even necessary to beat the few randombots in a 1v1 to win the tournament, it is sufficient to beat a majority of other bots (including some cooperatebots for easy wins) to get ahead of the pack enough so that the results against the randombots don’t matter.
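To make the 4-to-1 figure concrete: assuming the payoffs were 3 for defecting against a cooperator, 2 for mutual cooperation, 1 for mutual defection, and 0 for cooperating against a defector, a defectbot facing an unweighted randombot averages 0.5·3 + 0.5·1 = 2 points per round, while the randombot averages 0.5·1 + 0.5·0 = 0.5.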
I went and ran this another 100 times, so I could see what it would look like without the variance. The mean scores are:
A: 32.03
B: 28.53
C: 32.48
D: 24.94
E: 28.75
F: 29.62
G: 28.42
H: 26.12
I: 26.06
J: 26.10
K: 36.15
L: 27.21
M: 25.14
N: 34.37
O: 31.06
P: 26.55
Q: 34.95
R: 32.93
S: 37.08
T: 26.43
U: 24.24
If you’re interested, here’s the code for the test (takes a day to run) and the raw output for my run (an inconvenient format, but it shows the stats for the matchups).
And for the lazy, here these scores are in sorted order (with original scores in parentheses):
S: 37.08 - Quinn (33)
K: 36.15 - selbram (34)
Q: 34.95 - Margaret Sy (39)
N: 34.37 - BloodyShrimp (34)
R: 32.93 - So8res, NateBot (33)
C: 32.48 - THE BLACK KNIGHT (36)
A: 32.03 - rpglover64 (38)
O: 31.06 - caa (32)
F: 29.62 - Billy, Mimic-- (27)
E: 28.75 - Devin Bayer (30)
B: 28.53 - Watson Ladd (27)
G: 28.42 - itaibn (34)
L: 27.21 - Alexei (25)
P: 26.55 - nshepperd (25)
T: 26.43 - HonoreDB (23)
H: 26.12 - CooperateBot (24)
J: 26.1 - oaz (26)
I: 26.06 - Sean Nolan (28)
M: 25.14 - LEmma (25)
D: 24.94 - skepsci (24)
U: 24.24 - SlappedTogetherAtTheLastMinuteBot (20)
Has someone gone through all the programs to figure out what strategy they’re supposed to be implementing, then checked that they in fact correctly implement that strategy? Also, given that the bots were allowed to use randomness, it would’ve been a good idea to average a bunch of rounds together.
A tournament like this would be much more interesting if it involved multiple generations. Here, the results heavily depended upon the pool of submitted strategies, regardless of their actual competitiveness, while a multiple-generations tournament would measure success as performance against other successful strategies.
H, I, and J are identical and received noticeably different scores. Maybe next time, repeat the whole process a few times to even out the RNG?
Also, it would be interesting to see the head-to-head matchups, especially to see if R’s complexity actually helped even though it seemed mostly aimed at special cases that never came up.
You can see the head-to-head matchups.
Yes. I will be taking a look at that later. Sorry that I made an unnecessary implied request there.
Hmm. Considering the heavy amount of random strategies, it seems like it might be worth iterating the tournament several more times and summing the point totals, particularly since C (one of the randoms) is above the three-way tie for 4th place between G, K, and N by only two points, and A and C also have a 2-point difference despite having functionally identical strategies (50% random Cooperate or Defect). I feel like that implies standard deviations may make analysis of the results premature.
Nonetheless, analyzing is fun, so I’ll do a little bit anyway on the fourth placers.
K appears to be functionally identical to a DefectBot. It has more complicated code, but that code never cooperated. (Side note: by chance, in this run, Margaret Sy’s winning strategy, random with an 80% bias toward defection, only cooperated with this bot and defected against all other bots.)
Also, G and N, despite being in 4th place, didn’t actually seem to successfully take advantage of any of the 3 pseudo-simplistic cooperate bots in this run (H, I, J—simulate something, then cooperate regardless) by defecting against them, which would seem to be one of the key ways a mind reader could try to score points for a win. I would guess the ‘pretend to be thinking’ purpose of the simulate call made them look too complicated to exploit in that manner, but I’m not that familiar with the programming language, so perhaps not? (Not that being H, I, or J’s type of bot was sufficient to win. Their points are low enough that it seems unlikely that they would surge ahead even after repetition.)
Anyway, thank you for running this! It was a neat idea and as above, I had fun looking at the results.
Edit: I realized a pronoun above was unclear and replaced it with nouns.
I think that kind of deviation was probably a large part of the motivation for those who submitted random-playing bots, since the contest rules specified one round.
I haven’t looked too closely at K, nor run any tests, but I have a slight suspicion it sometimes cooperated in simulation, despite always defecting on its turn.
As for the cooperatebots, there were multiple reasons I didn’t write N to exploit them, but not source complexity—they don’t even do anything besides cooperate; the middle term is just the function’s argument list.
I checked the behavior of all the bots that cooperated with K, and all but two (T and Q) would have always cooperated with a defectBot. Specifically the defect bot:
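(The exact test program isn’t reproduced here; a minimal defectBot in the tournament’s one-argument format would simply be:)

    (lambda (opponent) 'D)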
Sometimes they cooperated for different reasons. For example, U cooperates with K because K’s code contains “quine”, while it would cooperate with defectBot because defectBot’s code contains none of “quine”, “eval”, or “thread”.
Q, of course, acts randomly. T is the only one that doesn’t cooperate with defectBot but was tricked by K into cooperating. Though I’m having trouble figuring out why because I’m not sure what T is doing.
Anyway, it looks like K is a reasonable proxy for how defectBot would have done here.
T was supposed to do a bit more than it did, but it had some portability bugs so I hastily lobotomized it. All it’s supposed to do now is simulate the opponent twice against an obfuscated defectbot, defect if it cooperates both times, otherwise play mimicbot. I didn’t have the time to add some of the obvious safeguards. I’m not sure if K is exploiting me or just got lucky, but at a glance, what it might be doing is checking whether the passed-in bot can generate a perfect quine of itself, and cooperating only then. That would be pretty ingenious, since typically a quine chain will go “original—functional copy—identical copy—identical copy”, etc.
You’re right. K is a MimicBot with an additional check for proper quining. I primarily intended it to cause defection against CooperateBots, RandomBots, and others that don’t simulate their opponents meaningfully. I expected a lot more MimicBot variants and mutual cooperations...
True. I’m not saying that the random people did a poor job of selecting strategies, but Error was depressed that random behavior dominated. My and Luke_A_Somers’s point appears to be that the amount by which randomness dominates, if it is dominant, would be easier to determine with more rounds.
I don’t know quine at all and can’t easily understand exactly what all these guys are doing but:
This is obviously NOT a stable metagame; in both senses.
If we ran this tournament again, the random-bots would get utterly destroyed (since it’s very easy/fast to check for them and defect). If we ran this tournament with multiple rounds, where successful strategies survive and unsuccessful ones die, and there was even one defect-against-randombot entry in the mix, it would win if it didn’t die early.
My guess-happy summary of what happened is this: The core problem is very hard, so most people instead did something simple, often very simple (cooperate bot, defect bot, random bot—which won!), with a small number of people trying for real. The people trying for real, however, spent all their energy planning for other people trying for real and trying to fool them, rather than trying to act correctly against all the super-simple programs, because if you put in that much work you think other people will mostly put in at least some work (R has an entire long blog post of metagame and strategic analysis which is COMPLETELY wrong in exactly this way; in real life I do this a lot). That, combined with programming errors since the task involved was actually hard, prevented them from winning.
The lesson here, I think, is more about how people react to such tournaments than it is about the actual problem at hand; if anyone had assumed their opponents would either be simple or hopelessly complex, they could have written a program that defects against programs that don’t look at their code or otherwise are clearly safe to defect against, does something hard to exploit (and likely somewhat random) against everyone else, and won easily.
Your summary seems pretty accurate. I don’t think there were many programming errors outside of P’s meltdown, though. Also, as has been touched upon elsewhere in these comments, some of the failures to maximally exploit simple bots were necessary side effects of the attempts to trick complex bots, not just failures to anticipate there being a significant number of simple bots at all. (Sort of a quantitative instead of qualitative prediction mistake—we just thought there’d be more complex bots than simple bots).
One clue towards the general simplicity of the field is the generally atrocious formatting (e.g. close-parens on their own lines); not many people seem to have had too much experience with Lisp, let alone Lisp projects as complex as this game can get. That’s a shame, but it’s not like any non-Lisp languages are remotely suitable for the kind of source analysis we all wanted to see in this game.
If the tournament were run a second time, my expectation would be that such programs would abound (since writing “if they showed signs of life randomly do something but usually defect, else cooperate” is pretty easy), there’d still be a bunch of simple guys running around, there’d be a small number of people trying to play higher-level exploit games that would have better code than last time, and you’d likely see a level-2 bot (one that beats the obvious level-1 bot that wins the first tournament) come out ahead.
for reference, here’s what I was “planning”-I was all too aware i didn’t have time to learn a new language to actually implement it in. a pity i didn’t think to post this before the tournament results, even if it’s just pseudocode...
run(me, opponent) {   // start point
  noteTheTime(); if (me == opponent) { return C }
  defectbots[] = a list of primitive defectbots, each with a junk string value added
  run 5 times and record results: run(opponent, defectbots[i])
  if (opponent cooperated sometimes) {
    if (opponentIsClearlyMimicbot(opponent)) { cooperate() } else { defect() } }
    // mimicbot is assumed to be simple and easy to identify as such, but I'm not exactly sure how I'd test this
  checkTime()   // if the opponent takes way too long to run against a short defectbot with barely any baggage, it's going to have trouble with you; defect for both your sakes
  if (opponent cooperates always) {
    if (opponent is clearly cooperate-bot) { do I cooperate with C-bots in the hopes of cooperating with those who cooperate with C-bots, or do I take advantage of the schmuck? } }   // a decision I really never made...
  mimicbot = "...."; result = run(opponent, mimicbot); if (opponent defected) { defect() }
  checkTime()   // but this time, cooperate if things are taking dangerously long - MAYBE; it's somewhat reasonable for an opponent to take a while against a mimicbot, especially if opponent is a mimicbot with small epsilon
  psychbot1.0 = "...." + mimicbot + "..."   // psychbot 1.0 sends you against mimicbot, and then does what you do; only, I get to see the final result
  record(run(opponent, psychbot1.0))
  checkTime()   // possibly in a separate thread; depends on how quickly this whole thing goes
  psychbot1.1 = "..." + defectbot   // psychbot 1.1 also checks you against defectbot, like the main program; the purpose, if used, is to avoid cooperating with programs that defect on finding a "D"
  result = run(opponent, psychbot1.1); if (result == C) { return C } else { return D }
}
Yeah, pseudocode. For Java, not Lisp.
edit: clarified a few things in the intro.
I find cooperation with cooperation-bot utterly insane in context; I would have predicted less than three, but a good chance of at least one. In the real world problem, or a tournament with a lot of rounds, it’s potentially worth saying that since C-bots will die off quickly (although, if enough people cooperate with them, maybe they won’t) they’re effectively unlikely enough that you can safely use your response to them as signaling for other programs, but if that’s true, then your opponent should presumably know that and throw non-trivial cases at you instead.
Yeah... but I was more concerned about various TrollBots, including but not limited to HashBot, RandomBot, DitherBot, cooperate-IFF-you-are-a-defect-bot-or-cooperate-bot, and antimimicbot... and I considered “cooperates with C-Bot” to be extremely useless as a source of information, given the unusual demographics of LessWrong, while drawing up my plans.
One additional note on a second tournament: I notice bots weren’t played against themselves. Should this be done? It seems useful as a sanity check; I feel like a bot that won’t even cooperate against itself ought to get fewer points.
Why? In what circumstance would you ever be offered a choice to cooperate or compete with your concurrent self?
Probably in a respectable fraction of the universes in which other agents have a copy of your source code.
This is the reason my bot attempted to recognize itself. Attention anyone who plays in a tournament that continues to use this bot pool: put the token LightninRoy in your source!
In Robin Hanson’s em dystopia. :D
It seems fair that if I have to play against a copy of myself, then you have to play against two copies of myself (because it means the universe contains two instances of me). But make it millions, and then I play against approximately as many copies of myself as you do.
Good news: You win! :-)
Bad news: You lose! :-(
Best news: You cooperate. Twice! :D
Huh. I’m not entirely surprised that random behavior was dominant, but it’s certainly depressing. Nevertheless, I’m glad you ran this.
This may be more a reflection that programming interesting strategies which run each other’s code and fully do what you want them to do is tough. It might be interesting to look at the strategies which did look at the other’s code and see if any are responding in an obviously wrong fashion in that particular instance.
I wasn’t expecting there to be so much randomness. If I had been, I guess I would have spent the time to figure out how to test for it in some kind of loop and defect if the opponent was random. Looping in lisp is unnecessarily complicated though, so I didn’t bother...
That suggests that a large part of this behavior may have been due to limited amounts of programming time.
My own strategy, if I’d felt like joining, would have started by checking the opponent against defectbot; anybody who cooperates with even one of the 5 or so defect-variants is clearly insane and not to be trusted; defect immediately. Or maybe take a detour and check if they’re actually a mimicbot that got unlucky.
well...I might have played Hawkbot instead; Hawkbot spends 9.999 seconds dithering before finally defecting, so it does insanely well against any opponent that tries to run it.
Several of the bots using simulation also used multithreading to create timer processes so they could quit and defect against anyone who took too long to simulate.
I was also thinking of doing something similar, which was to infinite loop if the opposing program’s code was small enough, since that probably meant something more complex was simulating it against something.
Ah, so multithreading was allowed... so instead of dithering for 9.999 seconds, it would have called forkbomb(9.9 seconds).
Quinn appears to have submitted something similar. As far as I can tell, against CooperateBot it cooperates; otherwise, it waits 9 seconds before defecting. (The weird timing conditional in there should never return true if things are working properly, and checking the results, indeed, it defected against everything but the three CooperateBots.)
Unless I’m going insane, Luke_A_Somers’ comment below is incorrect. If you defect and your opponent times out, you still get 1 point and they get 0, which is marginally better than both getting 1 point in the case of mutual defection.
That was the purpose of my (sleep 9). I figured anyone who was going to eval me twice against anything other than CooperateBot is going to figure out that I’m a jerk and defect against me, so I’m only getting one point from them anyway. Might as well lower their score by 1.
My assumption might not have been correct though. In the original scoreboard, T (who times out against me) actually does cooperate with K, who defected against everybody!
The bizarre-looking always-false conditional was a long shot at detecting simulation. I heard an idea at a LW meetup (maybe from So8res?) that players might remove sleeps from each other’s code before calling eval. I figured I might as well fake such players out if there were any (there were not).
Anybody who runs the other program without an interrupt timer and so times out is, as you put it, “clearly insane”. Hawkbot gets 1 against any sane opponent that tries to run it.
I reread the payoff matrix some time after posting that....and the defect-bot test was only the first in a series of tests I had planned out.
Forget sanity—if your opponent fails to play, the most you can get is one point. Hawkbot is awful.
That wouldn’t work well since timing out causes both players to get 0. You’d do better just defecting.
It does? The payoff matrix had entries for... (double-checks original contest rules) ...OH. I see. I misread how the payoffs work when “other” occurs.
Collaborateurbot: Times out against everyone except the bot it is secretly helping. If it’s not against the letter of the rules, it’s fair game from a “rational = winning” perspective.
I suspect that sort of thing is why the rules forbade multiple entrants.
Are you saying my Grandma can’t submit an entry?
How will we tell her?
It is a nice mathematical model of the valley of bad rationality.
Fantastic work, thank you =)
For anyone else unpacking the zip file, note you’ll want to create a new directory to unzip it into, not just unzip it in the middle of your home folder.
Wait, someone actually submitted CooperateBot? That wasn’t just a default entry?
Thoughts on a possible second tournament:
I like Luke Somers’s suggestion that each matchup should be run several times to smooth out randomness (obviously, bots don’t get access to earlier runs, this isn’t iterated prisoner’s dilemma)
Should it include default entries? Perhaps a CooperateBot, a DefectBot, and a RandomBot? Or perhaps just a RandomBot?
Should we maybe restrict to a subset of Scheme to make things easier for those of us who want to do static analysis? :) (Yeah, yeah, I never actually submitted anything...) Or should we just rely on pre-announcements of “If you do this, I’m defecting”?
I was a bit irritated that no fewer than three players submitted CooperateBot, as I cooperated with it in hopes of tricking bots like submission B.
EDIT: More offense was taken at my use of the word “trolls” than was intended.
Why does submitting CooperateBot to a competition that does not include it make someone a troll? Would submitting DefectBot make one a troll, too?
(I believe the competition should have automatically included one CooperateBot and DefectBot each, and stated that this was the case at the beginning. I am sad there were three CooperateBots and no DefectBots.)
I think the idea is that someone submitting CooperateBot is not trying to win. But I find this a poor excuse. (It’s why I always give the maximum possible number whenever I play “guess 2/3rds of the average.”) I agree that the competition should have been seeded with some reasonable default bots.
Submitting DefectBot makes you a CDT agent, not a troll.
As one of the players who submitted a cooperatebot (yes, I see the irony), allow me to explain my reasoning for doing so. I scoped the comments to see what bots were being suggested (mimicbots, prudentbots, etc) and I saw much more focus on trying to enforce mutual cooperation than trying to exploit bots that can be exploited. I metagamed accordingly, hypothesizing that the other bots would cooperate with a cooperatebot but possibly fail to cooperate with each other. My hypothesis was incorrect, but worth testing IMO.
Thank you.
There’s a problem with that logic. Did you really expect the people designing exploitation bots to talk about them publicly?
How do we draw the line? Tit-for-Tat is very simple, yet does very well. Arguably before knowing how it performs it could be considered a troll.
Considering this was an experimental tournament, learning how certain strategies perform against others seems far more interesting to me than winning, and I can’t imagine any strategy I would label as a troll submission. Even strategies solely designed to be obstacles are valid and valuable contributions, and the fact that random strategies skew the results is a fault of the tournament rules and not of the strategies themselves.
Can you elaborate on this?
You are right that I used the inflammatory t-word because CooperateBot submitters are probably not trying to win. I certainly expected to see DefectBots (or CliqueBots from optimists), and agree that the competition should have been seeded with both CooperateBots and DefectBots.
But I don’t understand this at all:
To me, this looks like t-wording the people who play 0.
Are we thinking of the same game, where the payout is constant?
Yes, that game. My point is that complaining “that’s not fair, X player wasn’t playing to win” is a failure to think like reality. You know that you’re playing against humans, and that humans do lots of things, including playing games in a way that isn’t playing to win. You should be taking that into account when modeling the likely distribution of opponents you’re going to face. This is especially true if there isn’t a strong incentive to play to win.
Yes, and I agree with this. I’m familiar with Straw Vulcanism and accept that guessing incorrectly is my mistake, not others’.
It seems anger and frustration were read into my comment, when in fact I was merely surprised, so I’ve edited out the offending t-word.
I would have liked to see a proper DefectBot as well; however, contestant K defected every time and only one of the bots that cooperated with it would have defected against DefectBot, so it makes a fairly close proxy.
Be sure to distinguish the “think you’re wrong” and the “find it offensive” components of the responses.
Yes, the “here’s why Quinn is wrong about CooperateBot being a troll submission” comments were valuable, so I don’t regret provoking them. Presumably if my comment had said “players” instead of “trolls” from the outset, it would have been ignored for being inoffensive and content-free.
But provoking a few good comments was a happy accident from my perspective. I will avoid casual use of the word “troll” on this site, unless I have a specific purpose in mind.
As one of the people who submitted a CoopBot. I had three primary motivations.
The first came from a simulation of the competition I created in Python, where I created copies of a lot of the strategies mentioned in the thread comments (excluding of course the ones with a nebulous perfectlyPredictOpponent() function). Given that a lot of the strategies discussed involved checking for cooperation with a coopBot, playing an actual coopBot ranked relatively highly in my test competitions, failing primarily against random or defectbots, and it seemed less exploitable by higher level bots, if they still wanted to be able to trick the lower level bots.
The second motivation was to use the coopBot as a baseline to see how well the other bots really did perform.
The third motivation was simple: I don’t really know Scheme, so most of my more complex bots from my Python simulation proved too difficult to implement in the time I allocated to work on this.
There exist LW readers who believe that Cooperate is the dominant strategy in PD.
Humans don’t only play to win; that, coupled with a comparatively obscure choice of programming language, means simple bots will be in the majority.
Plus even a flowchart of the ideal strategy is hard to create without using black boxes or brute-force iterating every possible proof of something.
EDIT: Oh, and the humans played to win a game different from the nominal game. That distinction is important, because you can only observe them in the actual game and not the nominal one. (At CFAR workshops, PD for low stakes resulted in universal cooperation when it was done openly; the first time it was done semi-anonymously, there was a defector. There were several other differences as well, but the nominal stakes were similar.)
Can you provide links to comments that are examples of this? If so, I’d like to look at them to confirm and then to make sure I have downvoted and/or corrected them as necessary. Without seeing the examples I assign roughly 0.5 probability to them existing, with most of the remaining probability mass going to “There exist comments that defy CDT in a way that some readers may consider to be equivalent to declaring that Cooperate is dominant but is in fact not a claim about strategic dominance at all”.
I had that discussion in person, so I can’t point to comments. There may have been some confusion about the claims made, such that the claim I understood them to be making is different from the claim that they made, and the specific game in question was 2-iteration PD with the nonstandard payoff matrix (50,50; 100,0; 0,100; 25,25), but I think they were using cached thoughts for the discussion, rather than something that applied to the specific case but not the general.
Anyone who submits a strategy that cooperates with CooperateBot has no place shaming others as ‘trolls’ for cooperating. If you are trying to ‘trick’ others and guess their responses incorrectly then the joke is entirely on you!
:(
I didn’t realise that when you said programs should assume eval would work normally, that meant I needed to make my simulated eval take a namespace as an optional argument (and use the proper namespace containing the standard library as a default). So my program crashed whenever it simulated something that uses (eval x). There goes my clever algorithm, lol.
What did you intend your program to do? I tried to work it out from the source code but it scared me off.
Well, now the contest’s over I can publish an expanded version of my source code with comments. Basically it was a sort-of variation of PrudentBot designed to get cooperation from So8res’s stupid pseudo-mimics. The original variation had two conditions: if ((they cooperate with us if we cooperate) and (they defect against defectbot)) then cooperate; else defect. The second condition allows you to exploit stupid bots like cooperatebot, but since doing that causes all justicebots, etc. to defect, I ended up removing that. (Though, looks like that was a bad idea in the end, since there were three cooperatebots submitted!)
Well, it’s not quite a true PrudentBot, since PrudentBot tries to prove unconditional cooperation, rather than cooperation-if-I-cooperate.
Sorry about that. I was trying to warn people ahead of time about bugs so they could fix them, but I didn’t run sophisticated enough tests to catch that.
Oh well. I don’t think I would have won anyway. The problem seems to be that the REPL behaves very differently to non-REPL contexts, which I certainly wasn’t expecting. Actually, it looks like eval in the REPL uses the REPL namespace as a default (so for example you can type (eval '(define a 2)) (write a) and see a 2 printed out), which seems rather unconscionable to me¹, especially since the REPL namespace includes a boatload of miscellaneous modules such as racket/pretty and racket/function and probably others as well.
Possibly next time it would be useful to publish a “test harness” ahead of time which runs contestants against each other in a clean (or at least well-specified) namespace, and make it clear that the actual contest would be run using this code.
¹ Actually, now that I’ve noticed this, it looks like you could use this to write a bot that corrupts the global (REPL) namespace, and therefore the namespace of anyone trying to simulate it, maybe even forcing the opponent to cooperate by rewriting if or something. Ha!
I did look briefly for ways of forcing a bot running my program as a simulation to cooperate, but that sort of thing needs way too much knowledge of the programming environment for me to have a hope of succeeding.
An early attempt I had for a bot tried to save state between simulations inside the RNG seed, but that didn’t work out either...
If only THE BLACK KNIGHT had won the tournament, I would’ve revealed my identity then. This way, the identity of THE BLACK KNIGHT shalt forevermore remain a mystery. Giddyup!
In more seriousness though, how could this happen, being worse than random? Just lacking a clause filtering for the string ‘random’ (that could be exploitable, too), or alternatively, not simulating the opponent, say, 10 times, then going with the majority behavior if there is one, and if there’s not (6-5, 5-5, 4-6), treating the opponent as randombot. Won’t filter all randombots, of course. Also won’t explain why another entry didn’t beat out most of the rest of the pack, so as to become impervious to the few detrimental results against the minority of randombots. In conclusion, what happened Ani?
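For what it’s worth, the “simulate several times and check for mixed answers” test is only a few lines of Scheme. A rough sketch, assuming the standard one-argument convention and that eval works in the running namespace, with opponent-source and my-source as placeholder names:

    (define (looks-random? opponent-source my-source trials)
      ;; run the opponent repeatedly against the same input; mixed answers suggest a randombot
      (define results
        (for/list ([i (in-range trials)])
          ((eval opponent-source) my-source)))
      (and (memq 'C results) (memq 'D results) #t))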
Given Rice’s theorem, there is no source code analysis that is guaranteed to tell you anything interesting about your opponent, unless your opponent has been written in such a way as to facilitate that analysis. This suggests two possible modifications of the competition.
Firstly, the rule could be imposed that source code is unanalysable. A competing program will not be furnished with its opponent’s source, but only an opaque blob. The only operation they can perform on this blob is to supply it with another blob and see what answer it gives. Each program is given access to itself (as an opaque blob), and can contain within itself whatever other blobs it wishes to use for probing purposes. Note that if every program begins a contest by running its opponent on some example, then no-one terminates and everyone loses. Therefore a program must sometimes make a decision without having run its opponent on any examples. This looks like it severely limits the usefulness of simulating the opponent to see what it will do, perhaps to the point where it is impossible to gain anything from it, and the competition reduces to an ordinary PD tournament.
Alternatively, source code would be revealed, but in addition each program would be allowed to furnish machine-checkable proofs of assertions about itself. The machine checking would be part of the infrastructure of the competition, so competitors would be able to have complete trust in these assertions.
Or one could combine these: one receives one’s opponent as an unanalysable blob together with some set of guaranteed true assertions about it, supplied by the creator of that program and verified by the competition infrastructure. The idea is that skill at analysing someone’s source code is secondary to the point of the exercise.
What verifiably true assertions would it be useful for a competitor to make about itself? It would be easy to write code of which it can be proved “I always cooperate”, “I always defect”, “I play Tit For Tat”, or any of the other ordinary PD strategies. What verifiable assertions can a program make about the sort of opponent it will cooperate or defect with? “I will cooperate with anyone who cooperates with me” would not be an admissible assertion (by Rice’s theorem). Nor would “I will cooperate with anyone who verifiably asserts that they cooperate with me”. But “I will cooperate with anyone who verifiably asserts that they cooperate with anyone verifiably asserting this sentence” might be, if the circularity can be handled.
ETA: But “I will cooperate with anyone who verifiably asserts that they cooperate with anyone verifiably asserting a sentence logically equivalent to this one” won’t work. So there would have to be a negotiation phase before the competition in which people publish the assertions they want to make about themselves, so that people can coordinate to agree on the exact formulation of the mutually referential sentences they want to be seen to be true of them.
Thinking about this further, given enough assertions, there’s no need to have the programs at all. Let the agents be just the assertions that they make about themselves. Each agent would consist of a set of axioms, perhaps written in Prolog, about who they cooperate or defect against, and running a contest between two agents would just be a matter of taking the union of the two sets of axioms and attempting to deduce (by the standard Prolog proof-search) what choice each one makes.
What happens if the two sets of axioms are individually consistent, but together are inconsistent?
Deem both of the agents to have not terminated?
How will you know? The set of consistent axiom systems is undecidable. (Though the set of inconsistent axiom systems is computably enumerable.)
You just run the Prolog (or whatever logic system implements all this), and it either terminates with a failure or does not terminate within the time allowed by the competition. The time limit renders everything practically decidable.
Perhaps it terminates in the time required proving that A defects and B cooperates, even though the axioms were inconsistent, and one could also have proved that A cooperates and B defects.
The competitors know the deterministic proof-search algorithm (e.g. that of Prolog), and the first answer that it produces within the time limit is the answer.
What’s wrong with “I will cooperate with anyone who verifiably asserts that they cooperate with me”? A program could verifiably assert that by being, e.g., cooperatebot. A program could be written that cooperates with any opponent that provides it with a proof that it will cooperate.
The word “me”. By Rice’s theorem, they can’t tell that they’re dealing with someone computationally equivalent to me, and there’s no other sense of my “identity” that can be referred to.
Athough that could be added. Have all competitors choose a unique name and allow them to verifiably claim to be the owner of their name. Then “I will cooperate with anyone who verifiably asserts that they cooperate with me” can work if “me” is understood to mean “the entity with my name”. Discovering a blob’s name would have to be a second primitive operation allowed on blobs.
ETA: I think I missed your point a little. Yes, a cooperate-bot verifiably asserts that it cooperates with everyone, so it is an entity that “verifiably asserts (something that implies) that they will cooperate with me.” And there will be other bots of which this is true. But I’m not sure that I can verifiably express that I will cooperate with the class of all such bots, because “the class of all such bots” looks undecidable.
Your source code is your name. Having an additional name would be irrelevant. It is certainly possible for bots to prove they cooperate with a given bot, by looking at that particular bot’s source. It would, as you say, be much harder for a bot to prove it cooperates with every bot equivalent to a given bot (in the sense of making the same cooperate/defect decisions vs. every opponent).
Rice’s theorem may not be as much of an obstruction as you seem to indicate. For example, Rice’s theorem doesn’t prohibit a bot which proves that it defects against all defectbots, and cooperates with all cooperatebots. Indeed, you can construct an example of such a bot. (Rice’s theorem would, however, prevent constructing a bot which cooperates with cooperatebots and defects against everyone else.)
Sorry if this is a stupid question, but this tournament looked to me like a thinly disguised version of:
“Construct a robot that can read code and interpret what it means.”
which is a Really Hard Problem.
Is that not a fair description? Was there some other way to approach the problem?
The only way I can see to go about constructing a GOOD entrant to this is to write something that can take as its input the code of the opponent and interpret what it will actually DO, that can recognize the equivalence between (say):
return DEFECT
and
if 1:
    return DEFECT
return COOPERATE
and can interpret things like:
if opp_code == my_code:
    return COOPERATE
return DEFECT
And I have no idea how to go about doing that. From the fact that the winning entrants were all random, it seems safe to say that no entrants had any idea how to go about doing that either.
Am I missing something here?
One good way to interpret code is to run the code with “eval”, which many submitted bots did. This method has no problems with the examples you gave. One important place it breaks down is with bots that behave randomly. In that case a robot may, by chance, be simulated to cooperate and defect in whatever sequence would make it seem worth cooperating with even if it actually ends up defecting. This, combined with a little luck, made the random bots come out ahead. There are ways to get around this problem, and a few bots did so, but they still didn’t do better than the random bots because they had less potential for exploitation in this particular pool of entrants.
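For illustration, the eval-based simulation approach, in the tournament’s one-argument format, looks roughly like this (a sketch only; it glosses over namespace setup and non-terminating opponents, and the DefectBot test target is just an example):

    (lambda (opponent)
      ;; simulate the opponent against a DefectBot; if it cooperates anyway, exploit it
      (if (eq? 'C ((eval opponent) '(lambda (x) 'D)))
          'D
          'C))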
I agree, my fellow top-ranking-non-source-ignoring player. Saying “nobody could do any better than randomness in this tournament” is strictly true but a bit misleading; the tiny, defect-happy pool with almost 20% random players (the top 3 and also G; he just obfuscated his somewhat) didn’t provide a very favorable structure for more intelligent bots to intelligently navigate, but there was still certainly some navigation.
I’m pretty pleased with how my bot performed; it never got deterministically CD’d and most of its nonrandom mutual defections were against bots who had some unusual trigger condition for defecting based on source composition, not performance, or had very confused performance triggers (e.g. O—why would you want to play your opponent’s anti-defectbot move when you determine they cooperate with cooperatebot?). Some of its mutual defections were certainly due to my detect-size-changes exploit, but so were its many DCs.
Perfect simulation is not only really hard, it has been proven to be impossible. See http://en.wikipedia.org/wiki/Halting_problem
Some notes I’ve made on the entries:
(Note, this is largely from the perspective of “I want to write a bot that works by doing lots of static analysis to the other bots”, since I was working on such a bot. In fact I can see now that the idea I had in mind didn’t actually make sense, but, oh well, I have an idea for how to fix it for the next contest...)
Mutation: Three of the bots—K, L, and R—used mutation. N and P used a function “namespace-set-variable-value!”; not sure what that does so I’m not sure if it’s really mutating things in context. Nonetheless, point is, I might have to deal with mutation next time around. (Or I can just decide “mutation is defection”. That might be appropriate.)
Do loops: Nobody used do loops… but L used a for loop, which isn’t even standard Scheme, IINM. My program totally would have choked on that. (I had just planned on counting do loops as a defection, meanwhile...)
Syntax redefinition: Thankfully, nobody did this. As for whether my (ultimately empty) declaration that I’d consider such a defection had anything to do with this, I don’t know.
Fine-grained randomness: One program, K, called random with no argument; no program called random with an argument higher than 100. (If I can actually get this thing working for the next contest, btw, I entirely intend to count calling random without an argument as a defection, as well as with an argument more than about 1000 or so.) K didn’t actually need randomness that fine, note; it was just generating a 1-in-100 chance.
Multithreading: All four of the “big programs”—K, N, P, and R—used this. Unsurprising, seeing as it’s the only way to do an apply-with-timeout. I’m kind of surprised Racket doesn’t natively provide a function for this, seeing as it provides time-apply already, but unfortunately it doesn’t, and you can’t even use time-apply to build one; you have to use multithreading. P, I noticed, built such a function itself, called timeout-exec; I don’t know about the others. (A rough sketch of such a helper follows after these notes.)
Other timing things: S slept for 9 seconds, for reasons I don’t understand (see wedrifid’s and BloodyShrimp’s comments about why such a program is a bad idea). It then had some weird conditional involving time, but the conditional could only come out one way, so I’m not sure what’s going on there. Just some obfuscation? That seems like a bad idea. (Not related to time, but G also had a weird conditional that could only come out one way. Not sure what’s going on there either.)
Other threading things: T sleeps, yet doesn’t seem to use timing or multithreading anywhere? Is this some subtle “this has an effect if I’m being run in another thread” thing?
Quasiquotation: All four of the “big programs” K, N, P, and R used quasiquotation. OK, this is probably not too worthy of note, except writing the quasiquote handler was one of the places I got stuck when writing my program (in retrospect, it was certainly not the biggest problem). I contemplated counting quasiquotes as a defection—and would have done so had I finished the rest of the program except for that—but that seemed and now even more so seems unreasonable, especially since my own program was going to use quasiquotes. (Why do quasiquotes exist? Well, OK, I get why they exist. I’d just like to point out that—if I’m understanding correctly, which I might not be—quasiquotes are never necessary, and, aside from quoting symbols, neither is quote, because there exists the list function, which is an ordinary function call and I could have handled like an ordinary function call. Oh well. I’ll probably take the time to get my quasiquote handler working next time… in fact I think I see how to fix it already...)
Delay, force, call/cc: Thankfully no one tried to use any of these! Edit: Similarly with multiple-values stuff.
(I call K, N, P, and R the “big programs” since they were the ones long enough that I didn’t take the time to make a detailed analysis of what they do.)
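Here is a rough sketch of an apply-with-timeout built on Racket threads (not any contestant’s actual helper such as timeout-exec; just one way to do it):

    (define (apply-with-timeout seconds proc . args)
      ;; run proc in a worker thread; give up and return #f if it takes too long
      (define result-ch (make-channel))
      (define worker
        (thread (lambda () (channel-put result-ch (apply proc args)))))
      (define result (sync/timeout seconds result-ch))
      (kill-thread worker)
      result)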
In a pool with a sufficient number of close-to-mimicbots, considering use of language features you didn’t anticipate or can’t handle to be defections seems like a good way to get too many mutual defections to win.
Also, not sure what “mutation” means in context. If you mean “almost quine yourself, but write in a few differences”, N did that; I can’t think of any other reasonable meaning that applies to N. I used namespace-set-variable-value! merely so that eval could see my variables; in most Lisps I’ve used in the past I could have omitted it entirely.
By mutation I mean set!, or anything else that can change the value of a variable; things ending in ‘!’. namespace-set-variable-value! does technically mutate things but looking at the way people actually used it I don’t think it would have caused a problem?
Ah, mutation in the “mutable state” sense. When I was doing some light experimenting with static analysis in the early days of the contest, I looked for variables storing anything involving eval (or involving another variable storing anything involving eval, etc.), and just treated (sounds like that should be “trought” or something) set! calls as define calls—another chance for a variable to become contaminated with evalness. Could you give an example of a case where set! makes things harder to analyze?
OK. Let me explain what the basic idea of my program was. The idea was to analyze the opponent’s program in an attempt to answer the following question: What is the probability that I can affect the outcome of their program? Now that I’ve thought about this some more, this makes less sense than I initially thought; if there’s a second round (and I have time to program something as complicated as this), I’m going to try the rather more sensible “How much can I affect the probability distribution of their program’s output?” But for now let’s go with the original idea, even though it doesn’t make sense in general.
So the idea was to compute this probability, p, and then cooperate if p>=1/2. (The number 1⁄2 is not arbitrary and is based on the particular values for T, R, P, and S used. Actually if the numbers were different, there wouldn’t be a single p, it would depend on other things, but I’m not going to go into that, because this strategy doesn’t actually make sense.)
So e.g. the strategy should defect against CooperateBot, DefectBot, and any sort of RandomBot; it would cooperate with (and get defected against by) Watson’s entry; it would cooperate with (and be cooperated with by) what So8res called a “rank N MimicBot”, so long as N>=2; and it would cooperate with what solipsist called a MimicBot so long as its “cooperate anyway” probability was at most 1⁄2, otherwise it would defect.
OK. So let’s say I’m running my program and it comes up against one of these MimicBots. It gets to a part that looks like
What does it do? Well, it has to check both branches and see that in the first, which occurs with probability 1⁄100, it can’t affect the outcome, and in the second, which occurs with probability 99⁄100, it can. (Really it’s a bit more complicated than that—how does it get the probabilities? But A. the strategy doesn’t actually quite make sense anyway and B. I’m trying to keep this simple.)
Thing is, in order to check both branches, it basically has to (partially) run them. To be honest, “static analysis” is maybe not the best description of what my program was supposed to do—it wasn’t so much “statically analyze the given program”, so much as “statically transform it into a different program, and then run that”. It has to partially run them so that it’s not fooled by conditionals like (if (= 0 1) 'C 'D); it shouldn’t take branches that are simply never going to happen.
And that leads to the problem—suppose instead it were to come to
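(Again, a placeholder reconstruction rather than anyone’s actual code, with the same assumed names:)

    (if (= (random 100) 0)
        (set! result 'C)
        (set! result (eval (list opponent (list 'quote my-source)))))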
Now running both branches is no longer going to yield a sensible result. I’m not sure what it would yield, but it wouldn’t yield anything helpful. It gets worse if instead of being the result, the variable set is some sort of intermediate...
I still don’t see where set! is relevantly different from define… e.g.
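(Presumably something like:)

    (define result
      (if (= (random 100) 0)
          'C
          (eval (list opponent (list 'quote my-source)))))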
In this case (notionally), the (transformed) conditional would end up returning a data structure representing “probability 1⁄100 of always C, probability 99⁄100 of C or D depending on what I do”, and then result would be set to that data structure. The (transformed) define itself doesn’t return a value, but that’s OK, it’s not supposed to, it’s just supposed to influence the value of something else.
In the above case, result would first be set to a data structure representing “probability 1 of C” and then to a data structure representing “probability 1 of C or D depending on what I do”. Then the (transformed) conditional itself would return something undefined, since the return value of set! is undefined. (And OK, it wasn’t supposed to return a value either, but its influence over other parts of the program has been screwed up.)
Basically, my (start of a) program was making the assumption that the return value of a thing was what was important—or at least that, while side effects might exist, they wouldn’t screw things up elsewhere. (With a few exceptions for special side-effecty functions which I intended to handle specially.) But allowing mutation would mean having to write my program to keep track of all of the simulated variables, whereas without it I can just focus on return values and let the Scheme interpreter handle that.
First of all, thanks for running the contest Alex and congrats to the winners.
Random behaviour was obviously a good strategy, but I don’t feel it teaches us much about making better bots. If I had suspected so much random behaviour, defecting whenever it’s detected would have been a good additional test.
Hmm, three way tie for fourth place.
I thought So8res was trying to set up subtle vulnerabilities for the mimicbot clique to exploit himself! My own exploit (store their size and see if it changes; if so defect, on the assumption that most who change their size will only change it once; play small-n mimicbot so you can induce cooperation on their turn and when the exploit doesn’t fire at all) seems to have cut the problem a little cleaner, picking up more nonrandom DCs. I had essentially thought of this exploit in the first few days of the contest, but I had thought it would set me up for a few too many CDs (ended up getting one, but it was random) until So8res posted his tutorial, at which point it seemed like there’d be enough paranoid mimicbotters choosing a random rank on their first move (for example) and leaving the rank-choosing machinery out of their later versions for my exploit to be profitable…
...and instead my DCs were on non-quiney bots which ran simple bot tests, like the one in the contest rules’ example. I did anticipate I’d pick these up, but I didn’t think there’d be so many of them.
Anyway, this was good fun. Thanks for hosting, AlexMennen, and thanks for playing, everyone else.
Sorry if this was answered somewhere else, but do you plan on hosting another tournament? If so, when?
Undecided. I was kind of disappointed with the results of this one, but I’m not optimistic that another one would end up any better unless it has a better structure. I plan to run another tournament if I come up with a way to change the rules that would make it easier to submit effective programs, which I have not made any progress on so far. Some people have made suggestions in the comments, but they range from probably not helpful to difficult to implement.
Would-be submitters have a lot more information going forward. In particular: if you want to win, you have to be able to beat a random bot.
COUNTIF could have made that xls document work without the third and fourth table.
One more question about rules for a second contest: Do we want to keep the rule as stated that “Each player will submit a file containing a single Scheme lambda-function. The function should take one input.”, or should we relax it to “Each player will submit a file containing a single Scheme expression, that must evaluate to a function taking one input”?
Benefits of the latter: Allows quines to be written less artificially (although, if we switch to the two-argument version, this becomes unimportant), is just generally more flexible.
Benefits of the former: Might allow easier analysis (you can more easily find the variable name your program is being called by—though it’s probably still possible to obfuscate that quite a bit if you want).
I would have expected a few more metaevaluator bots which clean up the environment to prevent detection. Of course this can be an expensive strategy, and it certainly is in programmer time. A metaevaluator bot would probably have broken several recursion-detection strategies, or could even have defined set! out of the language.