This is a rather interesting way to look at the IPD: it seems like there’s value here.
If you’re not familiar with Markov chains, read up on them first or this is not going to make much sense.
The basic framework is that the IPD can be played by only remembering 1) what happened last round and 2) what to play based on what happened last round. A player’s strategy can then be written as a vector of four probabilities: how likely they are to cooperate after each possible outcome of the previous round (CC, CD, DC, DD, reading the player’s own move first), which they call p for the first player and q for the second.
By combining both vectors, you can determine the transition matrix from one round to the next: a 4-by-4 matrix whose rows each sum to 1. If CooperateBot (1 1 1 1) is playing DefectBot (0 0 0 0), then regardless of what the last round was, the next (and all future) rounds will be CD. If CoinBot (.5 .5 .5 .5) plays CooperateBot, then the next round will be CC with probability .5 and DC with probability .5.
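As a concrete sketch of that construction (Python; I’m assuming the convention that each player’s vector is read from their own point of view, own move first, which is what makes the examples here line up):

```python
import numpy as np

# Joint states from X's point of view, in the order CC, CD, DC, DD.
# p is X's strategy, q is Y's; each is "probability I cooperate, given last
# round's outcome read from my own side" (own move first).
SWAP = [0, 2, 1, 3]  # re-index a joint (X, Y) state into Y's point of view

def transition_matrix(p, q):
    """4x4 Markov transition matrix over the joint states CC, CD, DC, DD."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    M = np.zeros((4, 4))
    for s in range(4):
        x = p[s]        # prob. X cooperates after joint state s
        y = q[SWAP[s]]  # prob. Y cooperates after the same state, seen from Y's side
        M[s] = [x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y)]
    return M

cooperate_bot = [1, 1, 1, 1]
defect_bot    = [0, 0, 0, 0]
coin_bot      = [.5, .5, .5, .5]

print(transition_matrix(cooperate_bot, defect_bot))  # every row is CD with certainty
print(transition_matrix(coin_bot, cooperate_bot))    # every row splits CC/DC, .5 each
```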
Things get interesting when the players use strategies that actually make use of the memory. When CoinBot plays TitForTat (1 0 1 0): if the last round was CC or CD, the next round will be CC with probability .5 and DC with probability .5; if it was DC or DD, the next round will be CD with probability .5 and DD with probability .5.
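Continuing the sketch above, CoinBot against TitForTat reproduces exactly those rows:

```python
tit_for_tat = [1, 0, 1, 0]
print(transition_matrix(coin_bot, tit_for_tat))
# rows CC and CD: [.5, 0, .5, 0]  -> next round is CC or DC, .5 each
# rows DC and DD: [0, .5, 0, .5]  -> next round is CD or DD, .5 each
```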
You can then do Markov chain analysis on the transition matrix (and the payoffs) to figure out what the long-run average benefit to each player will be. (When the chain has a single closed communicating class, this is easy; when it doesn’t, as with two TitForTat players facing each other, the answer depends on the initial state. Notice that TitForTat’s dynamics can be expressed in this form, but not its rule of cooperating on round 1.)
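Here’s a minimal version of that calculation, continuing the sketch above, with the usual payoffs (5 for temptation, 3 for mutual cooperation, 1 for mutual defection, 0 for the sucker); it only applies when there’s a unique stationary distribution, per the caveat above:

```python
PAYOFF_X = np.array([3, 0, 5, 1])  # X's payoff in states CC, CD, DC, DD
PAYOFF_Y = np.array([3, 5, 0, 1])  # Y's payoff in the same states

def long_run_scores(p, q):
    """Long-run average scores via the stationary distribution; assumes the
    chain has a single closed communicating class (unique stationary dist.)."""
    M = transition_matrix(p, q)
    w, v = np.linalg.eig(M.T)
    pi = np.real(v[:, np.argmax(np.real(w))])  # left eigenvector for eigenvalue 1
    pi = pi / pi.sum()
    return pi @ PAYOFF_X, pi @ PAYOFF_Y

print(long_run_scores(coin_bot, tit_for_tat))  # both average 2.25
```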
This is where the evolution comes in: if players submitted p and q and then went to lunch, you could calculate the score and be done. But the players are allowed to vary p and q as the game progresses, and a natural choice is simple steepest ascent on their own long-run score. (This is actually a spot where the paper is a bit fuzzy, and could stand to be more explicit about what they mean by “theory of mind.” Players will clearly need a memory longer than one round to run any sort of gradient-following algorithm, and it’s much easier to see how the gradients are calculated if both players are provided with p and q, not just a history of how the game has gone.)
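As a rough sketch of what that greedy adjustment could look like (assuming, as suggested above, that Y is simply handed p and q rather than inferring them from play; the finite-difference gradient, learning rate, and the clipping that keeps q strictly inside (0,1) so the chain stays nicely behaved are my own conveniences, not anything from the paper):

```python
def greedy_update(p, q, lr=0.05, h=1e-5, eps=1e-3):
    """One step of naive finite-difference ascent on Y's own long-run score."""
    base = long_run_scores(p, q)[1]
    grad = np.zeros(4)
    for i in range(4):
        bumped = np.array(q, float)
        bumped[i] += h
        grad[i] = (long_run_scores(p, bumped)[1] - base) / h
    # Stay strictly inside the hypercube so the chain keeps a unique stationary dist.
    return np.clip(np.array(q, float) + lr * grad, eps, 1 - eps)

# Y starts as CoinBot and greedily climbs its own score against a fixed, mixed X.
x_fixed = [.8, .3, .6, .1]  # an arbitrary interior strategy, purely for illustration
y = np.array(coin_bot, float)
for _ in range(200):
    y = greedy_update(x_fixed, y)
print(y, long_run_scores(x_fixed, y))
```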
And here’s the fascinating result: if player X chooses a p and then does not change it, and player Y sets their q to extract the maximum possible score out of the situation, then X can do better than the mutual-cooperation value of 3, because Y does worse (by more than X gains), even though Y is maximizing their own average score in the short run. So naive score maximizers can be taken advantage of. If X and Y instead use p and q to communicate, then it becomes a negotiation game like the ultimatum game: X proposes a split of the long-run average winnings, and Y either takes it (by playing the strategy that maximizes Y’s score) or refuses it (typically by turning into DefectBot). This means Y gives up short-term benefit to convince X that a more equitable division is necessary, which is only a good play when Y expects that X will actually change their behavior.
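To put numbers on that, here’s a continuation of the sketch. I’m using the χ = 3 extortion strategy that I believe is the paper’s worked example (so take the exact vector with a grain of salt), and a long-run average taken from a mutual-cooperation start, since with these strategies the chain can have more than one closed class:

```python
# X's extortionate strategy (chi = 3): X's surplus over the mutual-defection
# payoff is pinned at three times Y's, whatever Y plays.
extort3 = [11/13, 1/2, 7/26, 0]

def average_scores_from_start(p, q, start=0, rounds=5000):
    """Cesaro-average scores over `rounds` steps, starting from a given last-round
    state (0 = CC). Works even when the chain has several closed classes, where
    the long-run answer genuinely depends on where you start."""
    M = transition_matrix(p, q)
    d = np.zeros(4); d[start] = 1.0
    total_x = total_y = 0.0
    for _ in range(rounds):
        d = d @ M
        total_x += d @ PAYOFF_X
        total_y += d @ PAYOFF_Y
    return total_x / rounds, total_y / rounds

# "Take the offer": Y maximizes its own score, which here means full cooperation.
print(average_scores_from_start(extort3, cooperate_bot))  # roughly (3.7, 1.9)
# "Refuse the offer": Y turns into DefectBot and drags both toward mutual defection.
print(average_scores_from_start(extort3, defect_bot))     # roughly (1.0, 1.0)
```

X clears the cooperative 3 while Y sits well below it, which is exactly the lopsided split being proposed; Y’s only lever is the credible threat in the second line.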
So even though it’s not meaningful to make your probabilities for the next round depend on more than one round’s history, it is very meaningful to make the trajectory of your probability vector depend on the history of the other player’s probability vector (and your projection of where they’ll go in the future). That’s what they mean by theory of mind.
Thanks, this is a very useful explanation!