What counts as defection?
Thanks to Michael Dennis for proposing the formal definition; to Andrew Critch for pointing me in this direction; to Abram Demski for proposing non-negative weighting; and to Alex Appel, Scott Emmons, Evan Hubinger, philh, Rohin Shah, and Carroll Wainwright for their feedback and ideas.
There’s a good chance I’d like to publish this at some point as part of a larger work. However, I wanted to make the work available now, in case that doesn’t happen soon.
They can’t prove the conspiracy… But they could, if Steve runs his mouth.
The police chief stares at you.
You stare at the table. You’d agreed (sworn!) to stay quiet. You’d even studied game theory together. But, you hadn’t understood what an extra year of jail meant.
The police chief stares at you.
Let Steve be the gullible idealist. You have a family waiting for you.
Sunlight stretches across the valley, dappling the grass and warming your bow. Your hand anxiously runs along the bowstring. A distant figure darts between trees, and your stomach rumbles. The day is near spent.
The stags run strong and free in this land. Carla should meet you there. Shouldn’t she? Who wants to live like a beggar, subsisting on scraps of lean rabbit meat?
In your mind’s eye, you reach the stags, alone. You find one, and your arrow pierces its barrow. The beast shoots away; the rest of the herd follows. You slump against the tree, exhausted, and never open your eyes again.
You can’t risk it.
People talk about ‘defection’ in social dilemma games, from the prisoner’s dilemma to stag hunt to chicken. In the tragedy of the commons, we talk about defection. The concept has become a regular part of LessWrong discourse.
Informal definition. A player defects when they increase their personal payoff at the expense of the group.
This informal definition is no secret, being echoed from the ancient Formal Models of Dilemmas in Social Decision-Making to the recent Classifying games like the Prisoner’s Dilemma:
you can model the “defect” action as “take some value for yourself, but destroy value in the process”.
Given that the prisoner’s dilemma is the bread and butter of game theory and of many parts of economics, evolutionary biology, and psychology, you might think that someone had already formalized this. However, to my knowledge, no one has.
Formalism
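Consider a finite $n$-player normal-form game, with player $i$ having pure action set $A_i$ and payoff function $P_i : A_1 \times \cdots \times A_n \to \mathbb{R}$ (extended to mixed strategies by taking expectations). Each player $i$ chooses a strategy $s_i \in \Delta(A_i)$ (a distribution over $A_i$). Together, the strategies form a strategy profile $s := (s_1, \ldots, s_n)$. $s_{-i} := (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_n)$ is the strategy profile, excluding player $i$’s strategy. A payoff profile contains the payoffs for all players under a given strategy profile.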
A utility weighting $(\alpha_j)_{j=1,\ldots,n}$ is a set of $n$ non-negative weights (as in Harsanyi’s utilitarian theorem). You can consider the weights as quantifying each player’s contribution; they might represent a perceived social agreement or be the explicit result of a bargaining process.
When all $\alpha_j$ are equal, we’ll call that an equal weighting. However, if there are “utility monsters”, we can downweight them accordingly.
We’re implicitly assuming that payoffs are comparable across players. We want to investigate: given a utility weighting, which actions are defections?
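Definition. Player $i$’s action $a \in A_i$ is a defection against strategy profile $s$ and weighting $(\alpha_j)_{j=1,\ldots,n}$ if
1. Personal gain: $P_i(a, s_{-i}) > P_i(s_i, s_{-i})$
2. Social loss: $\sum_j \alpha_j P_j(a, s_{-i}) < \sum_j \alpha_j P_j(s_i, s_{-i})$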
If such an action exists for some player $i$, strategy profile $s$, and weighting, then we say that there is an opportunity for defection in the game.
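The definition can be checked mechanically. Below is a minimal Python sketch, assuming payoff functions are given as callables on pure joint actions (tuples of action indices) and mixed strategies as probability lists. The names `expected_payoff`, `deviate`, and `is_defection` are hypothetical helpers, not from any library. The sketch extends each $P_j$ to mixed profiles by expectation and then tests conditions (1) and (2) directly.

```python
from itertools import product
from typing import Callable, List, Sequence

# Payoff function P_j as a callable on pure joint actions (tuples of action
# indices). A mixed strategy is a list of probabilities over a player's
# actions; a strategy profile holds one mixed strategy per player.
PayoffFn = Callable[[tuple], float]


def expected_payoff(payoffs: Sequence[PayoffFn], j: int,
                    profile: Sequence[Sequence[float]]) -> float:
    """Expected payoff P_j(s) to player j under the mixed profile s."""
    total = 0.0
    for joint in product(*(range(len(s)) for s in profile)):
        prob = 1.0
        for player, action in enumerate(joint):
            prob *= profile[player][action]
        total += prob * payoffs[j](joint)
    return total


def deviate(profile, i: int, action: int) -> List[List[float]]:
    """The profile (a, s_{-i}): player i switches to the pure action `action`."""
    pure = [0.0] * len(profile[i])
    pure[action] = 1.0
    return [pure if k == i else list(s) for k, s in enumerate(profile)]


def is_defection(payoffs, profile, i: int, action: int,
                 weights: Sequence[float]) -> bool:
    """Check conditions (1) and (2) for player i's pure action `action`
    against `profile` under the non-negative weighting `weights`."""
    new = deviate(profile, i, action)
    n = len(profile)

    def welfare(s):
        return sum(weights[j] * expected_payoff(payoffs, j, s) for j in range(n))

    personal_gain = expected_payoff(payoffs, i, new) > expected_payoff(payoffs, i, profile)
    social_loss = welfare(new) < welfare(profile)
    return personal_gain and social_loss


# Usage: Prisoner's Dilemma with illustrative payoffs T=5, R=3, P=1, S=0.
# Action 0 is C (cooperate), action 1 is D (defect).
R, S, T, P = 3, 0, 5, 1
table = {(0, 0): (R, R), (0, 1): (S, T), (1, 0): (T, S), (1, 1): (P, P)}
payoffs = [lambda a, j=j: table[a][j] for j in range(2)]
both_cooperate = [[1.0, 0.0], [1.0, 0.0]]
print(is_defection(payoffs, both_cooperate, 0, 1, [1.0, 1.0]))  # True
```

Enumerating every joint action is exponential in $n$, which is fine for the small games discussed here but would not scale to large ones.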
Remark. For an equal weighting, condition (2) is equivalent to demanding that the action not be a Kaldor-Hicks improvement.
Our definition seems to make reasonable intuitive sense. In the tragedy of the commons, each player rationally increases their utility, while imposing negative externalities on the other players and decreasing total utility. A spy might leak classified information, benefiting themselves and Russia but defecting against America.
Definition. Cooperation takes place when a strategy profile is maintained despite the opportunity for defection.
Theorem 1. In constant-sum games, there is no opportunity for defection against equal weightings.
Theorem 2. In common-payoff games (where all players share the same payoff function), there is no opportunity for defection.
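As a numerical sanity check of Theorems 1 and 2 (not a proof), one can grid-search the mixed profiles of a $2\times 2$ game for defections. This sketch assumes pure-action deviations only and uses a small tolerance `eps` for floating-point comparisons; `expected` and `has_defection` are hypothetical helpers. Matching pennies (zero-sum, equal weights) and a common-payoff coordination game (several non-negative weightings) turn up nothing, while a Prisoner’s Dilemma does.

```python
from itertools import product


def expected(matrix, p, q):
    """Expected payoff from matrix[row][col] when the row player plays
    action 0 with probability p and the column player with probability q."""
    return sum(matrix[r][c] * (p if r == 0 else 1 - p) * (q if c == 0 else 1 - q)
               for r, c in product(range(2), repeat=2))


def has_defection(A, B, weights, steps=20, eps=1e-9):
    """Grid-search mixed profiles (p, q) for any defection in the 2x2 game
    with row payoffs A and column payoffs B. A numerical check, not a proof."""
    grid = [k / steps for k in range(steps + 1)]
    w0, w1 = weights
    for p, q in product(grid, repeat=2):
        base = (expected(A, p, q), expected(B, p, q))
        for action in (0, 1):
            pure = 1.0 if action == 0 else 0.0
            # i = 0: row player deviates; i = 1: column player deviates.
            for i, (pp, qq) in enumerate([(pure, q), (p, pure)]):
                new = (expected(A, pp, qq), expected(B, pp, qq))
                gain = new[i] > base[i] + eps
                loss = w0 * new[0] + w1 * new[1] < w0 * base[0] + w1 * base[1] - eps
                if gain and loss:
                    return True
    return False


matching_pennies_row = [[1, -1], [-1, 1]]
matching_pennies_col = [[-1, 1], [1, -1]]
print(has_defection(matching_pennies_row, matching_pennies_col, (1, 1)))  # False

coordination = [[2, 0], [0, 1]]  # both players share this payoff matrix
print(any(has_defection(coordination, coordination, w)
          for w in [(1, 1), (3, 1), (0, 2)]))  # False

pd_row = [[3, 0], [5, 1]]  # T=5, R=3, P=1, S=0; action 0 is C
pd_col = [[3, 5], [0, 1]]
print(has_defection(pd_row, pd_col, (1, 1)))  # True
```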
Edit: In private communication, Joel Leibo points out that these two theorems formalize the intuition behind the proverb “all’s fair in love and war”: you can’t defect in fully competitive or fully cooperative situations.
Proposition 3. There is no opportunity for defection against Nash equilibria.
An action $a \in A_i$ is a Pareto improvement over strategy profile $s$ if, for all players $j$, $P_j(a, s_{-i}) \ge P_j(s_i, s_{-i})$.
Proposition 4. Pareto improvements are never defections.
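Both propositions follow in one line from the definition. Here is a short sketch of the reasoning, written out as a reconstruction rather than quoted from a formal proof:

```latex
% Proposition 3: if s is a Nash equilibrium, no deviation helps player i,
% so condition (1) (personal gain) fails for every a \in A_i:
P_i(a, s_{-i}) \le P_i(s_i, s_{-i}).
% Proposition 4: if a is a Pareto improvement, every P_j weakly increases,
% and since every \alpha_j \ge 0 the weighted sum does too, so condition (2)
% (social loss) fails:
\sum_j \alpha_j P_j(a, s_{-i}) \ge \sum_j \alpha_j P_j(s_i, s_{-i}).
```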
Game Theorems
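We can prove that formal defection exists in the trifecta of famous games. Feel free to skip proofs if you aren’t interested.
Throughout, consider a symmetric $2\times 2$ game with actions $C$ and $D$. If both players play $C$, each receives $R$; if both play $D$, each receives $P$; if one plays $D$ while the other plays $C$, the $D$-player receives $T$ and the $C$-player receives $S$. (The unsubscripted $P$ is this mutual-$D$ payoff, not a payoff function.)
Theorem 5. In symmetric $2\times 2$ games, if the Prisoner’s Dilemma inequality $T > R > P > S$ is satisfied, defection can exist against equal weightings.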
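Proof. Suppose the Prisoner’s Dilemma inequality holds. Further suppose that $R > \frac{1}{2}(T+S)$, so that $2R > T+S$. Then since $T > R$ but $T+S < 2R$, both players defect from $(C, C)$ with $D$.
Suppose instead that $R \le \frac{1}{2}(T+S)$. Then $T+S \ge 2R > 2P$, so $T+S > 2P$. But $S < P$, so player 1 defects from $(C, D)$ with action $D$, and player 2 defects from $(D, C)$ with action $D$. QED.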
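To see both cases of the proof on concrete numbers, here is a small Python sketch that enumerates the pure profiles and reports every switch from $C$ to $D$ satisfying both conditions under equal weights. The payoff values are illustrative choices, not taken from the post, and `pd_defections` is a hypothetical helper.

```python
def pd_defections(T, R, P, S):
    """List (profile, player) pairs where that player's switch from C to D
    is a defection under equal weights, over the pure profiles of a
    symmetric 2x2 game. Payoff tuples are (player 1, player 2)."""
    assert T > R > P > S, "Prisoner's Dilemma inequality"
    payoff = {("C", "C"): (R, R), ("C", "D"): (S, T),
              ("D", "C"): (T, S), ("D", "D"): (P, P)}
    found = []
    for profile in payoff:
        for i in (0, 1):
            if profile[i] == "D":
                continue  # only switches from C to D are considered
            new = tuple("D" if k == i else a for k, a in enumerate(profile))
            gain = payoff[new][i] > payoff[profile][i]
            loss = sum(payoff[new]) < sum(payoff[profile])
            if gain and loss:
                found.append((profile, i + 1))
    return found


# Case 2R > T + S: both players defect from (C, C).
print(pd_defections(T=5, R=3, P=1, S=0))
# [(('C', 'C'), 1), (('C', 'C'), 2), (('C', 'D'), 1), (('D', 'C'), 2)]

# Case 2R <= T + S: switching away from (C, C) no longer lowers the total,
# but the switches from (C, D) and (D, C) are still defections.
print(pd_defections(T=10, R=3, P=1, S=0))
# [(('C', 'D'), 1), (('D', 'C'), 2)]
```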
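Theorem 6. In symmetric $2\times 2$ games, if the Stag Hunt inequality $R > T \ge P > S$ is satisfied, defection can exist against equal weightings.
Proof. Suppose that the Stag Hunt inequality is satisfied. Let $p$ be the probability that player 1 plays $C$ (Stag). We now show that player 2 can always defect against strategy profile $(p, C)$ for some value of $p$.
For defection’s first condition, we determine when $P_2(p, C) < P_2(p, D)$:
$$pR + (1-p)S < pT + (1-p)P \iff p < \frac{P-S}{(R-T)+(P-S)}.$$
This denominator is positive ($R > T$ and $P > S$), as is the numerator. Because $R > T$, the denominator exceeds the numerator, so the fraction falls in the open interval $(0, 1)$.
For defection’s second condition, we determine when $P_1(p, D) + P_2(p, D) < P_1(p, C) + P_2(p, C)$:
$$p(T+S) + 2(1-p)P < 2pR + (1-p)(T+S) \iff p > \frac{(P-S)+(P-T)}{2[(R-T)+(P-S)]}.$$
Combining the two conditions, we have
$$\frac{(P-S)+(P-T)}{2[(R-T)+(P-S)]} < p < \frac{P-S}{(R-T)+(P-S)}.$$
Since $T \ge P$, the numerator $(P-S)+(P-T)$ is at most $P-S$, so the lower bound is at most half the upper bound. The upper bound lies in $(0, 1)$, so the inequality holds for some nonempty subinterval of $(0, 1)$. QED.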
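The interval from the proof can be computed exactly with Python’s `fractions.Fraction` and checked against a direct evaluation of the two conditions. This is a minimal sketch assuming integer payoffs (`Fraction(a, b)` needs rational arguments); `R=4, T=3, P=2, S=0` is an arbitrary illustrative Stag Hunt, and both helper names are hypothetical.

```python
from fractions import Fraction


def stag_hunt_interval(R, T, P, S):
    """Bounds from the proof of Theorem 6 (integer payoffs assumed): player 2
    switching from C to D against (p, C) is a defection iff lower < p < upper."""
    assert R > T >= P > S, "Stag Hunt inequality"
    denom = (R - T) + (P - S)
    upper = Fraction(P - S, denom)                  # condition (1): personal gain
    lower = Fraction((P - S) + (P - T), 2 * denom)  # condition (2): social loss
    return lower, upper


def player2_defects(p, R, T, P, S):
    """Direct check of both conditions from the 2x2 payoffs, independent of the algebra."""
    stay, switch = p * R + (1 - p) * S, p * T + (1 - p) * P
    welfare_stay = 2 * p * R + (1 - p) * (T + S)
    welfare_switch = p * (S + T) + 2 * (1 - p) * P
    return switch > stay and welfare_switch < welfare_stay


lower, upper = stag_hunt_interval(R=4, T=3, P=2, S=0)
print(lower, upper)  # 1/6 2/3
midpoint = (lower + upper) / 2
print(player2_defects(midpoint, R=4, T=3, P=2, S=0))  # True
```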
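Theorem 7. In symmetric $2\times 2$ games, if the Chicken inequality $T > R > S > P$ is satisfied, defection can exist against equal weightings.
Proof. Assume that the Chicken inequality is satisfied. This proof proceeds like that of theorem 6. Let $p$ be the probability that player 1’s strategy places on $C$ (swerve).
For defection’s first condition, we determine when $P_2(p, C) < P_2(p, D)$:
$$pR + (1-p)S < pT + (1-p)P \iff p > \frac{P-S}{(R-T)+(P-S)}.$$
The inequality flips because we divide by $(R-T)+(P-S)$, which is negative ($T > R$ and $S > P$). Since $S > P$, the bound is positive, so $p > 0$; this reflects the fact that $(D, C)$ is a Nash equilibrium, against which defection is impossible (proposition 3).
For defection’s second condition, we determine when $P_1(p, D) + P_2(p, D) < P_1(p, C) + P_2(p, C)$:
$$p(T+S) + 2(1-p)P < 2pR + (1-p)(T+S) \iff p < \frac{(P-S)+(P-T)}{2[(R-T)+(P-S)]}.$$
The inequality again flips because $(R-T)+(P-S)$ is negative. When $2R \le T+S$, we have $\frac{(P-S)+(P-T)}{2[(R-T)+(P-S)]} \le 1$, in which case defection does not exist against a pure strategy profile.
Combining the two conditions, we have
$$\frac{P-S}{(R-T)+(P-S)} < p < \frac{(P-S)+(P-T)}{2[(R-T)+(P-S)]}.$$
Because $T > S$, we have $2(P-S) > (P-S)+(P-T)$; dividing both sides by the negative quantity $2[(R-T)+(P-S)]$ shows that the lower bound is strictly below the upper bound. The lower bound lies in $(0, 1)$ because $T > R$, so some $p \in (0, 1)$ satisfies both conditions.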
QED.
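The same kind of exact check works for Chicken. In this minimal sketch (integer payoffs assumed, helper names hypothetical), the illustrative payoffs `T=6, R=2, S=1, P=0` are chosen so that $2R \le T+S$: switching away from the pure profile $(C, C)$ is then not a defection, but it is for mixed $p$ inside the interval.

```python
from fractions import Fraction


def chicken_interval(T, R, S, P):
    """Bounds from the proof of Theorem 7 (integer payoffs assumed): player 2
    switching from C to D against (p, C) is a defection iff lower < p < upper."""
    assert T > R > S > P, "Chicken inequality"
    denom = (T - R) + (S - P)  # positive: the negative of (R-T)+(P-S) in the text
    lower = Fraction(S - P, denom)                  # condition (1): personal gain
    upper = Fraction((T - P) + (S - P), 2 * denom)  # condition (2): social loss
    return lower, upper


def player2_defects(p, T, R, S, P):
    """Direct check of both conditions from the 2x2 payoffs."""
    stay, switch = p * R + (1 - p) * S, p * T + (1 - p) * P
    welfare_stay = 2 * p * R + (1 - p) * (T + S)
    welfare_switch = p * (S + T) + 2 * (1 - p) * P
    return switch > stay and welfare_switch < welfare_stay


lower, upper = chicken_interval(T=6, R=2, S=1, P=0)
print(lower, upper)  # 1/5 7/10
print(player2_defects((lower + upper) / 2, T=6, R=2, S=1, P=0))  # True
print(player2_defects(1, T=6, R=2, S=1, P=0))  # False: pure (C, C)
```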
Discussion
This bit of basic theory will hopefully allow for things like principled classification of policies: “has an agent learned a ‘non-cooperative’ policy in a multi-agent setting?”. For example, the empirical game-theoretic analyses (EGTA) of Leibo et al.’s Multi-agent Reinforcement Learning in Sequential Social Dilemmas say that apple-harvesting agents are defecting when they zap each other with beams. Instead of using a qualitative metric, you could choose a desired non-zapping strategy profile, and then use EGTA to classify formal defections from that. This approach would still have a free parameter, but it seems better.
I had vague pre-theoretic intuitions about ‘defection’, and now I feel more capable of reasoning about what is and isn’t a defection. In particular, I’d been confused by the difference between power-seeking and defection, and now I’m not.
This post’s main contribution is the formalization of game-theoretic defection as gaining personal utility at the expense of coalitional utility.
Rereading, the post feels charmingly straightforward and self-contained. The formalization feels obvious in hindsight, but I remember being quite confused about the precise difference between power-seeking and defection—perhaps because popular examples of taking over the world are also defections against the human/AI coalition. I now feel cleanly deconfused about this distinction. And if I was confused about it, I’d bet a lot of other people were, too.
I think this post is valuable as a self-contained formal insight into the nature of defection. If I could vote on it, I’d give it a 4 (or perhaps a 3, if the voting system allowed it).