Reducing collective rationality to individual optimization in common-payoff games using MCMC
A well-known issue in game theory is that of multiple Nash equilibria in coordination games.
For example, consider the following game. Players 1 and 2 both choose either “high” or “low”. If both choose “high”, they both get 10 utility; if both choose “low”, they both get 5 utility; if they choose different actions, then they get no utility. There are 3 Nash equilibria: one where both always play “high”, one where both always play “low”, and one where each plays “high” with 1⁄3 probability and “low” with 2⁄3 probability. While (high, high) is obviously the best Nash equilibrium, unilaterally switching to high when the current equilibrium is (low, low) is probably unwise.
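As a quick arithmetic check on the mixed equilibrium (a minimal sketch using the payoffs above; the variable names are mine), when the opponent plays "high" with probability 1⁄3, both of a player's actions yield the same expected utility of 10⁄3, which is what makes the mixed strategy an equilibrium:

```python
# Payoffs for the coordination game above: both "high" -> 10, both "low" -> 5, mismatch -> 0.
payoff = {("high", "high"): 10, ("low", "low"): 5,
          ("high", "low"): 0, ("low", "high"): 0}

p_high = 1 / 3  # opponent's probability of playing "high" in the mixed equilibrium

for my_action in ("high", "low"):
    expected = (p_high * payoff[(my_action, "high")]
                + (1 - p_high) * payoff[(my_action, "low")])
    print(my_action, expected)  # both actions give 10/3, so the player is indifferent
```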
You could imagine TDT-style arguments that appeal to the symmetry between the players to argue that "high" is the only rational choice, but it is easy to imagine asymmetric variants of this problem where such symmetry arguments are much less clear.
The presence of suboptimal Nash equilibria looks like a serious problem for reconciling individual and collective rationality, even when players have common payoffs. If everyone is going to play “low”, then playing “low” is individually rational (at least according to CDT); however, the collective of agents fails to act rationally due to a coordination failure.
What I hoped to do before coming up with the idea in this post was to somehow reduce collective rationality to individual rationality. Can we prescribe individual rationality conditions for each player, such that if each player follows these conditions, a near-optimal outcome for everyone is attained?
Somewhat surprisingly, the answer is “kind of.” It certainly seems like solving coordination problems requires global optimization instead of just local optimization, but under appropriate conditions, local optimization is actually sufficient.
Problem setup
We will consider the setting of normal-form common-payoff games. Let $n$ be the number of players. Let $A$ be some finite set of actions; each player $i$ will select an action $A_i$. Let $U : A^n \to \mathbb{R}$ be the shared utility function. Each player receives the same utility $U(a_1, \ldots, a_n)$.
I will now assume that the players play some number of practice rounds before the "real" round. These practice rounds are very similar to those in the fictitious play algorithm. Ontologically, what are these practice rounds? One interpretation is that they are actual rounds of the game that have been played in the past; another interpretation is that they are a sequence of analogous one-shot rounds played by agents who have more and more compute and are thus able to simulate previous rounds. (The second interpretation might seem strange, but this is essentially how logical inductors achieve generalization bounds in decision theory problems.)
Let $T$ be the number of practice rounds. Let $A^t_i$ be the action selected by player $i$ in practice round $t$. I will assume $A_i = A^T_i$, i.e. the last practice round is the "real" round.
We will assume that, in each round $t$, some shared random number $R_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ is generated, and that players' policies in round $t$ may depend on this number.
How are players' policies represented? We will assume that each player has only black-box access to each other's actions in previous practice rounds. Let $\pi_i : \mathrm{List}(A^n) \times \{1, \ldots, n\} \to \Delta A$ be player $i$'s policy, which selects an action distribution for player $i$ in round $t$ given knowledge of the actions in all previous rounds and of the current shared random number.
The generative model is now obvious: for each $t$ in sequence, sample $R_t \sim \mathrm{Uniform}(\{1, \ldots, n\})$ and then sample each $A^t_i \sim \pi_i(A^{1:t-1}_{1:n}, R_t)$ independently, where the notation $A^{1:t-1}_{1:n}$ refers to the list $(A^1_{1:n}, \ldots, A^{t-1}_{1:n})$.
Given a set of policies, it is possible to compute, for each $t$, the expected utility at that time step (which is the expected utility attained if $T = t$):
$$v_t(\pi_{1:n}) := \mathbb{E}\left[U(A^t_{1:n}) \mid \pi_{1:n}\right]$$
The limit $\lim_{t \to \infty} v_t(\pi_{1:n})$ and the corresponding $\liminf$ and $\limsup$ are interesting; they indicate how well a set of policies does in expectation after a large number of practice rounds.
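To make the setup concrete, here is a minimal Python sketch of this generative model (the names `play_rounds`, `policies`, and `rng` are my own illustration, not anything from the post). A policy is represented as a function from the history and the shared random number to a dictionary mapping actions to probabilities:

```python
import random

def play_rounds(policies, T, rng=None):
    """Simulate T practice rounds and return the list of action profiles (one per round).

    policies: a list of n functions (history, R_t) -> dict mapping each action to a probability.
    """
    rng = rng or random.Random(0)
    n = len(policies)
    history = []
    for t in range(T):
        R_t = rng.randrange(n)  # shared random number; players are 0-indexed here
        profile = []
        for pi in policies:
            dist = pi(history, R_t)  # distribution depends only on previous rounds and R_t
            action = rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
            profile.append(action)
        history.append(tuple(profile))
    return history
```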
Policies that solve the game
The policies we will consider are ones that usually stick with the same action they played in the last round, but sometimes switch to a different action, and are more likely to switch the worse the utility in the last round was.
Let $u_{\min}$ be a lower bound on $U$. Let $\alpha > 0$ be some "rationality parameter". Consider the following set of policies:
$$\pi^\alpha_i(A^{1:t-1}_{1:n}, R_t) = \begin{cases} \mathrm{Uniform}(A) & \text{if } t = 1 \\ \delta(A^{t-1}_i) & \text{if } t > 1,\ R_t \neq i \\ e^{\alpha (u_{\min} - U(A^{t-1}_{1:n}))} \cdot \mathrm{Uniform}(A) + \left(1 - e^{\alpha (u_{\min} - U(A^{t-1}_{1:n}))}\right) \cdot \delta(A^{t-1}_i) & \text{otherwise} \end{cases}$$
where $\delta(a)$ is a delta distribution on $a$, and probability distributions are taken to be vectors.
These policies start out with each player selecting a random action. In each round, a random player is selected (according to $R_t$); this player switches their action to some uniformly random action with a probability that is lower the higher the utility in the previous round was, and otherwise sticks with the same action. Players other than the randomly selected player always stick with the same action.
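Here is a rough Python sketch of these policies, matching the three cases above and compatible with the simulator sketched in the problem setup (players are 0-indexed there; `make_policy` and its arguments are illustrative names of my own):

```python
import math

def make_policy(i, actions, utility, alpha, u_min):
    """Player i's policy: mostly repeat the previous action, occasionally resample uniformly."""
    uniform = {a: 1.0 / len(actions) for a in actions}

    def policy(history, R_t):
        if not history:               # t = 1: play a uniformly random action
            return dict(uniform)
        last_profile = history[-1]
        last_action = last_profile[i]
        if R_t != i:                  # not selected this round: repeat the previous action
            return {last_action: 1.0}
        # Selected player: resample uniformly with probability exp(alpha * (u_min - U(last profile)))
        p_switch = math.exp(alpha * (u_min - utility(last_profile)))
        dist = {a: p_switch / len(actions) for a in actions}
        dist[last_action] += 1.0 - p_switch
        return dist

    return policy
```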
An important aspect of these policies is that each player only does local optimization. That is, each player shifts their own action depending on the utility in the last round, such that over the course of many iterations they are more likely to take an action that attains a high utility given the other players’ actions. Naively, one would expect a local optimization algorithm like this one to get stuck in traps such as the (low, low) Nash equilibrium, but this turns out not to be the case.
We will prove the following theorem:
Theorem 1: For all $\epsilon > 0$ there exists $\alpha$ such that $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) \ge \max_{a_{1:n} \in A^n} U(a_{1:n}) - \epsilon$.
i.e. by setting $\alpha$ high enough, these policies achieve an arbitrarily-close-to-optimal expected utility in the limit.
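As an informal illustration (not part of the proof), running the two sketches above on the high/low game from the introduction with $\alpha = 1$ typically gives an empirical average utility in the later rounds that is near the optimum of 10 rather than near the suboptimal equilibrium value of 5; the exact number depends on the random seed:

```python
def U(profile):
    a1, a2 = profile
    if a1 == a2:
        return 10 if a1 == "high" else 5
    return 0

actions = ("high", "low")
policies = [make_policy(i, actions, U, alpha=1.0, u_min=0.0) for i in range(2)]
history = play_rounds(policies, T=20_000)
late = history[len(history) // 2:]
print(sum(U(p) for p in late) / len(late))  # typically close to 10, not 5
```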
Analyzing the process as a Markov chain
To prove this theorem, we can analyze the multi-round process with these policies as a Markov chain whose state is the action vector $A^t_{1:n}$. The transition function (giving the distribution of $A^{t+1}_{1:n}$ given $A^t_{1:n}$) is:
$$t^\alpha(a_{1:n})(a_{1:i-1}, a'_i, a_{i+1:n}) := \frac{e^{\alpha (u_{\min} - U(a_{1:n}))}}{n|A|} \quad \text{if } a'_i \neq a_i$$
$$t^\alpha(a_{1:n})(a_{1:n}) := 1 - \sum_{i=1}^{n} \sum_{a'_i \neq a_i} t^\alpha(a_{1:n})(a_{1:i-1}, a'_i, a_{i+1:n})$$
$$t^\alpha(a_{1:n})(a'_{1:n}) := 0 \quad \text{if } a_{1:n} \text{ and } a'_{1:n} \text{ differ in at least two coordinates}$$
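For small games, this transition function can be written out as an explicit matrix; here is a sketch (the helper name `transition_matrix` is mine). Enumerating all $|A|^n$ states is only feasible for tiny games, but it makes the stationarity and detailed-balance claims below easy to check numerically:

```python
import itertools, math

def transition_matrix(actions, n, utility, alpha, u_min):
    """Return (states, trans) where trans[s][s2] = t^alpha(states[s])(states[s2])."""
    states = list(itertools.product(actions, repeat=n))
    index = {s: k for k, s in enumerate(states)}
    trans = [[0.0] * len(states) for _ in states]
    for s in states:
        switch = math.exp(alpha * (u_min - utility(s)))
        for i in range(n):
            for a in actions:
                if a == s[i]:
                    continue
                s2 = s[:i] + (a,) + s[i + 1:]   # s with player i's action changed to a
                trans[index[s]][index[s2]] = switch / (n * len(actions))
        # remaining probability mass: stay in the same state
        trans[index[s]][index[s]] = 1.0 - sum(trans[index[s]])
    return states, trans
```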
Consider the following distribution over states:
$$p^\alpha(a_{1:n}) := \frac{e^{\alpha U(a_{1:n})}}{\sum_{a'_{1:n} \in A^n} e^{\alpha U(a'_{1:n})}}$$
i.e. it is a softmax that is more likely to select states with higher utilities.
We will show:
Lemma 1: $p^\alpha$ is the unique stationary distribution of the Markov chain defined by $t^\alpha$.
Proof:
It is well-known (see e.g. the Wikipedia article on detailed balance) that an ergodic Markov chain satisfying detailed balance (a condition defined below) with respect to a probability distribution over states has this distribution as its unique stationary distribution.
The Markov chain is aperiodic, since every state has a nonzero probability of transitioning to itself. Additionally, it is irreducible (i.e. it is possible to get from any state to any other state), since any action in the state can change to any other action with nonzero probability while leaving the other actions fixed. Therefore, the Markov chain is ergodic.
This Markov chain satisfies a detailed balance condition relative to the state distribution $p^\alpha$. Specifically, for any states $a_{1:n}$ and $a'_{1:n}$:
$$p^\alpha(a_{1:n}) \, t^\alpha(a_{1:n})(a'_{1:n}) = p^\alpha(a'_{1:n}) \, t^\alpha(a'_{1:n})(a_{1:n})$$
i.e. if a state is sampled from $p^\alpha$ and then transitioned once, the probability of a transition from $a_{1:n}$ to $a'_{1:n}$ equals the probability of a transition from $a'_{1:n}$ to $a_{1:n}$.
The proof of this detailed balance property proceeds by case analysis on $a_{1:n}$ and $a'_{1:n}$.
If $a_{1:n} = a'_{1:n}$, then the condition is trivial.
If $a_{1:n}$ and $a'_{1:n}$ differ on more than one element, then both transition probabilities are 0, so the condition is trivial.
If they differ on a single element $i$, then:
$$p^\alpha(a_{1:n}) \, t^\alpha(a_{1:n})(a'_{1:n}) = \frac{e^{\alpha U(a_{1:n})}}{\sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}} \cdot \frac{e^{\alpha (u_{\min} - U(a_{1:n}))}}{n|A|} = \frac{e^{\alpha u_{\min}}}{n|A| \sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}}$$
and similarly
$$p^\alpha(a'_{1:n}) \, t^\alpha(a'_{1:n})(a_{1:n}) = \frac{e^{\alpha U(a'_{1:n})}}{\sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}} \cdot \frac{e^{\alpha (u_{\min} - U(a'_{1:n}))}}{n|A|} = \frac{e^{\alpha u_{\min}}}{n|A| \sum_{b_{1:n} \in A^n} e^{\alpha U(b_{1:n})}}.$$
Since the Markov chain is ergodic and satisfies detailed balance with respect to $p^\alpha$, it follows that $p^\alpha$ is the unique stationary distribution of the Markov chain.
□
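As a numerical sanity check of Lemma 1 (again not part of the proof), one can build the transition matrix for the high/low game with the sketch above and confirm that the softmax distribution satisfies both detailed balance and stationarity:

```python
import math

def softmax_distribution(states, utility, alpha):
    weights = [math.exp(alpha * utility(s)) for s in states]
    total = sum(weights)
    return [w / total for w in weights]

states, trans = transition_matrix(("high", "low"), n=2, utility=U, alpha=1.0, u_min=0.0)
p = softmax_distribution(states, U, alpha=1.0)
m = len(states)

# Detailed balance: p(s) t(s -> s2) == p(s2) t(s2 -> s) for every pair of states
assert all(abs(p[s] * trans[s][s2] - p[s2] * trans[s2][s]) < 1e-12
           for s in range(m) for s2 in range(m))

# Stationarity: applying one step of the chain leaves p unchanged
p_next = [sum(p[s] * trans[s][s2] for s in range(m)) for s2 in range(m)]
assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_next))
```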
Now we will prove Theorem 1.
Theorem 1: For all $\epsilon > 0$ there exists $\alpha$ such that $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) \ge \max_{a_{1:n} \in A^n} U(a_{1:n}) - \epsilon$.
Proof:
Since the Markov chain is ergodic, the distribution of $A^t_{1:n}$ converges to its unique stationary distribution $p^\alpha$ as $t \to \infty$, so $\lim_{t \to \infty} v_t(\pi^\alpha_{1:n}) = \mathbb{E}_{a_{1:n} \sim p^\alpha}[U(a_{1:n})]$.
Define $u^* := \max_{a_{1:n} \in A^n} U(a_{1:n})$, i.e. the highest achievable utility. Define $u'$ to be the second-highest achievable utility.
At this point, we need only set $\alpha$ large enough that $p^\alpha$ assigns almost all of its probability mass to optimal states. The argument that this is possible follows.
Fix $\epsilon > 0$. Set
$$\alpha := \frac{n \log |A| + \log(u^* - u_{\min}) - \log \epsilon}{u^* - u'}.$$
Let $a^*_{1:n}$ be an optimal action profile, and let $a'_{1:n}$ be a non-optimal action profile. We have
$$\frac{p^\alpha(a^*_{1:n})}{p^\alpha(a'_{1:n})} = e^{\alpha (U(a^*_{1:n}) - U(a'_{1:n}))} \ge e^{\alpha (u^* - u')} = \frac{|A|^n (u^* - u_{\min})}{\epsilon}.$$
Since there is at least one optimal state and fewer than $|A|^n$ non-optimal states, the ratio between the probability (under $p^\alpha$) of the state being optimal and the probability of it being non-optimal is at least $\frac{u^* - u_{\min}}{\epsilon}$. Since in general $\frac{x}{1+x} \ge 1 - \frac{1}{x}$ for positive $x$, this means that the probability of the state being optimal is at least $1 - \frac{\epsilon}{u^* - u_{\min}}$. A non-optimal state has utility at least $u_{\min}$, so the overall expected suboptimality (i.e. the difference between $u^*$ and $\mathbb{E}_{a_{1:n} \sim p^\alpha}[U(a_{1:n})]$) is at most $\epsilon$, as desired.
□
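As a worked example of the choice of $\alpha$ (my own numbers, not from the post): in the high/low game, $u^* = 10$, $u' = 5$, and taking $u_{\min} = 0$ and $\epsilon = 0.1$, the formula gives $\alpha \approx 1.2$, and the resulting softmax distribution has expected utility within $\epsilon$ of the optimum:

```python
import itertools, math

n, num_actions = 2, 2
u_star, u_second, u_min, eps = 10.0, 5.0, 0.0, 0.1

alpha = (n * math.log(num_actions) + math.log(u_star - u_min) - math.log(eps)) / (u_star - u_second)

states = list(itertools.product(("high", "low"), repeat=n))
p = softmax_distribution(states, U, alpha)
expected_utility = sum(prob * U(s) for prob, s in zip(p, states))
print(alpha, expected_utility)  # alpha is roughly 1.2; expected utility is within eps of 10
assert u_star - expected_utility <= eps
```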
Extension to non-common-payoff games
The policies defined above can be extended to play non-common-payoff games. Specifically, each agent can have a higher likelihood of switching action in each round the lower their own utility is. Unfortunately, this results in mutual defection in a prisoner's dilemma; while I won't present a detailed analysis here, the gist of the argument is that (C, C) is much more likely to switch into (C, D) or (D, C) than (C, D) or (D, C) is to switch into (C, C). So a different approach will be needed for achieving Pareto optimal outcomes in non-common-payoff games.
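To make the gist slightly more concrete (this is my own rough reading of the variant, with standard prisoner's dilemma payoffs that I chose: 3 for mutual cooperation, 1 for mutual defection, 5 for defecting against a cooperator, 0 for cooperating against a defector), the one-step switching probabilities already show the asymmetry: leaving (C, C) only requires a player with payoff 3 to resample, while returning to (C, C) from (C, D) requires the defector with payoff 5 to do so:

```python
import math

# Hypothetical per-player prisoner's dilemma payoffs: (my action, opponent's action) -> my payoff
pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
alpha, u_min = 1.0, 0.0

def switch_prob(my_payoff):
    # In the "own utility" variant, a selected player resamples with this probability
    return math.exp(alpha * (u_min - my_payoff))

# Leaving (C, C): a selected cooperator (payoff 3) resamples and picks D half the time
p_leave_CC = switch_prob(pd[("C", "C")]) * 0.5   # about 0.025
# Returning to (C, C) from (C, D): only the defector (payoff 5) can restore cooperation
p_return_CC = switch_prob(pd[("D", "C")]) * 0.5  # about 0.003
print(p_leave_CC, p_return_CC)
```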
Conclusion
In this post I have shown that, in common-payoff games under a multi-round structure, there is a set of policies, each of which does only local optimization, that together yields a globally near-optimal outcome. This is important because getting global optimality from local optimization is required for decision theory (i.e. each agent taking an action based on its own expected utility) to produce Pareto optimal outcomes in common-payoff games.
In a previous post on coordination games, I presented a plan for solving formal decision theory:
1. Solve Cartesian common-payoff decision theory problems where players reason about each other using something like reflective oracles.
2. Extend the solution in 1 to Cartesian multiplayer game theory problems where players have different utility functions and reason about each other using something like reflective oracles.
3. Extend the solution in 2 to a logically uncertain naturalized setting.
This post has done something pretty close to solving 1, with the multi-round structure taking the place of reflective oracles. Unfortunately, step 2 fails since the solution method does not extend to non-common-payoff games. So it appears that, for my plan to work, 1 must be solved in a different way that can be extended to also solve non-common-payoff games.
Alternatively, the problem of non-common-payoff games may be attacked directly, though this seems difficult given that non-common-payoff games are more complex than common-payoff games, and include all the dynamics present in common-payoff games. But perhaps restricted subsets of general games, other than the set of common-payoff games, can be found and solved.