Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real-world agents have to learn, whereas game theory assumes perfect knowledge, which can only be achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game-theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).
The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requires considering shallow time discounts, otherwise there is no notion of long-term planning. For shallow time discounts, the game theory of a repeated game is very different from the game theory of the original normal form game. In fact, the folk theorem asserts that any payoff vector above the maximin of each player is a possible Nash payoff. So, proving convergence to a Nash equilibrium amounts (more or less) to proving convergence to at least the maximin payoff. This is possible using incomplete models, but doesn't seem very interesting: to receive the maximin payoff, the agents only have to learn the rules of the game; they need not learn the reward functions of the other players or anything else about them.
We arrive at the question: what setting is realistic (in the sense of involving learning with shallow time discount) and is expected to produce Nash equilibria for a normal form game? I suggest the following. Instead of a fixed set of agents repeatedly playing against each other, we consider a population of agents that are teamed off randomly on each round of the game. The population is assumed to be large enough for agents not to encounter each other more than once. This can be formalized as follows. Let $A_i$ be the pure strategy set of the $i$-th agent and $O := \prod_i A_i$ the set of pure outcomes. The set of $n$-round outcome histories is $O^n$. The population of agents on the $n$-th round can then be described as a probability measure $\mu_n \in \Delta O^n$. Suppose the policy of the $i$-th player (that is, of all the agents that take the role of the $i$-th player) is $\pi_i : O^n \to \Delta A_i$. Then we can define a time evolution rule that produces $\mu_{n+1}$ from $\mu_n$. This rule works as follows: in order to sample $\mu_{n+1}$ we sample $\mu_n$ once per player (this is the history the given player has seen), sample the policy of each player on its own history, and produce a new history by appending the resulting outcome to one of the old histories (it doesn't matter which). A set of policies is considered to be in equilibrium when, for any $i$ and any alternative policy $\pi'_i$, letting $\pi'_i$ play against the same population (i.e. all other copies of the $i$-th player still play $\pi_i$) doesn't improve expected utility. In other words, on each round the "mutant" agent retains its own history, but the other player histories are still sampled from the same $\mu_n$. It is easy to see that any equilibrium payoff in this setting is a Nash payoff in the original normal form game. We can then legitimately ask whether taking the $\pi_i$ to be learning algorithms would result in convergence to a Nash payoff in the $\gamma \to 1$ (shallow time discount) limit.
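To make the time-evolution rule concrete, here is a minimal Python sketch (my own illustration, not part of the formal setting): it approximates $\mu_n$ by an empirical sample of histories and applies one step of the rule. The function names, and the representation of policies as callables returning dictionaries of action probabilities, are assumptions of the sketch.

```python
import random

def sample_action(dist):
    """Sample from a distribution given as a dict {action: probability}."""
    actions, weights = zip(*dist.items())
    return random.choices(actions, weights=weights)[0]

def evolve(population, policies):
    """One step mu_n -> mu_{n+1}, with mu_n approximated by a list of sampled
    histories; a history is a list of outcomes, an outcome a tuple with one
    action per player. policies[i] maps a history to a distribution over A_i."""
    new_population = []
    for _ in range(len(population)):
        # Each player independently draws its own history from mu_n ...
        own_histories = [random.choice(population) for _ in policies]
        # ... and acts on that history; together the actions form the new outcome.
        outcome = tuple(sample_action(pi(h))
                        for pi, h in zip(policies, own_histories))
        # Append the outcome to one of the old histories (which one doesn't matter).
        new_population.append(own_histories[0] + [outcome])
    return new_population
```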
For example, consider the Prisoner's Dilemma. In the repeated Prisoner's Dilemma with shallow time discount, CC is an equilibrium because of the tit-for-tat policy. On the other hand, in the "population" (massively multi-player?) repeated Prisoner's Dilemma, DD is the only equilibrium. Tit-for-tat doesn't work because a single "defect bot" can exploit a population of tit-for-tats: on each round it plays with a new opponent that doesn't know the defect bot defected on the previous round.
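A tiny worked comparison of per-round payoffs makes the exploitation argument explicit; the specific payoff numbers below are my own illustrative choice of standard Prisoner's Dilemma values, not something fixed by the discussion above.

```python
# Standard Prisoner's Dilemma payoffs with T > R > P > S; the particular
# numbers are an illustrative choice.
T, R, P, S = 5, 3, 1, 0

# Population setting: the lone defect bot meets a *fresh* tit-for-tat every
# round, so its opponent always opens with C and never gets to retaliate.
defect_bot_per_round = T       # 5 per round, indefinitely
tft_vs_other_tfts = R          # 3 per round for tit-for-tats among themselves

# Ordinary repeated game against a single tit-for-tat, for contrast: defection
# buys one payoff of T, then mutual defection P forever, which loses to steady
# cooperation R once the time discount is shallow enough.
print(defect_bot_per_round > tft_vs_other_tfts)  # True: defection pays here
```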
Note that we get a very different setting if we allow the players to see each other’s histories, more similar (equivalent?) to the regular repeated game. For example, in the Prisoner’s Dilemma we have a version of tit-for-tat that responds to what its current opponent played in its previous round (against a different opponent). This may be regarded as a confirmation of the idea that agents that know each other’s source code are effectively playing a repeated game: in this setting, knowing the source code amounts to knowing the history.
We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a finite portion of fixed size of their opponents' histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, such that the opponent history is processed by the automaton and the result becomes known to the agent.
What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to the source code, but I don't know how to define this limitation.
EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.
Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size $O\left(\log \frac{1}{1-\gamma}\right)$. This is because something like "counting pieces of evidence" should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming $O\left(\log \frac{1}{1-\gamma}\right)$ memory (or whatever is needed for learning) is enough to reach superrationality.
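As a minimal illustration of the "counting pieces of evidence" point (my own sketch, assuming a Dirichlet prior over each row of the transition matrix so that the counts are a sufficient statistic for the belief state):

```python
from collections import defaultdict

class TransitionCounter:
    """For a finite MDP, a Bayesian learner with a Dirichlet prior over each
    row of the transition matrix only needs the transition counts to represent
    its belief state; each counter takes O(log n) bits after n observations,
    rather than storing probabilities to arbitrary precision."""

    def __init__(self, n_states, prior=1.0):
        self.n_states = n_states
        self.prior = prior
        self.counts = defaultdict(int)  # (state, action, next_state) -> count

    def update(self, state, action, next_state):
        self.counts[(state, action, next_state)] += 1

    def posterior_mean(self, state, action, next_state):
        # Posterior mean transition probability under the Dirichlet prior.
        total = sum(c for (s, a, _), c in self.counts.items()
                    if s == state and a == action)
        return ((self.counts[(state, action, next_state)] + self.prior)
                / (total + self.prior * self.n_states))
```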
What do you mean by equivalent? The entire history doesn’t say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.
For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code.
The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size $O\left(\log \frac{1}{1-\gamma}\right)$?), you can (maybe?) prove convergence to Pareto-efficient outcomes or some other strong desideratum that deserves to be called "superrationality".
In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability $1-\gamma$ of dying on every round, and new players are born to replenish the dead. So, if the size of the population is $N$ (we always consider the "thermodynamic" $N \to \infty$ limit), $N(1-\gamma)$ players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have a different mortality = discount rate.) Crucially, we group the players into games randomly, independently of age.
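A one-line check of the "mortality = geometric discount" point, using the stated independence of mortality from rewards (the convention for how the birth and death rounds are counted only shifts the exponent by one):

```latex
% Let \tau be the (geometric) lifetime of a player, independent of the rewards.
\begin{equation*}
  \mathbb{E}\!\left[\sum_{t < \tau} r_t\right]
  \;=\; \sum_{t=0}^{\infty} \Pr[\tau > t]\,\mathbb{E}[r_t]
  \;=\; \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}[r_t].
\end{equation*}
```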
As before, each player type $i$ chooses a policy $\pi_i : O^* \to \Delta A_i$. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which are now allowed to be of variable length: $\mu_n \in \Delta O^*$. For each assignment of policies to player types, we get dynamics $\mu_{n+1} = T_\pi(\mu_n)$, where $T_\pi : \Delta O^* \to \Delta O^*$. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems.
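Here is how the earlier sampling sketch could be adapted into an empirical approximation of $T_\pi$ (again my own illustration; in particular, resolving deaths just before play is a modeling convention of the sketch, not something pinned down above):

```python
import random

def mortal_step(population, policies, gamma, sample_action):
    """One empirical step of T_pi: histories are variable-length lists of
    outcomes; with probability 1 - gamma a sampled player is replaced by a
    newborn with an empty history before the round is played. The
    sample_action helper is the one from the earlier sketch."""
    new_population = []
    for _ in range(len(population)):
        own_histories = []
        for _ in policies:
            h = random.choice(population)
            if random.random() < 1 - gamma:   # death, replaced by a newborn
                h = []
            own_histories.append(h)
        outcome = tuple(sample_action(pi(h))
                        for pi, h in zip(policies, own_histories))
        new_population.append(own_histories[0] + [outcome])
    return new_population
```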
If we consider only the age distribution, then its evolution doesn't depend on $\pi$ and it always converges to the unique fixed point distribution $\zeta(k) = (1-\gamma)\gamma^k$. Therefore it is natural to restrict the dynamics to the subspace of $\Delta O^*$ that corresponds to the age distribution $\zeta$. We denote it $P$.
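For completeness, a short derivation of this fixed point: ages evolve by survival with probability $\gamma$, with newborns making up the remaining $1-\gamma$ fraction of the population.

```latex
\begin{align*}
  \zeta_{n+1}(0)   &= 1 - \gamma, \\
  \zeta_{n+1}(k+1) &= \gamma\,\zeta_n(k).
\end{align*}
% Imposing stationarity \zeta_{n+1} = \zeta_n = \zeta and unrolling the
% recursion gives \zeta(k) = (1-\gamma)\gamma^k, which sums to 1 as required.
```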
Does the dynamics have fixed points? $O^*$ can be regarded as a subspace of $(O \sqcup \{\bot\})^\omega$. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but $O^*$ is not closed. So, w.r.t. the weak topology on probability measure spaces, $\Delta (O \sqcup \{\bot\})^\omega$ is also compact but $\Delta O^*$ isn't. However, it is easy to see that $P$ is closed in $\Delta (O \sqcup \{\bot\})^\omega$ and therefore compact. It may also be regarded as a convex subset of an appropriate Banach space (the dual of the space of Lipschitz functions on some metrization of $(O \sqcup \{\bot\})^\omega$). Moreover, it is easy to see $T_\pi$ is continuous (for populations that are close in the Kantorovich-Rubinstein metric, only the old players may have very different distributions, but old players are a small fraction of the population, so their effect on the next round is small). By the Schauder fixed-point theorem, it follows that $T_\pi$ has a fixed point.
What are the fixed points like? Of course it depends on $\pi$. In a fixed point, every player observes a sequence of IID plays in all of eir games. Therefore, if $\pi$ satisfies the (very mild!) learning-theoretic desideratum that, upon observing an IID sequence, it converges to optimal response in the $\gamma \to 1$ limit, then, in the same limit, fixed points are Nash equilibria. This works even for extremely simple learning algorithms, such as "assume the plays in the next game will be sampled from a random past game", and it works for any Bayesian or "quasi-Bayesian" (i.e. using incomplete/fuzzy models) agent that includes all IID processes in its prior.
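A sketch of the simple rule just mentioned (my own rendering, with an illustrative signature): believing the next game is drawn from a uniformly random past game is the same as believing the opponents play the empirical distribution of past play, so acting optimally on that belief means best-responding to the empirical mixture.

```python
from collections import Counter

def empirical_best_response(history, payoff, my_actions, default_action):
    """history: list of past outcomes, each a tuple whose index 0 is my own
    action; payoff(a, opponents) is my reward. Returns a distribution over
    actions as a dict, matching the earlier sketches' interface."""
    if not history:
        return {default_action: 1.0}
    opponent_plays = [outcome[1:] for outcome in history]
    counts = Counter(opponent_plays)
    n = len(opponent_plays)

    def expected_payoff(a):
        # Expected payoff against the empirical distribution of opponent play.
        return sum(c * payoff(a, opp) for opp, c in counts.items()) / n

    return {max(my_actions, key=expected_payoff): 1.0}
```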
This raises a range of interesting questions:
Are any/all of the fixed points attractors?
Does convergence to a fixed point occur for all or at least almost all initial conditions?
Do all Nash equilibria correspond to fixed points?
Do stronger game theoretic solution concepts (e.g. proper equilibria) have corresponding dynamical properties?
Mortal population games are obviously reminiscent of evolutionary game theory. However, there are substantial differences. In mortal population games, the game doesn't have to be symmetric, we consider a single policy rather than many competing policies, the policies learn from experience instead of corresponding to fixed strategies, and the mortality rate doesn't depend on the reward. In evolutionary game theory, convergence usually cannot be guaranteed. For example, in the rock-paper-scissors game, the population may cycle among the different strategies. On the other hand, in mortal population games, if the game is two-player zero-sum (which includes rock-paper-scissors), and the policy is quasi-Bayesian with an appropriate prior, convergence is guaranteed. This is because each player can easily learn to guarantee the maximin payoff. Continuity arguments probably imply that at least for small perturbations of zero-sum games, there will still be convergence. This leads to some hope that convergence can be guaranteed even in general games, or at least under some relatively mild conditions.
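A minimal numerical illustration of the cycling claim for evolutionary dynamics (my own sketch, using the standard replicator dynamics and a crude Euler discretization, which only approximates the closed orbits of the continuous-time dynamics):

```python
import numpy as np

# Rock-paper-scissors payoff matrix for the row player: win = +1, lose = -1, tie = 0.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def replicator_step(x, dt=0.01):
    """One Euler step of the replicator dynamics dx_i/dt = x_i((Ax)_i - x.Ax)."""
    f = A @ x
    return x + dt * x * (f - x @ f)

x = np.array([0.5, 0.3, 0.2])
for _ in range(10000):
    x = replicator_step(x)

# The population shares keep orbiting around (1/3, 1/3, 1/3) rather than
# converging; in continuous time the product x[0]*x[1]*x[2] is conserved
# along these orbits.
print(x)
```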