Pavlov Generalizes
Sarah Constantin recently wrote about the Pavlov strategy for iterated prisoner’s dilemma. She points out that it deserves to be better-known, given that it has similar virtues to Tit-for-Tat (TfT). I strongly agree.
Much of my thinking about puzzles in decision theory such as cooperation/coordination/bargaining and logical causality (through the lens of logical time) has been shaped by an assumption that TfT-type behavior is “correct” in some sense, and needs to be appropriately generalized. These intuitions were formed without an awareness of Pavlov, and so are worth reconsidering.
It turns out that Pavlov-type behavior is much easier to generalize than TfT. Pavlov-esque strategies only require seeing when things are going well or poorly for you. TfT-esque strategies require an agent to understand what cooperation vs defection means in a particular game. This requires the agent to understand the wider rules of the game (not just the payoff on a given round), locate the other agents involved (distinguishing them from the environment), infer the values of the other agents, define a set of common cooperative norms which can be coordinated on, and figure out whether other agents have followed them! Several of these elements are both difficult in practice and difficult to define in principle.
In this post, I would like to explain how to generalize Pavlov to a wide set of games (achieving some surprisingly strong, but not actually very good, coordination properties), and also convey some intuitions about Pavlov-style vs TfT-style coordination.
Generalizing Pavlov
The paper Achieving Pareto Optimality Through Distributed Learning by Marden, Young, and Pao describes a strategy which allows players in an iterated game to eventually converge to a Pareto-optimal action profile with high probability. In fact, the strategy described converges to action profiles which maximize the sum of all the utilities! That should sound too good to be true. Some big caveats are coming.
The strategy is, roughly, as follows. Agents have two different states: content and discontent. When content, the agent copies its own action from the previous round, whatever it was. When discontent, the agent plays randomly.
An agent flips between content and discontent based on how well things are going for it. There is a parameter, experimentation rate (e.r.), which determines how likely an agent is to flip states.
Discontent: Discontent agents look at how good the previous round was for them. They switch to “content” with a probability that depends on the previous round’s payoff and the e.r.: the higher the payoff and the higher the e.r., the more likely they are to become content. A low e.r. makes them very picky: they remain discontent most of the time, and lowering the e.r. cuts the chance of becoming content much more sharply after low payoffs than after high ones.
Content, consistent payoff: A content agent’s behavior depends on whether the previous round’s payoff matches the one from the round before it. If the previous and twice-previous rounds had the same payoff, a content agent remains content and is very likely to repeat its previous action, trying a different action only at the experimentation rate.
Content, inconsistent payoff: If the previous round’s payoff differs from the twice-previous round’s, a content agent is likely to become discontent (it remains content with probability increasing in the payoff and the e.r.).
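To make the dynamics concrete, here is a minimal sketch of the content/discontent strategy in an iterated two-player game. The payoff normalization, the specific acceptance probabilities (powers of the experimentation rate), and the prisoner’s-dilemma payoff matrix are all illustrative assumptions, not the exact construction from the paper.

```python
import random

# Illustrative prisoner's dilemma payoffs, normalized to (0, 1).
PAYOFFS = {
    ("C", "C"): (0.6, 0.6),
    ("C", "D"): (0.1, 0.9),
    ("D", "C"): (0.9, 0.1),
    ("D", "D"): (0.2, 0.2),
}
ACTIONS = ["C", "D"]
ER = 0.05  # experimentation rate (epsilon)

class Agent:
    def __init__(self):
        self.content = False
        self.action = random.choice(ACTIONS)
        self.prev_payoff = None

    def choose(self):
        if self.content and random.random() > ER:
            return self.action            # content: repeat previous action
        return random.choice(ACTIONS)     # discontent or exploring: randomize

    def update(self, action, payoff):
        if not self.content:
            # Discontent: become content with probability increasing in the
            # payoff and in ER (a low ER makes the agent very picky).
            if random.random() < ER ** (1.0 - payoff):
                self.content = True
        elif payoff != self.prev_payoff:
            # Content, inconsistent payoff: likely to become discontent,
            # staying content with probability increasing in payoff and ER.
            if random.random() > ER ** (1.0 - payoff):
                self.content = False
        self.action = action
        self.prev_payoff = payoff

a, b = Agent(), Agent()
for _ in range(200_000):
    x, y = a.choose(), b.choose()
    ux, uy = PAYOFFS[(x, y)]
    a.update(x, ux)
    b.update(y, uy)
print(a.content, b.content, a.action, b.action)
```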
The nice property this strategy has is that for a low e.r., players end up eventually taking an efficient set of actions with high probability. It accomplishes this without agents knowing anything about each other; everyone acts on the payoffs they receive alone.
This is kind of nice, but looking at the strategy, we can see that the convergence rate will be very poor. The reason it works is that all agents have to become content at once in order to stay that way with significant probability. If only some agents become content, then the payoffs of the game will remain inconsistent, due to the agents who are still randomizing. As e.r. gets small, agents become very unlikely to become content at all, and far less likely still when they’ve personally gotten a poor outcome. So, what happens overall is that everyone thrashes around for a long time, ensuring that all the different possible combined action profiles get explored. When everyone happens to become content at the exact same time, they stay that way for a very long time. This is an unlikely coincidence, but it is far more likely when the combined set of actions in a round maximizes joint utility.
What’s Pavlov-like About This?
The basics of TfT are to copy the other player’s move from the previous round, whereas the basics of Pavlov are to copy your own move from the previous round if things went well, and otherwise change. This is captured nicely by the content/discontent idea.
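For concreteness, here is a minimal sketch of the two base strategies in iterated prisoner’s dilemma (the history-list interface and function names are my own, not standard):

```python
def tit_for_tat(my_history, their_history):
    # TfT: copy the other player's previous move; cooperate on round one.
    return their_history[-1] if their_history else "C"

def pavlov(my_history, their_history):
    # Pavlov ("win-stay, lose-shift"): repeat your own previous move if the
    # last round went well (here: the other player cooperated), else switch.
    if not my_history:
        return "C"
    stay = their_history[-1] == "C"
    return my_history[-1] if stay else ("D" if my_history[-1] == "C" else "C")
```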
On the other hand, the convergence behavior described—where players flail around for a very long time before they settle on a jointly efficient strategy—doesn’t sound much like Pavlov.
In Prisoner’s Dilemma, we (outside of the game) have an idea of what “good” and “bad” outcomes are, so that “copy your strategy if they cooperate; change your strategy if they defect” makes sense. The content/discontent algorithm, on the other hand, doesn’t know what payoffs are reasonable to expect. That’s why Pavlov is easily contented whereas the content/discontent algorithm isn’t.
If we know ahead of time which outcomes are reasonable (which is easy in a symmetric game, but more generally has to rely on a notion of fairness such as the Nash Bargaining Solution), we can program agents to switch to “content” as soon as they see payoffs at least as high as what they can reasonably expect to get. This makes the strategy more like simple Pavlov. However, if the agents try to do this kind of reasoning, it will tend to make them exploitable: “bully” agents can convince them to be content with low payoffs by making high payoffs seem impossible.
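Continuing the earlier sketch, one hypothetical way to encode this is an aspiration threshold: a “fair” payoff level (here a hard-coded constant, which is exactly where the exploitability worry comes in) at or above which the agent becomes content immediately.

```python
FAIR_PAYOFF = 0.6  # assumed "reasonable" payoff, e.g. from a symmetric game

def become_content_prob(payoff, er=0.05):
    # Accept immediately at or above the aspiration level; otherwise fall
    # back to the picky epsilon-based rule from the earlier sketch.
    if payoff >= FAIR_PAYOFF:
        return 1.0
    return er ** (1.0 - payoff)
```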
Even so, it doesn’t make sense to set e.r. extremely low, which is what’s necessary to guarantee eventual Pareto-optimality. In practice you’d rather have a moderate e.r., accepting the chance of settling on suboptimal action profiles.
There’s something of a trade-off between exploitability and convergence rate. If e.r. is high, then agents settle easily, so bullies can take advantage by only offering exploitative opportunities. If e.r. is moderate, then bullies have to be verrrry patient to pull that off. The generalized-Pavlov agent incentivises cooperation by being more likely to settle for better offers. If e.r. is low, though, the probability of becoming content is so low that there’s no incentive for other agents to care; they’ll just deal with your randomization in whatever way is optimal for them.
Cluster Comparison
Using the MTG color wheel to discuss AI alignment has been popular recently, so… according to me, TfT is a very “white” strategy, and Pavlov is a very “red” strategy. (For the rest of the post, “Pavlov” and “TfT” will refer to more general clusters rather than the specific prisoner’s dilemma strategies.)
TfT is eye-for-an-eye justice. There are rules, and there are proportional consequences. Cooperation is good, and is met with cooperation. Defection is evil, and is punished with defection in turn. TfT is strongest in an environment with many TfT agents and many less-cooperative agents; the group of TfT agents works together and provides effective punishment for the defector agents.
Pavlov is do-what-works-for-you freedom. There are no rules; there is no concept of good and evil. When things aren’t working well, you leave a situation; when things are working well, you stay. Pavlov doesn’t really punish others, it just randomizes; it incentivises cooperation mainly through positive reinforcement. Likewise, Pavlov responds only to positive reinforcement; negative reinforcement can only elicit randomization, so you can’t usually use it to get a Pavlov agent to do what you want. Pavlov doesn’t punish defection as well as TfT does, but it takes advantage of overly cooperative agents in a way TfT doesn’t. So, Pavlov agents thrive in environments with naive cooperators, but will tend to exploit those naive cooperators out of existence.
Here’s some discussion of what the two strategy clusters look like. (I’ll leave open the question of which strategies might correspond to the other MTG colors.)
TfT Cluster
Even within the domain of iterated prisoner’s dilemma, there are many strategies which are still generally within the TfT cluster. Sarah discusses some of them.
Moving away from an iterated setting and into a logical one, TfT-like reasoning tries to copy the move of the other player on this round, rather than the previous round. When reasoning about the other player with logical uncertainty, a natural strategy is NicerBot, which cooperates with probability epsilon higher than its estimate of the other player’s probability of cooperation. Adding epsilon ensures that two NicerBots cooperate with each other (otherwise they could end up converging to any probability of cooperation); more generally, NicerBot matches the cooperativeness of other players.
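A toy version of the idea (my own sketch; the real NicerBot is defined in terms of logically-uncertain estimates, not a bare numeric input):

```python
def nicerbot_coop_prob(estimated_other_coop: float, eps: float = 0.05) -> float:
    # Cooperate with probability epsilon above the estimated probability
    # that the other player cooperates, capped at 1.
    return min(1.0, estimated_other_coop + eps)

# Two NicerBots estimating each other accurately: iterating p -> min(1, p + eps)
# from any starting point climbs to 1, i.e. mutual cooperation; without the
# +eps, any probability p would be a fixed point.
p = 0.0
for _ in range(30):
    p = nicerbot_coop_prob(p)
print(p)  # 1.0
```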
There is an important difficulty in generalizing TfT-like strategies further: you don’t want to cooperate with a rock which has “I cooperate” engraved on it. In other words, it doesn’t make sense to cooperate with non-agents. TfT is a strategy which makes sense when we’ve identified that we’re dealing with another agent, but we need to combine its behavior with more general decision-theoretic reasoning, somehow.
In open-source game theory with proof-based agents, the Löbian handshake could be considered a TfT-like concept. However, unlike with NicerBot, we naturally only cooperate with other agents; we defect against a rock with “I cooperate” written on it. So, naively, it seems like we’re in a better position to generalize TfT-like reasoning in pure logic than we are in logically-uncertain reasoning. This puzzle is discussed more in this post.
The Löbian handshake is still fragile to Agent Simulates Predictor type problems. One way of interpreting this (which I think is the right way) is that although a Löbian-handshake-capable agent won’t cooperate with rocks, its notion of “rock” is anything with much less processing power than itself. This is problematic, because it only allows cooperation with similarly-intelligent agents.
The intuitive notion of logical causality I associate with TfT is one which combines the logically-uncertain reasoning of NicerBot and the Löbian handshake of the proof-based setting. It is very TDT-like: you reason about your decision as if you have logical control over relevantly similar agents. You coordinate with other agents by reasoning in the same way about what the optimal joint strategy would be.
Pavlov Cluster
Pavlov-type reasoning doesn’t have the same difficulties generalizing as TfT. Whereas TfT seems to require extra pieces added on (as in PrudentBot) to take advantage of non-agents, Pavlov does this naturally, while also naturally finding cooperative equilibria with other Pavlov agents.
It seems possible that Pavlov-cluster algorithms can do well in Agent Simulates Predictor type problems, too. Pavlov happily defects against a rock because the rock exerts no agentic pressure toward cooperation. However, in Agent Simulates Predictor, the predictor does exert “agentic pressure”: it makes things go poorly for agents who predictably defect. A Pavlov agent will respond to this by being less likely to settle on defection as a strategy. (This is very hand-wavy, however. It would be good to construct a setting which illustrates this behavior.)
I’ve already described a generalization to deterministic iterated games, although the convergence behavior is admittedly quite bad (perhaps it can be improved). It isn’t hard to generalize the strategy I described to games with stochastic payoffs, if one is willing to use hacks like aggregating iterations into batches to de-noise the payout information. It would be interesting to see a non-hacky way of generalizing it.
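For instance (a sketch of the hack, not a worked-out proposal): run the game in batches, average the noisy payoffs within a batch, and feed only the batch average into the content/discontent update. The `play_round` callback below is a hypothetical stand-in for one noisy round of the game; `agent` is assumed to follow the earlier sketch’s interface.

```python
def run_batched(agent, play_round, batch_size=50):
    # play_round(action) is assumed to return a noisy payoff for `action`.
    action = agent.choose()
    payoffs = [play_round(action) for _ in range(batch_size)]
    denoised = sum(payoffs) / batch_size
    agent.update(action, denoised)   # treat one whole batch as one "round"
```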
Moving to a more logical setting, we may again want to substitute the notion of “previous round” with “prediction”. TfT-generalization matched its behavior to the predicted behavior of the other agent; Pavlov would instead copy its own predicted behavior! Or, rather: if Pavlov expects things to go well, it tries to do what it would expect itself to do. If Pavlov expects things to go poorly, it tries to invalidate its self-prediction.
(This doesn’t seem obviously good, but, perhaps it can lead somewhere.)
I’m not sure what notion of logical control should be associated with Pavlov. It feels connected with Decision Theory is for Making Bad Outcomes Inconsistent: Pavlov is trying to “shake off” bad outcomes, as if that made them less probable, much like the (apparently) bad reasoning discussed in The Happy Dance Problem. The content/discontent algorithm shows that something like this can work in the long run (though not in the short run).
Another idea is that the content/discontent algorithm controls the stationary distribution of the Markov chain (the iterated play) by setting the transition probabilities so as to steer toward optimal outcomes preferentially. Perhaps there is a notion of logical control in which an agent thinks of itself as steering in this way.
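To gesture at that picture (a toy model with made-up transition probabilities, not derived from the algorithm): lump the joint process into a “settled on an efficient profile” state and a “thrashing” state, make leaving the good state an epsilon-squared event while entering it is merely an epsilon event, and the stationary distribution concentrates on the good state as epsilon shrinks.

```python
import numpy as np

eps = 0.01
P = np.array([
    [1 - eps**2, eps**2],   # settled-and-efficient state is very sticky
    [eps,        1 - eps],  # thrashing occasionally lands in the good state
])

# Stationary distribution: for a 2-state chain, detailed balance gives
# pi_good * P[0, 1] = pi_bad * P[1, 0].
pi_good = P[1, 0] / (P[0, 1] + P[1, 0])
print(pi_good)  # ~0.99: probability mass steered toward the good state
```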
Overall, the Pavlov side of things is very sketchy. As competing visions, TfT-like reasoning may hold up better in the long run. Still, it seems useful to try to flesh out Pavlov-type reasoning as an alternative.
Not super important but maybe worth mentioning in the context of generalizing Pavlov: the strategy Pavlov for the iterated PD can be seen as an extremely shortsighted version of the law of effect, which basically says: repeat actions that have worked well in the past (in similar situations). Of course, the LoE can be applied in a wide range of settings. For example, in their reinforcement learning textbook, Sutton and Barto write that LoE underlies all of (model-free) RL.
Somewhat true, but without further bells and whistles, RL does not replicate the Pavlov strategy in Prisoner’s Dilemma, so I think looking at it that way is missing something important about what’s going on.