The Commitment Races problem
[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]
This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:
Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.
For this post I use “consequentialists” to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don’t care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. “obeying the moral law” is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.
Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible
Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won’t retaliate.
Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.
In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won’t give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.
So, amongst consequentialists, there is sometimes a huge advantage to “winning the commitment race.” If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in a position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so—unless the would-be victim credibly commits to not giving in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.” In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
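To make the Chicken example concrete, here is a minimal sketch (the payoff numbers are arbitrary, chosen only for illustration): a consequentialist simply best-responds to whatever it has learned about its opponent, so once the opponent has visibly committed to going straight, swerving is the consequentialist's best remaining option.

```python
# Chicken with illustrative payoffs (row player, column player):
# both swerve: 0 each; one swerves: the swerver gets -1, the other +1;
# both go straight: crash, -10 each.
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("straight", "straight"): (-10, -10),
}

def best_response(opponent_action):
    """A consequentialist picks whatever maximizes its own payoff,
    given what it has learned about the opponent."""
    return max(["swerve", "straight"],
               key=lambda a: PAYOFFS[(a, opponent_action)][0])

# If the opponent has credibly committed to "straight", the consequentialist
# swerves and eats the -1; if the opponent is known to swerve, it goes straight.
print(best_response("straight"))  # swerve
print(best_response("swerve"))    # straight
```

Whoever commits first effectively chooses which cell of the table the game ends in; the other player's rationality just fills in the best response.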
Note: “first” and “later” in the above don’t refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what’s going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they’ve learned about by t.
Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can’t be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won’t say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible, and there is constant pressure to make commitments quickly rather than think longer. (That’s my take on it, anyway.)
Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) “I commit to making you lose if you do that move.” In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike—and possibly outright bullying—by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn’t comply. Practice what you preach!)
When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.
This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.
Sometimes commitments can be made “at the same time”—i.e. in ignorance of each other—in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)
Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request—perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying—so many years later, when they meet and learn about each other, they end up locked into all-out war.
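To see how easily two reasonable-sounding fairness standards can diverge, here is a rough numerical sketch (the bargaining frontier and disagreement point are arbitrary choices for illustration): the Nash solution maximizes the product of the players' gains, the Kalai-Smorodinsky solution gives each player an equal fraction of their best feasible payoff, and on the same problem the two standards pick different points.

```python
import numpy as np

# Toy bargaining problem: Pareto frontier u2 = 1 - u1**2 for u1 in [0, 1],
# disagreement point (0, 0). A grid search is crude but fine for illustration.
u1 = np.linspace(0.0, 1.0, 100_001)
u2 = 1.0 - u1**2

# Nash Bargaining Solution: maximize the product of gains over the disagreement point.
nash = np.argmax(u1 * u2)

# Kalai-Smorodinsky: the frontier point where each player gets the same fraction
# of their ideal payoff (both ideals are 1 here), i.e. the point where u1 == u2.
ks = np.argmin(np.abs(u1 - u2))

print("Nash:              u1 = %.3f, u2 = %.3f" % (u1[nash], u2[nash]))  # ~0.577, 0.667
print("Kalai-Smorodinsky: u1 = %.3f, u2 = %.3f" % (u1[ks], u2[ks]))      # ~0.618, 0.618
# An agent who treats "demanding more than Nash" as bullying and one who treats
# "demanding more than KS" as bullying will disagree about whether player 1
# asking for 0.6 is a reasonable request or an outrageous demand.
```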
I’m not saying disastrous AGI commitments are the default outcome; I’m saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We’d wish we had built a paperclip maximizer instead.
Objection: “Surely they wouldn’t be so stupid as to make those commitments—even I could see that bad outcome coming. A better commitment would be...”
Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven’t anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don’t know so much.
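The centipede-game analogy can be made concrete with a small backward-induction sketch (the pot sizes and the 75/25 split are arbitrary illustrative choices): at every node the mover would rather grab the pot now than let the opponent grab a bigger pot next turn, so play unravels and the game ends immediately, even though both players would have done far better by waiting.

```python
# Toy centipede game, illustrative numbers only: passing doubles the pot and hands
# the move to the other player; taking ends the game with the mover getting 75%
# and the other player 25%; if the final mover passes, the pot is split evenly.
def solve(rounds=6, pot=4.0, take_share=0.75):
    """Backward induction. Returns (round at which someone takes, payoffs by player)."""
    def node(i, pot):
        mover = i % 2
        take = [0.0, 0.0]
        take[mover] = take_share * pot
        take[1 - mover] = (1 - take_share) * pot
        if i == rounds - 1:
            taken_round, passed = None, [pot / 2, pot / 2]  # final pass: even split
        else:
            taken_round, passed = node(i + 1, pot * 2)
        if take[mover] >= passed[mover]:  # the consequentialist mover takes
            return i, take
        return taken_round, passed
    return node(0, pot)

print(solve())  # (0, [3.0, 1.0]): the first mover takes immediately, although
                # passing to the end would give both players far more
                # (a take at the final node yields [32.0, 96.0]).
```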
Objection: “Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?”
Reply: Several points:
1. Devastating commitments (e.g. “Grim Trigger”) are much more possible with AGI—just alter the code! Inigo Montoya is a fictional character and even he wasn’t able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.
2. Credibility is also much easier to establish, especially in an acausal context (see above).
3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.
4. Reputation effects may not restrain AGIs in the way that matters to consequentialists, partly because threats can be made more devastating (see above) and partly because an AGI may not believe it exists in a population of other powerful agents who will bully it if it shows weakness.
5. Finally, these terrible things (brutal threats, costly fights) already happen to some extent among humans today—especially in situations of anarchy. We want the AGI we build to be less likely to do that stuff than humans, not merely as likely.
Objection: “Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it’s-too-late arguments, and thus will be incapable of hurting anyone.”
Reply: That would be nice, wouldn’t it? Let’s hope so, but not count on it. Indeed, perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...
Anecdote: A friend of mine, when she was a toddler, would threaten her parents: “I’ll hold my breath until you give me the candy!” Imagine how badly things would have gone if she had been physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.
Conclusion
Overall, I’m not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if “solving bargaining” turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.
Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.