Disentangling Perspectives On Strategy-Stealing in AI Safety
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Additional thanks to Ameya Prabhu and Callum McDougall for their thoughts and feedback on this post.
Introduction
I’ve noticed that in various posts people will make an offhand reference to “strategy-stealing” or “the strategy-stealing assumption” without clearly defining what they mean by “strategy-stealing”. Part of the trouble is that these posts are often of wildly different flavors, and it’s often not clear how they connect to each other, or under what conditions it might actually be feasible to “steal the strategy” of an AI, or sometimes whether any strategies are being stolen at any point at all.
Spoiler upfront, part of the reason for this is that the term comes from a relatively esoteric game-theoretic concept that basically never applies in a direct sense to the scenarios in which it is name-dropped. In this post, I’ll try to explain the common use of this seemingly-inappropriate term and why people are using it. In order to do this, I’ll first attempt to develop a more general framework for thinking about strategy-stealing, and then I’ll try to extend the intuition that develops to bridge the gap between a couple of seemingly highly disparate posts. Along the way I’ll attempt to clarify the connection to a few other ideas in alignment, and clarify what implications the strategy-stealing framework might have for the direction of other alignment research. I’ll declare up front that I think that in an AI safety context, the strategy-stealing framework is best thought of as a tool for managing certain intuitions about competition; rhetorically the goal of my post is to distill out those aspects of “game-theoretic” strategy-stealing that are broadly applicable to alignment so that we can mostly throw out the rest.
As an aside, I think that due to the (very relatable) confusion I’ve seen elsewhere, strategy-stealing is not actually a good name for the concept in an AI safety context, but the term is already in use and I don’t have any ideas for any better ones, so I’ll continue to just use the term strategy-stealing even in contexts where no strategies are ever stolen. My hope is that I do a good enough job of unifying the existing uses of the term to avoid a proliferation of definitions.
Strategy Stealing in Game Theory
The strategy-stealing argument page on Wikipedia has a very good short explanation of the term, reproduced here:
In combinatorial game theory, the strategy-stealing argument is a general argument that shows, for many two-player games, that the second player cannot have a guaranteed winning strategy. The strategy-stealing argument applies to any symmetric game (one in which either player has the same set of available moves with the same results, so that the first player can “use” the second player’s strategy) in which an extra move can never be a disadvantage.
The argument works by obtaining a contradiction. A winning strategy is assumed to exist for the second player, who is using it. But then, roughly speaking, after making an arbitrary first move – which by the conditions above is not a disadvantage – the first player may then also play according to this winning strategy. The result is that both players are guaranteed to win – which is absurd, thus contradicting the assumption that such a strategy exists.
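For concreteness, here is a compact restatement of that argument in semi-formal notation. This is my own paraphrase of the standard proof sketch for placement games like Gomoku or Hex; the symbols (σ₁, σ₂, m₀) are mine, not the excerpt’s.

```latex
% My own semi-formal paraphrase of the quoted argument; the notation is mine.
Suppose, for contradiction, that the second player has a winning strategy $\sigma_2$.
Construct a first-player strategy $\sigma_1$ as follows:
\begin{enumerate}
  \item Play an arbitrary opening move $m_0$.
  \item Thereafter, pretend $m_0$ was never played, and respond as $\sigma_2$ would to the
        history with $m_0$ deleted. (Symmetry is what guarantees that $\sigma_2$'s
        recommendations are legal and just as effective for the first player.)
  \item If $\sigma_2$ ever recommends the move $m_0$, which is already on the board, play
        another arbitrary move $m_1$ instead and treat $m_1$ as the ignored ``extra'' move
        from then on.
\end{enumerate}
Because an extra move is never a disadvantage, $\sigma_1$ inherits the winning guarantee of
$\sigma_2$; but then both players would have winning strategies, which is impossible, so no
winning strategy for the second player can exist.
```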
Although the quoted explanation makes them fairly clear, it’s worth noting explicitly the following assumptions and limitations of the argument:
The game is a two-player, turn-based game [1].
The game is symmetric (i.e, both players have access to the same set of actions; even more explicitly, both players’ actions induce winning states for themselves under identical conditions). [2]
Excess moves never put you at a disadvantage (though this point is not particularly impactful for the following discussion). [3]
Most importantly, the proof is non-constructive, i.e, it doesn’t tell you how to actually deny P2 a win as P1. The strategy-stealing argument is used to make a claim about whether P2 has a guaranteed winning strategy assuming optimal play. We’re able to obtain a contradiction to this claim by assuming that if P2 had a winning strategy, P1 would be able to deduce what it is. For sufficiently computationally complex games, this either requires extremely large amounts of computational power (to search the game tree for P2’s optimal strategy), or at least some sort of insight into P2’s available set of policies. For example, the strategy-stealing argument can be used to show that P1 never loses Gomoku under theoretically optimal play, but between, say, humans, P2 routinely wins. [4]
Strategy Stealing Intuitions For Competition Among AI Agents: Paul Christiano’s Model
In The Strategy-Stealing Assumption, Paul Christiano says to a crude approximation that “If you squint kinda hard, you can model trying to influence the future as a game which is symmetric with respect to some notion of power.” More concretely, he makes the following argument:
Model the world as consisting of coalitions of humans and AIs, some of which may be unaligned, and model each coalition as controlling some share of generic “resources”.
Assume that a coalition’s ability to acquire “influence” over the future is directly proportional to their control over generic resources. This point is what Paul in this context calls the “strategy-stealing assumption”; in his words: “for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future”.
Then if, for example, the majority of coalitions are approximately aligned with human values, we expect human values to win out in the long run (a toy numerical sketch of this dynamic follows below the list). [5]
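As a sanity check on what the assumption is supposed to buy, here’s a minimal numerical sketch. This is my own toy construction, not something from Paul’s post; the multiplicative growth model and the names `final_shares` and `growth_rates` are illustrative assumptions only.

```python
# A toy sketch (my own construction, not from Paul's post) of what the
# strategy-stealing assumption is meant to buy. Coalitions start with some share of
# generic resources and convert them into more resources/influence each step. If
# everyone converts at the same rate (the "symmetric" case the assumption asserts),
# relative shares are preserved; if an unaligned coalition converts even slightly
# more efficiently, it eventually captures nearly all of the influence.

import numpy as np

def final_shares(initial_shares, growth_rates, steps=500):
    """Grow each coalition's resources multiplicatively and return normalized shares."""
    resources = np.array(initial_shares, dtype=float)
    for _ in range(steps):
        resources = resources * growth_rates
    return resources / resources.sum()

initial = [0.99, 0.01]   # aligned coalitions hold 99% of resources, unaligned hold 1%

# Strategy-stealing assumption holds: both sides grow at the same rate.
print(final_shares(initial, np.array([1.05, 1.05])))  # -> [0.99, 0.01], shares preserved

# Assumption fails: the unaligned coalition grows 2% faster per step.
print(final_shares(initial, np.array([1.05, 1.07])))  # -> aligned share falls below 1%
```

Of course, this elides everything that’s actually contentious, namely whether every coalition really can convert resources into influence at the same rate, which is what the objections discussed below are about.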
Other Assumptions and Sources of Justification for Paul’s Model
You might have noticed by now that even in principle, strictly speaking the game-theoretic notion of strategy-stealing doesn’t apply here, because we’re no longer in the context of a two-player turn-based adversarial game. Actually, instead of what’s commonly meant by “strategy-stealing” in game theory, Paul justifies his model by making reference to Jessica Taylor’s Strategies for Coalitions In Unit Sum Games. In this setting, we’re actually talking about games with multiple agents acting simultaneously, and where payoffs are not necessarily binary but merely unit sum [6].
I think intuitively Paul justifies the use of the term “strategy-stealing” with the idea that, at least in a very similar fashion, we use the symmetry of the game to come to the intuitive conclusion that “coalitions can take advantage of public knowledge, i.e, steal strategies, to obtain power approximately proportionate to their size”. Actually, the results in Jessica’s post are even weaker than that: Theorem 1 only shows that the conclusion holds for a very, very specific type of game, and Theorem 2 assumes that the coalition is of a certain size and also assumes prior knowledge of other players’ strategies. I personally don’t think these theoretical results do much to justify the assumptions of Paul’s model in the more general setting he describes, so it’s not surprising if you can come up with critiques of those assumptions. Anyway, none of this really affects the structure of the rest of the post, since the whole point rhetorically is that we ultimately just want to use intuitions that game theory helps us develop.
Problems Applying the Strategy-Stealing Framework
Rhetorically, the position of Paul’s post is that we may want to find ways to make this “strategy-stealing assumption” approximately true, at which point “all” we have to do is make sure that aligned humans control a sufficiently large proportion of resources. (Of course, we’d also have to address problems with/loopholes in the model and argument above, but his post spends a lot of time rather comprehensively doing that, so I won’t redo his analysis here.) I think the problem with this position is that, from a practical perspective, making this assumption true is basically impossible: it would involve solving a ton of other subproblems. To elaborate, the intuitively confusing thing to me about Paul’s post is that in many commonly imagined AI takeoff scenarios, it’s fairly clear that most of the assumptions you need in the game-theoretic strategy-stealing setting do not hold, not even approximately:
The structure of the game may not actually turn out to be amenable to analysis through the lens of “competition between agents”. Even attempting to analyze a scenario in which “a coalition of aligned humans controls the majority of resources” makes a large number of assumptions, not only about the nature of human values and the ability of humans to coordinate, but also that the world we’ll eventually find ourselves in is multipolar. That is to say, we’re assuming the existence of some actually aligned agents, and that humans are ever in a position to actually “use” or “cooperate” with AI to compete with other coalitions. [7]
Intuitively this game is not symmetric, for a couple broad classes of reasons:
Possibly the AI has access to “actions” that coalitions of other humans (or even AIs) don’t, due to broadly superior intelligence, or perhaps just due to the way it interfaces with the world. People have written hundreds of thousands of words about ways this can be true, so I won’t go into it much further.
Possibly some values are easier to optimize for than others (i.e, influence doesn’t convert to utility, or perhaps having certain values gives you extra bargaining power, e.g, you might be able to threaten other coalitions because you don’t mind if the world is destroyed), which Paul provides various examples of.
The non-constructiveness of the game-theoretic strategy-stealing argument above bites us pretty badly: even if you did have access to the “same set of actions” as an unaligned AI, intuitively you can’t easily emulate its power-acquiring strategy, for reasons broadly related to the existence of the terms “deception” and “unaligned”, unless you also want to make other really strong assumptions about the nature of future AI or future AI interpretability/alignment research. (Really, the boundary between this point and the “asymmetry of actions” point above is not that sharp. Basically, you can think of superintelligent behavior as being a fundamentally inaccessible type of action for human coalitions, or you can think of it as a sense in which we can’t determine an AI’s intended policy due to computational constraints. Either way, I feel that these critiques of the strategy-stealing framework are broad enough that patching them up would constitute solving most of what people typically think of as “the alignment problem”, so this isn’t a very useful factoring of the problem, but perhaps Paul and others see something I don’t here.)
Why might the strategy-stealing framework still be useful?
As I’ve suggested in the introduction and despite my tone up until now, I don’t think the strategy-stealing framework is completely worthless just because Paul’s particular research direction seems infeasible. I think that when people speak about “strategy-stealing”, they’re often pointing to a concept which is not that closely related to the game-theoretic concept, but which takes advantage of some subset of the intuitions it involves, and this can be a really useful framing for thinking somewhat concretely about potential problems for alignment in various scenarios. One major source of intuition I’m thinking of is that often, success in a game may boil down to control of a single “resource”, for which a strong policy is easy to determine, so that no actual stealing of strategies is needed:
In Paul’s case this is what he calls “flexible influence”.
In many simple economic models this is money or cost-efficiency, e.g, the largest of a set of undifferentiated corporations will eventually obtain a monopoly on its industry due to economies of scale (see the toy sketch after this list).
In StarCraft, which is a comparatively very complex game (asymmetric, real-time, imperfect information), basic conventional wisdom is that the player with greater income is going to win in the long run (modulo a bunch of other strategic considerations that make the game actually interesting, so this is only a suggestive example).
In some sense this is the entire intuition behind the notion of instrumental convergence. Some kinds of resources or objectives are sufficiently useful in a general enough sense to impose “symmetry”, in the sense that many agents want to acquire them and can do similar things with them. A very common example of this in AI-relevant scenarios is computational power.
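Here’s the toy sketch promised above for the economies-of-scale example. It’s my own construction purely for illustration: the `unit_cost` function and the customer-reallocation rule are arbitrary stand-ins, chosen only to make the feedback loop visible.

```python
# A toy sketch (my own construction) of the economies-of-scale intuition: if unit
# costs fall with market share and customers drift toward the cheapest producer,
# the largest of several otherwise-identical firms eventually takes the whole market.

import numpy as np

shares = np.array([0.40, 0.35, 0.25])   # initial market shares of undifferentiated firms

def unit_cost(share):
    # Bigger firms produce more cheaply; the exact functional form is arbitrary.
    return 1.0 / (1.0 + share)

for step in range(200):
    costs = unit_cost(shares)
    # Customers slowly reallocate toward cheaper firms (softmax over negative cost).
    attractiveness = np.exp(-10.0 * costs)
    target = attractiveness / attractiveness.sum()
    shares = 0.95 * shares + 0.05 * target
    shares /= shares.sum()

print(np.round(shares, 3))   # the initially-largest firm ends up with nearly all of the market
```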
Ways of Factoring Potential Research Directions
You can think of Paul’s various uses of the term “strategy-stealing” as an attempt to unify the core problems in several potential future scenarios under a single common framework. For starters, there are the 11 different objections he raises in The Strategy-Stealing Assumption, but thinking about the computational constraints on your ability to “steal strategies” also directly motivates his ideas on inaccessible information. Tangentially, you can also see that his framing of the AI safety problem compels him to think about even things like question-answering systems in the context of the way they affect “strategy-stealing” dynamics.
Strategy Stealing Intuitions Within the Alignment Problem
Up until now all the scenarios we’ve explicitly described have assumed the existence of (coalitions of) AIs aligned in different ways, and we’ve applied strategy-stealing concepts to determine how their objectives shape the far future. However, some of the intuitions about “parties competing to affect a result in a game with functionally one important resource” can be applied to processes internal to an individual machine learning model or optimization process (which, importantly, makes this a relevant concept even in the unipolar takeoff case). It’s not as clear how to think about these as it is to think about the world described by Paul Christiano, but I’ll just list some dynamics in machine learning, describe how they might loosely fit into a strategy-stealing framework, and suggest research directions that these might imply.
Broadly speaking, you can think of models in parameter space as competing to be the thing eventually instantiated at the end of an optimization process like SGD. In this context the resource they use to compete is something like “attractor basin size”. You can apply a similar analysis to “qualitatively similar families of models”, where in addition to attractor basins, you also factor in the proportion of models that implement a qualitatively similar behavior (i.e, perhaps there are 20 times as many actual parameter configurations that implement an unsafe policy A as there are parameter configurations that implement a desired policy B, and you’d weight the parameter configurations by the size of their attractor basins). I think this sort of analysis might be a useful framing for understanding the mechanistic dynamics of inductive biases in neural networks.
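To make this slightly more concrete, here’s a minimal sketch of what “measuring” this kind of competition could look like in the simplest possible case. It’s my own toy construction, not an established methodology: a linear model, a training set in which two features are perfectly redundant, and a tally over random initializations of which feature the trained model ends up relying on. In this perfectly symmetric toy the tally comes out roughly even; the interesting (and much harder) question is what the analogous tally looks like for real architectures and tasks.

```python
# A minimal sketch (my own toy construction, not from this post): estimate how often
# gradient descent lands in one "family of models" versus another, as a crude
# stand-in for "number of parameter configurations weighted by attractor basin size".
#
# Toy setup: the two input features are identical on the training data, and the
# label equals that shared value. Two behaviorally distinct families fit the data
# perfectly: models relying on feature 1 and models relying on feature 2. We train
# from many random initializations and tally which family each run converges to.

import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(size=(256, 1))
X = np.hstack([x1, x1])          # shape (256, 2); the two columns are identical
y = x1[:, 0]                     # label equals the shared feature value

def train_once(rng, steps=500, lr=0.1):
    """Plain gradient descent on squared error for a linear model y_hat = X @ w."""
    w = rng.normal(scale=1.0, size=2)    # random initialization
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

counts = {"relies on feature 1": 0, "relies on feature 2": 0}
for _ in range(200):
    w = train_once(rng)
    # Off-distribution probe: which feature does the trained model actually lean on?
    if abs(w[0]) >= abs(w[1]):
        counts["relies on feature 1"] += 1
    else:
        counts["relies on feature 2"] += 1

print(counts)  # the split is a crude empirical proxy for relative basin size
```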
Subnetworks of a large neural net compete for terminal impact using the single resource of “numerical influence on final inference result”. This might be a useful framing for thinking about deceptive alignment problems, particularly things related to gradient hacking, though I haven’t found any work that adopts this framework explicitly.
In Operationalizing Compatibility With Strategy-Stealing, Evan poses the following scenario: suppose we successfully manage to make models robustly attempt to optimize an objective that we specify, and we specify that objective as a linear combination of some values. Then you might find that the values compete using the single resource of “ease-of-being-optimized-for”, with the undesirable result that the values which are “more tractable” to optimize (in his post, “make Google money”) might end up being optimized for at the expense of less optimizable values (“put the world in a good place”), which are not realized at all.
This is a fairly subtle point, but Evan explicitly attempts to distinguish the two similar concepts of “how intrinsically hard something is to optimize” in the sense of the size of the space we’re optimizing over, and “ease-of-being-optimized-for” which is a measure of how readily available optimization power is applied to one objective over others. Evan’s implied research strategy here is to use strategy-stealing-ish considerations about “ways in which subsets of the specified objective might be systematically disadvantaged in an optimization process” to decompose the problem into some mathematically tractable components, which we might be able to independently prove guarantees about. I think that this particular research direction doesn’t seem very tractable due to its abstractness, but that might be just because I don’t know how to turn abstractly good ideas into working mathematics.
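As a minimal illustration of the “ease-of-being-optimized-for” dynamic (my own toy construction, not Evan’s formalism): give a budgeted gradient-based optimizer a linear combination of two objectives, one with much larger gradients than the other, and watch essentially all of the optimization power flow to the steeper one.

```python
# A toy sketch (my own construction, not Evan's formalism) of values competing for
# optimization power. The combined objective weights both values equally, but the
# "tractable" value has much larger gradients, so a budgeted optimizer spends nearly
# all of its effort on it.

import numpy as np

def tractable_value(x):      # stands in for "make Google money": easy to push on
    return 10.0 * x[0]

def hard_value(x):           # stands in for "put the world in a good place": barely moves
    return 0.01 * x[1]

def combined(x):
    return tractable_value(x) + hard_value(x)   # equal weights in the specification

# Budgeted optimization: at each step, move a unit-length distance in the gradient
# direction of the combined objective (a crude stand-in for limited optimization power).
x = np.zeros(2)
grad = np.array([10.0, 0.01])                    # gradient of `combined` everywhere
for _ in range(100):
    x += grad / np.linalg.norm(grad)

print(tractable_value(x), hard_value(x))
# Nearly all of the 100 "units" of optimization went to the tractable value;
# the hard value barely moved, even though both got equal weight in the objective.
```

Note that this toy only captures the second of the two notions Evan distinguishes, i.e, where readily available optimization power flows; it says nothing about the intrinsic size of the space being optimized over.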
As a very speculative curiosity, although I’ve intentionally steered us away from trying to explicitly invoke game-theoretic reasoning, you could attempt to think explicitly about the implications of collusion/coercion among “coalitions” in any of the above scenarios, and this would probably yield new esoteric concerns about ways alignment can fail. This is a little bit conceptually tricky, because some models which you’d think of as “colluding” in this context are never actually instantiated, but you can probably chant “acausal trade” while performing some kind of blood sacrifice in order to coherently go deeper down this rabbit hole.
[1] Some of the arguments can probably be extended to n-player turn-based games with relatively little difficulty, to certain simultaneous games with similarly little difficulty (as we’ll see below), and probably to continuous-time games with moderate difficulty.
[2] This is the reason that the strategy-stealing argument can’t be used to prove a win for Black in Go with komi: the game is not actually symmetric; if you try to pass your first turn to “effectively become P2”, you can’t win by taking (half the board, minus komi, plus 0.5) points, like White can.
[3] For fun, though, this is also one reason why strategy-stealing can’t be used to prove a guaranteed win/draw for White in chess: because of zugzwang, an extra move can sometimes be a disadvantage.
[4] Actually this isn’t the best example, because first-player advantage in raw Gomoku is so huge that Gomoku has been explicitly solved by computers, but we can correct this example by imagining that instead of Gomoku I named a way more computationally complex version of Gomoku, where you win if you get like 800 in a row in like 15 dimensions or something.
[5] This assumes that the proportion of “influence” over the future a coalition holds is roughly proportional to the fraction of maximum possible utility they could achieve if everyone were aligned. There are obvious flaws in this assumption, which Paul discusses in his post.
[6] This means that the use of the phrase “human values… win out” above is doing a little bit of subtle lifting. Under the assumptions of Paul’s model, humans with 99% of flexible influence can achieve 99% of maximum utility in the long run. IMHO it’s a question of moral philosophy whether this is an acceptable outcome; Paul bites the bullet and assumes that it is for his analysis.
[7] Furthermore, depending on the exact scenario you’re analyzing, you might have to make the assumption that aligned AIs are designed such that humans can effectively cooperate with them, which starts to bleed into considerations about interpretability and corrigibility. This wasn’t discussed in Paul’s original post, but it did come up in the comments.
I understood the idea of Paul’s post as: if we start in a world where humans-with-aligned-AIs control 50% of relevant resources (computers, land, minerals, whatever), and unaligned AIs control 50% of relevant resources, and where the strategy-stealing assumption is true—i.e., the assumption that any good strategy that the unaligned AIs can do, the humans-with-aligned-AIs are equally capable of doing themselves—then the humans-with-aligned-AIs will wind up controlling 50% of the long-term future. And the same argument probably holds for 99%-1% or any other ratio. This part seems perfectly plausible to me, if all those assumptions hold.
Then we can talk about why the strategy-stealing assumption is not in fact true. The unaligned AIs can cause wars and pandemics and food shortages and removing-all-the-oxygen-from-the-atmosphere to harm the humans-with-aligned-AIs, but not so much vice-versa. The unaligned AI can execute a good strategy which the humans-with-aligned-AIs are too uncoordinated to pull off; instead the latter will just be bickering amongst themselves, hamstrung by following laws and customs and taboos etc., and not having a good coherent idea of what they’re trying to do anyway. The aligned AIs might be less capable than an unaligned AI because of “alignment tax”: we make them safe by making them less powerful (they act conservatively, there are humans in the loop, etc.). And so on and so forth. All this stuff is in Paul’s post, I think.
I feel like Paul’s post is a great post in all those details, but I would have replaced the conclusion section with
“So, in summary, for all these reasons, the strategy-stealing assumption (in this context) is more-or-less totally false and we shouldn’t waste our time thinking about it”
whereas Paul’s conclusion section is kinda the opposite. (Zvi’s comment along the same lines.)
I feel like a lot of this post is listing reasons that the strategy-stealing assumption is false (e.g. humans don’t know what they’re trying to do and can’t coordinate with each other regardless), which are mostly consistent with Paul’s post. It also notes that there are situations in which we don’t care whether the strategy-stealing assumption is true or false (e.g. unipolar AGI outcomes, situations where all the AIs are misaligned, etc.).
And then other parts of the post are, umm, I’m not sure, sending “something is wrong” vibes that I’m not really understanding or sympathizing with…