Consider any finite two-player game in normal form (each player can have any finite number of strategies; we can also easily generalize to certain classes of infinite games). Let $S_A$ be the set of pure strategies of player A and $S_B$ the set of pure strategies of player B. Let $u_A : S_A \times S_B \to \mathbb{R}$ be the utility function of player A. Let $(\alpha,\beta) \in \Delta S_A \times \Delta S_B$ be a particular (mixed) outcome. Then the alignment of player B with player A in this outcome is defined to be:
$$a_{B/A}(\alpha,\beta) := \frac{\mathbb{E}_{\alpha\times\beta}[u_A] - \min_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A]}{\max_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A] - \min_{\beta'\in S_B}\mathbb{E}_{\alpha\times\beta'}[u_A]} \in [0,1]$$
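As a minimal sketch of the definition above, the coefficient can be computed directly from A's payoff matrix. The function name `alignment`, the matrix layout `u_A[i][j]`, and the example payoffs are assumptions made for illustration; they are not part of the original proposal. Taking the min and max over B's pure strategies is enough because A's expected utility is linear in B's mixed strategy.

```python
def alignment(u_A, alpha, beta):
    """Alignment a_{B/A}(alpha, beta) of B with A, or None when undefined.

    u_A[i][j] is A's payoff when A plays pure strategy i and B plays j;
    alpha and beta are probability vectors over the rows and columns.
    """
    n_rows, n_cols = len(u_A), len(u_A[0])
    # A's expected payoff against alpha for each pure strategy j of B.
    per_column = [sum(alpha[i] * u_A[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    # A's expected payoff at the actual outcome (alpha, beta).
    expected = sum(beta[j] * per_column[j] for j in range(n_cols))
    worst, best = min(per_column), max(per_column)
    if best == worst:
        return None  # B cannot affect A's expected payoff: the ratio is 0/0
    return (expected - worst) / (best - worst)


# Hypothetical 2x2 payoffs for A, chosen only to exercise the formula.
u_A = [[3, 0],
       [1, 2]]
example = alignment(u_A, alpha=[0.5, 0.5], beta=[0.25, 0.75])
print(example)  # 0.25
```

Here A's expected payoff is 2 if B plays the first column and 1 if B plays the second, so B's mix yields 1.25, a quarter of the way from worst to best.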
Of course, so far it doesn’t depend on $u_B$ at all. However, we can make it depend on $u_B$ if we use $u_B$ to impose assumptions on $(\alpha,\beta)$, such as:
$\beta$ is a $u_B$-best response to $\alpha$, or
$(\alpha,\beta)$ is a Nash equilibrium (or other solution concept)
Caveat: If we go with the Nash equilibrium option, $a_{B/A}$ can become “systematically” ill-defined (consider e.g. the Nash equilibrium of matching pennies, where A’s uniform mixing gives A the same expected utility for every response of B, so the denominator is 0). To avoid this, we can switch to the extensive-form game where B chooses their strategy after seeing A’s strategy.
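The caveat can be checked numerically. The sketch below assumes one particular reading of the extensive-form fix: B observes A's realized action, so a pure strategy for B is a map from A's action to B's reply. The helper `alignment` and all variable names are illustrative, not taken from the thread.

```python
from itertools import product


def alignment(u_A, alpha, beta):
    """a_{B/A}(alpha, beta), or None when the denominator is zero."""
    n_rows, n_cols = len(u_A), len(u_A[0])
    per_column = [sum(alpha[i] * u_A[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    expected = sum(beta[j] * per_column[j] for j in range(n_cols))
    worst, best = min(per_column), max(per_column)
    if best == worst:
        return None
    return (expected - worst) / (best - worst)


# Matching pennies from A's side: A wins 1 on a match and loses 1 otherwise.
matching_pennies = [[1, -1],
                    [-1, 1]]
uniform = [0.5, 0.5]

# Simultaneous play at the Nash equilibrium: the ratio is 0/0.
simultaneous = alignment(matching_pennies, uniform, uniform)

# Extensive form (assumed reading): B replies after seeing A's action.
# A pure strategy for B is a pair f = (reply to action 0, reply to action 1).
responses = list(product(range(2), repeat=2))
u_ext = [[matching_pennies[i][f[i]] for f in responses] for i in range(2)]

# With u_B = -u_A, B's best response is to always mismatch: f = (1, 0).
always_mismatch = [1.0 if f == (1, 0) else 0.0 for f in responses]
sequential = alignment(u_ext, uniform, always_mismatch)

print(simultaneous)  # None
print(sequential)    # 0.0
```

In the sequential version B's strategies range from "always match" (A gets 1) to "always mismatch" (A gets -1), so the denominator is 2 and the coefficient is well-defined.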
In a sense, your proposal quantifies the extent to which B selects a best response on behalf of A, given some mixed outcome. I like this. I also think that “it doesn’t necessarily depend on $u_B$” is a feature, not a bug.
EDIT: To handle common-payoff (more precisely, constant-payoff) games, we might want to define the alignment to equal 1 if the denominator is 0. In that case, the response of B can’t affect A’s expected utility, and so it’s not possible for B to act against A’s interests. So we might as well say that B is (trivially) aligned, given such a mixed outcome?
In common-payoff games the denominator is not zero, in general. For example, suppose that $S_A = S_B = \{a,b\}$, $u_A(a,a) = u_A(b,b) = 1$, $u_A(a,b) = u_A(b,a) = 0$, $u_B \equiv u_A$, $\alpha = \beta = \delta_a$. Then $a_{B/A}(\alpha,\beta) = 1$, as expected: the current payoff is 1, and if B played $b$ it would be 0.
You’re right. Per Jonah Moss’s comment, I happened to be thinking of games where payoff is constant across players and outcomes, which is a very narrow kind of common-payoff (and constant-sum) game.
I don’t think in this case $a_{B/A}$ should be defined to be 1. It seems perfectly justified to leave it undefined, since in such a game B can be equally well conceptualized as maximally aligned or as maximally anti-aligned. It is true that if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 1$, then it’s natural to include the undefined cases too. But if, out of some set of objects, you consider the subset of those that have $a_{B/A} = 0$, then it’s also natural to include the undefined cases. This is similar to how $(0,0) \in \mathbb{R}^2$ is simultaneously in the closure of $\{x/y = 1\}$ and in the closure of $\{x/y = -1\}$, so $0/0$ can be considered to be either $1$ or $-1$ (or any other number) depending on context.
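The closure analogy can be spelled out along two paths into the origin, reading the two sets as $\{x/y = 1\}$ and $\{x/y = -1\}$. The parameter $t$ is introduced here only for illustration.

```latex
\[
(t,\, t) \xrightarrow[t \to 0]{} (0,0)
\quad\text{with}\quad \frac{t}{t} = 1 \ \text{ for all } t \neq 0,
\qquad
(t,\, -t) \xrightarrow[t \to 0]{} (0,0)
\quad\text{with}\quad \frac{t}{-t} = -1 \ \text{ for all } t \neq 0.
\]
```

Both paths stay inside their respective sets and converge to the same point, which is why no single value of the ratio at $(0,0)$ is forced.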
This also suggests that “selfless” perfect B/A alignment is possible in zero-sum games, with the “maximal misalignment” only occurring if we assume B plays a best response. I think this is conceptually correct, and not something I had realized pre-theoretically.
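A small sketch of this point, under assumed payoffs: in a zero-sum game the coefficient is 1 when B "selflessly" picks A's favourite column, 0 when B best-responds to its own utility, and in between for mixtures. The payoff numbers and names below are hypothetical.

```python
def alignment(u_A, alpha, beta):
    """a_{B/A}(alpha, beta), or None when the denominator is zero."""
    n_rows, n_cols = len(u_A), len(u_A[0])
    per_column = [sum(alpha[i] * u_A[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    expected = sum(beta[j] * per_column[j] for j in range(n_cols))
    worst, best = min(per_column), max(per_column)
    if best == worst:
        return None
    return (expected - worst) / (best - worst)


# Hypothetical zero-sum game: u_B is the negation of u_A.
u_A = [[2, -1],
       [0, 1]]
u_B = [[-x for x in row] for row in u_A]
alpha = [1, 0]  # A plays its first pure strategy

# B's own expected payoff for each of its pure strategies, given alpha.
b_payoffs = [sum(alpha[i] * u_B[i][j] for i in range(2)) for j in range(2)]
j_star = max(range(2), key=lambda j: b_payoffs[j])
best_response = [1 if j == j_star else 0 for j in range(2)]

selfish = alignment(u_A, alpha, best_response)  # B maximizes u_B
selfless = alignment(u_A, alpha, [1, 0])        # B picks A's best column
coin_flip = alignment(u_A, alpha, [0.5, 0.5])   # B mixes evenly

print(selfish, selfless, coin_flip)  # 0.0 1.0 0.5
```

The zero-sum structure only pins the coefficient to 0 once B is assumed to best-respond; the formula itself never consults `u_B`.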
✅ Pending unforeseen complications, I consider this answer to solve the open problem. It essentially formalizes B’s impact alignment with A, relative to the counterfactuals where B did the best or worst job possible.
There might still be other interesting notions of alignment, but I think this is at least an important notion in the normal-form setting (and perhaps beyond).
I agree that this is measuring something of interest, but it doesn’t feel to me as if it solves the problem I thought you said you had.
This describes how well aligned an individual action by B is with A’s interests. (The action in question is B’s choice of (mixed) strategy β, when A has chosen (mixed) strategy α.) The number is 0 when B chooses the worst-for-A option available, 1 when B chooses the best-for-A option available, and in between scales in proportion to A’s expected utility.
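But your original question was, on the face of it, looking for something that describes the effect on alignment of a game rather than one particular outcome:
“In my experience, constant-sum games are considered to provide ‘maximally unaligned’ incentives, and common-payoff games are considered to provide ‘maximally aligned’ incentives. How do we quantitatively interpolate between these two extremes?”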
or perhaps the alignment of particular agents playing a particular game.
I think Vanessa’s proposal is the right answer to the question it’s answering, but the question it’s answering seems rather different from the one you seemed to be asking. It feels like a type error: outcomes can be “good”, “bad”, “favourable”, “unfavourable”, etc., but it’s things like agents and incentives that can be “aligned” or “unaligned”.
When we talk about some agent (e.g., a hypothetical superintelligent AI) being “aligned” to some extent with our values, it seems to me we don’t just mean whether or not, in a particular case, it acts in ways that suit us. What we want is that in general, over a wide range of possible situations, it will tend to act in ways that suit us. That seems like something this definition couldn’t give us—unless you take the “game” to be the entirety of everything it does, so that a “strategy” for the AI is simply its entire program, and then asking for this coefficient-of-alignment to be large is precisely the same thing as asking for the expected behaviour of the AI, across its whole existence, to produce high utility for us. Which, indeed, is what we want, but this formalism doesn’t seem to me to add anything we didn’t already have by saying “we want the AI’s behaviour to have high expected utility for us”.
It feels to me as if there’s more to be done in order to cash out e.g. your suggestion that constant-sum games are ill-aligned and common-payoff games are well-aligned. Maybe it’s enough to say that for these games, whatever strategy A picks, B’s payoff-maximizing strategy yields Kosoy coefficient 0 in the former case and 1 in the latter. That is, B’s incentives point in a direction that produces (un)favourable outcomes for A. The Kosoy coefficient quantifies the (un)favourableness of the outcomes; we want something on top of that to express the (mis)alignment of the incentives.
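The suggested check (coefficient 0 under B's best response in constant-sum games, 1 in common-payoff games, whatever A plays) can be sketched as follows. The payoff matrix, the grid of mixed strategies for A, and the helper names are assumptions for illustration; exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction


def expected_by_column(u, alpha):
    """Expected payoff under alpha for each pure strategy (column) of B."""
    return [sum(a * row[j] for a, row in zip(alpha, u))
            for j in range(len(u[0]))]


def alignment_at_best_response(u_A, u_B, alpha):
    """Coefficient when B plays a u_B-best response to alpha (None if 0/0)."""
    col_A = expected_by_column(u_A, alpha)
    col_B = expected_by_column(u_B, alpha)
    worst, best = min(col_A), max(col_A)
    if best == worst:
        return None
    j_star = max(range(len(col_B)), key=lambda j: col_B[j])
    return (col_A[j_star] - worst) / (best - worst)


# Hypothetical payoffs for A; B's payoffs are derived from them.
u_A = [[3, 0],
       [1, 2]]
common_payoff_u_B = u_A                                   # u_B = u_A
constant_sum_u_B = [[3 - x for x in row] for row in u_A]  # u_A + u_B = 3

# A grid of mixed strategies for A: (k/10, 1 - k/10) for k = 0..10.
grid = [[Fraction(k, 10), 1 - Fraction(k, 10)] for k in range(11)]

common = {alignment_at_best_response(u_A, common_payoff_u_B, a) for a in grid}
constant = {alignment_at_best_response(u_A, constant_sum_u_B, a) for a in grid}
print(common, constant)

# At alpha = (1/4, 3/4) both columns give A the same payoff: undefined.
degenerate = alignment_at_best_response(
    u_A, constant_sum_u_B, [Fraction(1, 4), Fraction(3, 4)])
```

On this grid the coefficient is always 1 in the common-payoff game and always 0 in the constant-sum game; this only summarizes outcomes under a best-response assumption and is not itself a measure of how aligned the incentives are.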
(To be clear, of course it may be that what you were intending to ask for is exactly what Vanessa provided, and you have every right to be interested in whatever questions you’re interested in. I’m just trying to explain why the question Vanessa answered doesn’t feel to me like the key question if you’re asking about how well aligned one agent is with another in a particular context.)