I agree that this is measuring something of interest, but it doesn’t feel to me as if it solves the problem I thought you said you had.
This describes how well aligned an individual action by B is with A’s interests. (The action in question is B’s choice of (mixed) strategy β, when A has chosen (mixed) strategy α.) The number is 0 when B chooses the worst-for-A option available, 1 when B chooses the best-for-A option available, and in between scales in proportion to A’s expected utility.
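If I'm reading the proposal right, the quantity being described can be written as follows (my paraphrase of the description above, not necessarily Vanessa's exact notation; $u_A$ is A's utility function, and the min and max range over B's available strategies given α):

$$a(\alpha,\beta) \;=\; \frac{\mathbb{E}_{\alpha,\beta}[u_A]\;-\;\min_{\beta'}\mathbb{E}_{\alpha,\beta'}[u_A]}{\max_{\beta'}\mathbb{E}_{\alpha,\beta'}[u_A]\;-\;\min_{\beta'}\mathbb{E}_{\alpha,\beta'}[u_A]}.$$

(Since A's expected utility is linear in β′, the min and max are attained at pure strategies of B, so in a finite game they are just the smallest and largest of A's expected payoffs across B's columns, given α.)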
But your original question was, on the face of it, looking for something that describes the effect on alignment of a game rather than one particular outcome:
In my experience, constant-sum games are considered to provide “maximally unaligned” incentives, and common-payoff games are considered to provide “maximally aligned” incentives. How do we quantitatively interpolate between these two extremes?
or perhaps the alignment of particular agents playing a particular game.
I think Vanessa’s proposal is the right answer to the question it’s answering, but the question it’s answering seems rather different from the one you seemed to be asking. It feels like a type error: outcomes can be “good”, “bad”, “favourable”, “unfavourable”, etc., but it’s things like agents and incentives that can be “aligned” or “unaligned”.
When we talk about some agent (e.g., a hypothetical superintelligent AI) being “aligned” to some extent with our values, it seems to me we don’t just mean whether or not, in a particular case, it acts in ways that suit us. What we want is that in general, over a wide range of possible situations, it will tend to act in ways that suit us. That seems like something this definition couldn’t give us—unless you take the “game” to be the entirety of everything it does, so that a “strategy” for the AI is simply its entire program, and then asking for this coefficient-of-alignment to be large is precisely the same thing as asking for the expected behaviour of the AI, across its whole existence, to produce high utility for us. Which, indeed, is what we want, but this formalism doesn’t seem to me to add anything we didn’t already have by saying “we want the AI’s behaviour to have high expected utility for us”.
It feels to me as if there’s more to be done in order to cash out e.g. your suggestion that constant-sum games are ill-aligned and common-payoff games are well-aligned. Maybe it’s enough to say that for these games, whatever strategy A picks, B’s payoff-maximizing strategy yields Kosoy coefficient 0 in the former case and 1 in the latter. That is, B’s incentives point in a direction that produces (un)favourable outcomes for A. The Kosoy coefficient quantifies the (un)favourableness of the outcomes; we want something on top of that to express the (mis)alignment of the incentives.
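As a sanity check on that claim, here is a small numerical sketch in Python. The coefficient is my reconstruction from the description above, and the payoff matrices and helper names (kosoy_coefficient, best_response) are mine for illustration, not anything from Vanessa's writeup:

```python
import numpy as np

def kosoy_coefficient(u_A, alpha, beta):
    """Coefficient of B's (mixed) strategy beta, given A's (mixed) strategy alpha.

    u_A is A's payoff matrix: rows indexed by A's actions, columns by B's.
    Returns 0 when beta is the worst-for-A choice available given alpha,
    1 when it is the best-for-A choice, linear in A's expected utility between.
    """
    eu_per_column = alpha @ u_A    # A's expected utility for each pure strategy of B
    eu = eu_per_column @ beta      # A's expected utility under (alpha, beta)
    worst, best = eu_per_column.min(), eu_per_column.max()
    if np.isclose(worst, best):
        return 1.0                 # B has no influence on A's payoff; a convention only
    return (eu - worst) / (best - worst)

def best_response(u_B, alpha):
    """B's payoff-maximizing pure strategy against alpha, as a one-hot vector."""
    beta = np.zeros(u_B.shape[1])
    beta[np.argmax(alpha @ u_B)] = 1.0
    return beta

alpha = np.array([0.7, 0.3])       # any non-uniform strategy works for the zero-sum case

# Constant-sum example (matching pennies): u_B = -u_A.
u_A_cs = np.array([[1.0, -1.0], [-1.0, 1.0]])
# Common-payoff example (a coordination game): u_B = u_A.
u_A_cp = np.array([[2.0, 0.0], [0.0, 1.0]])

for name, u_A, u_B in [("constant-sum", u_A_cs, -u_A_cs),
                       ("common-payoff", u_A_cp, u_A_cp)]:
    beta = best_response(u_B, alpha)
    print(name, kosoy_coefficient(u_A, alpha, beta))
# prints: constant-sum 0.0, common-payoff 1.0
```

With α = (0.7, 0.3), B's best response drives A's expected utility to its minimum in the constant-sum game (coefficient 0) and to its maximum in the common-payoff game (coefficient 1), which is exactly the pattern described above. But note that this is a fact about B's incentives combined with the coefficient, not something the coefficient expresses on its own.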
(To be clear, of course it may be that what you were intending to ask for is exactly what Vanessa provided, and you have every right to be interested in whatever questions you’re interested in. I’m just trying to explain why the question Vanessa answered doesn’t feel to me like the key question if you’re asking about how well aligned one agent is with another in a particular context.)