I’ll take a shot at this. Let A and B be the sets of actions of Alice and Bob. Let on:B→{1,...,n} (where ‘n’ means ‘nice’) be the function that orders B by how good the choices are for Alice, assuming that Alice gets to choose second. Similarly, let os:B→{1,...,n} (where ‘s’ means ‘selfish’) be the function that orders B by how good the choices are for Bob, assuming that Alice gets to choose second. Choose some function ψ measuring similarity between two orderings of a finite set (it should range over [−1,1]); the alignment of Bob with Alice is then ψ(on,os).
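Here is a minimal sketch of how this could be computed, assuming ψ is Kendall’s tau and that “Alice chooses second” means Alice best-responds to each of Bob’s actions with her own utility-maximizing action. The function names and the representation of payoffs are my own illustration, not part of the proposal itself:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Similarity in [-1, 1] between two score vectors over the same finite set."""
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        d = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += 1 if d > 0 else -1 if d < 0 else 0
    return score / len(pairs)

def alignment(B, A, u_A, u_B):
    """Alignment of the B-player with the A-player: psi(on, os), where the
    A-player chooses second (best-responds with her own utility) and psi is
    Kendall's tau. Utility functions take arguments (a, b)."""
    def best_response(b):
        return max(A, key=lambda a: u_A(a, b))
    on  = [u_A(best_response(b), b) for b in B]  # how good each b is for the A-player
    os_ = [u_B(best_response(b), b) for b in B]  # how good each b is for the B-player
    return kendall_tau(on, os_)
```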
Example: in the prisoner’s dilemma, B={c,d}, and on orders c>d whereas os orders d>c. Hence ψ(on,os) should be −1, i.e., Bob is maximally unaligned with Alice. Note that this makes it different from Mykhailo’s answer, which gives alignment 0.5, i.e., medium aligned rather than maximally unaligned.
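For instance, with standard prisoner’s dilemma payoffs (the specific numbers are just for illustration), the sketch above gives −1 for Bob’s alignment with Alice:

```python
acts = ["c", "d"]
pd = {("c", "c"): (3, 3), ("c", "d"): (0, 5), ("d", "c"): (5, 0), ("d", "d"): (1, 1)}
u_alice = lambda a, b: pd[(a, b)][0]  # a = Alice's action, b = Bob's
u_bob   = lambda a, b: pd[(a, b)][1]

print(alignment(acts, acts, u_alice, u_bob))  # -1.0: Bob maximally unaligned with Alice
```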
This seems like an improvement over correlation since it’s not symmetrical. In the game where Alice and Bob both get to choose numbers x,y∈{1,2} and Alice’s utility function outputs y+x whereas Bob’s outputs y−x, Bob would be perfectly aligned with Alice (his on and os both order 2>1) but Alice perfectly unaligned with Bob (her on orders 1>2 but her os orders 2>1).
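Continuing the sketch, both directions can be computed by swapping which player is treated as the one being evaluated (again, my own illustration of the proposal):

```python
nums = [1, 2]
u_alice = lambda x, y: y + x  # Alice picks x, Bob picks y
u_bob   = lambda x, y: y - x

# Alignment of Bob with Alice: Alice best-responds to each y.
print(alignment(nums, nums, u_alice, u_bob))  # +1.0

# Alignment of Alice with Bob: Bob best-responds to each x.
print(alignment(nums, nums,
                lambda y, x: u_bob(x, y),      # second mover's (Bob's) utility
                lambda y, x: u_alice(x, y)))   # evaluated player's (Alice's) utility
# -1.0
```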
I believe this metric meets criteria 1, 3, and 4 you listed. It could be made sensitive to players’ decision theories by changing os (for alignment of Bob with Alice) to be the order output by Bob’s decision theory, but I think that would be a mistake. Suppose I build an AI that is more powerful than myself, and the game is such that we can both decide to steal some of the other’s stuff. If the AI does this, it leads to −10 utils for me and +2 for it (otherwise 0/0); if I do it, it leads to −100 utils for me because the AI kills me in response (otherwise 0/0). This game is trivial: the AI will take my stuff and I’ll do nothing. Also, the AI is maximally unaligned with me. Now suppose I become as powerful as the AI, so that my ‘take the AI’s stuff’ action becomes −10 for the AI and +2 for me. This makes the game a prisoner’s dilemma. If we both run UDT or FDT, we would now cooperate. If os is the ordering output by the AI’s decision theory, this would mean the AI is now aligned with me, which is odd since the only thing that changed is me getting more powerful. With the original proposal, the AI is still maximally unaligned with me. More abstractly, game theory assumes your actions have an influence on the other player’s rewards (else the game is trivial), so if you cooperate for game-theoretic reasons, that doesn’t seem to capture what we mean by alignment.
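To make that last point concrete, here is the stage of the game after I become powerful, under the original proposal, with hypothetical payoff numbers (I’m assuming the two thefts simply add when we both steal); the AI’s alignment with me comes out −1 regardless of which decision theory it runs:

```python
acts = ["refrain", "steal"]
# Payoffs (me, AI): each theft costs the victim 10 and gains the thief 2;
# effects are assumed to add when both steal.
game = {
    ("refrain", "refrain"): (0, 0),
    ("refrain", "steal"):   (-10, 2),
    ("steal",   "refrain"): (2, -10),
    ("steal",   "steal"):   (-8, -8),
}
u_me = lambda a, b: game[(a, b)][0]  # a = my action, b = the AI's
u_ai = lambda a, b: game[(a, b)][1]

print(alignment(acts, acts, u_me, u_ai))  # -1.0: still maximally unaligned
```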