Exciting to see new people tackling AI Alignment research questions! (and I’m already excited by what Alex is doing, so him having more people work in his kind of research feels like a good thing).
That being said, I’m a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don’t feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I’m just not seeing how important it is, but then it is not obvious from the post alone.
On the positive side, it was quite agreeable to read, and I followed all the formal parts. My only criticism of the form is that I would have liked a statement of what will be proved/done in the post upfront, instead of having to wait until the last section.
This might be harsh criticism, but I really encourage you to keep working in the field, and hopefully prove me wrong by expanding on this work in more advanced and exciting ways.
Alternatively, imagine that your team spends the meeting breaking your knees and your laptop.
This is an example of wit done well in a “serious” post. I approve.
Strategies (technically, mixed strategies) in a Bayesian game are given by functions $\sigma_i : T_i \to \Delta A_i$. Thus, even given a fixed strategy profile $\sigma$, any notion of “expected reward of an action” will have to account for uncertainty in other players’ types. We do so by defining interim expected utility for player $i$ as follows:
$f_i(t_i, a_i, \sigma_{-i}) := \mathbb{E}[r_i(t_i, a)]$
You haven’t defined $\sigma_{-i}$ at that point, and you don’t introduce the indexing $-i$ for other strategies before the next line. So is this a typo (where you wanted to write $\sigma_i$) or am I just misunderstanding the formula? I’m even more confused because you use $\sigma_{-i}$ to compute $a_{-i}$, and so if it’s not a typo this means that your interim utility considers that every other agent uses the same strategy?
Coming back after reading more, do you use $\sigma_{-i}$ to mean “the strategy profile for every process except $i$”? That would make more sense of the formulas (since you fix $a_i$, there’s no reason to have a $\sigma_i$) but if it’s the case, then this notation is horrible (no offense).
By the way, indexing the other strategies by $-i$ instead of, let’s say, $j$ or $k$ is quite unconventional and confusing.
It initially seems unintuitive that as players’ strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like “as your strategy improves, other players lose the power to capitalize off of your mistakes”.
I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals.
Thanks so much for your comment! I’m going to speak for myself here, and not for Jacob.
That being said, I’m a bit underwhelmed by this post. Not that I think the work is wrong, but it looks like it boils down to saying (with a clean formal shape) things that I personally find pretty obvious: playing better at a zero-sum (or constant-sum) game means that the other players have less margin to get what they want. I don’t feel that either the formalization of power or the theorem brings me any new insight, and so I have trouble getting interested. Maybe I’m just not seeing how important it is, but then it is not obvious from the post alone.
I think this is an understandable reaction. I personally feel excited by the formalism and theorem and I’ll try to explain why.
Coming off of Optimal Policies Tend to Seek Power last summer, I felt like I understood single-agent Power reasonably well (at that point in time, I had already dropped the assumption of optimality). Last summer, “understand multi-agent power” was actually the project I intended to work on under Andrew Critch. I ended up understanding defection instead (and how it wasn’t necessarily related to Power-seeking), and corrigibility-like properties, and further expanding the single-agent results. But I was still pretty confused about the multi-agent situation.
The crux was, in an MDP, you’ve got a state, and it’s pretty clear what an agent can do. But in the multi-agent case, now you’ve got other reasoners, and now you have to account for their influence. So at first I thought,
maybe Power is about being able to enforce your will even against the best efforts of the other players
which would correspond to everyone else minmax-ing you on any goal you chose. But this wasn’t quite right. I thought about this for a while, and I didn’t make much progress, and somehow I didn’t come up with the formalism in this post until this winter when I started working with Jacob. In hindsight, maybe it’s obvious:
in an MDP, the relevant “situation” is the current state; measure the agent’s average optimal value at that state.
in a non-iterated multi-agent game, the relevant “situation” is just the other players’ strategy profile; measure your average maximum reward, assuming everyone else follows the strategy profile.
This should extend naturally into Bayesian stochastic games, to account for sequential decision-making and truly generalize the MDP results.
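To make “measure your average maximum reward, assuming everyone else follows the strategy profile” concrete, here is a minimal Python sketch. It is a simplification and not the post’s exact definition: it assumes two players, a finite set of types for the player being measured, and an opponent whose mixed strategy does not depend on the opponent’s own type. The function names and the example payoffs are hypothetical, chosen only for illustration.

```python
def expected_rewards(payoff, opponent_strategy):
    """Expected reward of each own action against a fixed opponent mixed strategy.

    payoff[a][b] is the reward for playing own action a when the opponent plays b.
    """
    return [sum(r * q for r, q in zip(row, opponent_strategy)) for row in payoff]


def power(type_probs, payoff_by_type, opponent_strategy):
    """Average over own types of the best attainable expected reward,
    holding the opponent's strategy fixed (simplified, illustrative definition)."""
    return sum(
        prob * max(expected_rewards(payoff, opponent_strategy))
        for prob, payoff in zip(type_probs, payoff_by_type)
    )


# Two hypothetical types: one is rewarded for matching the opponent's action,
# the other for mismatching it.
MATCH = [[1, 0], [0, 1]]
MISMATCH = [[0, 1], [1, 0]]

predictable = power([0.5, 0.5], [MATCH, MISMATCH], [0.9, 0.1])
unpredictable = power([0.5, 0.5], [MATCH, MISMATCH], [0.5, 0.5])
print(predictable, unpredictable)
```

In this toy setup the measured player’s Power depends only on the opponent’s strategy: a predictable opponent leaves more to capitalize on than a uniformly random one.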
But for me, I was excited about the Power formalism when (IIRC) I proposed to Jacob that we prove results about that formalism. Jacob was the one who formulated the theorem, and I actually didn’t buy it at first; my naive intuition was that Power should always be constant when summed over players who have their types drawn from the constant-sum distribution. This was wrong, so I was pretty surprised.
But the thing I’m most excited about is how I had this intuitive picture of “if your goals are unaligned, then in worlds like ours, one person gaining power means other people must lose power, after ‘some point’.”
Intuitively this seems obvious, just like the community knew about instrumental convergence before my formal results. But I’m excited now that we can prove the intuitively correct conclusion, using a notion of Power that mirrors the one used in the single-agent case for the existing power-seeking results. And this wasn’t obvious to me, at least.
-----
That said, there are some “logical time” aspects of “game theoretic power” that we don’t cover, and aren’t trying to cover. For example, some decision-making algorithms might be really good at ensuring they come “first” in logical time, which gives them a kind of power over other reasoners in the game. For example, if you’re convinced that I’ll tear off my steering wheel in Chicken, then I precede you in logical time and bully you into swerving, and I benefit from this greatly.
I think this is an intriguing problem, but out of scope: we want to understand competition over “resources” (whatever kind of thing that is), and how one player “gaining power” can make other players “lose power.”
Thanks for the detailed reply!
I want to go a bit deeper into the fine points, but my general reaction is “I wanted that in the post”. You make a pretty good case for a way to arrive at this definition that makes it particularly exciting. On the other hand, I don’t think that stating a definition and proving a single theorem that has the “obvious” quality (whether or not it is actually obvious, mind you) is that convincing.
The best way to describe my interpretation is that I feel that you two went for the “scientific paper” style, but the current state of the research, as well as the argument for its value, are a better fit for the “here’s-a-cool-formal-idea blogpost or workshop paper”. And that’s independent of the importance of the result. To put it differently, I’m ready to accept the importance of a formalism without much explanation of why I should care if it shows a lot of cool results, but when the results are few, I need a more detailed story of why I should care.
About your specific story now:
Coming off of Optimal Policies Tend to Seek Power last summer, I felt like I understood single-agent Power reasonably well (at that point in time, I had already dropped the assumption of optimality). Last summer, “understand multi-agent power” was actually the project I intended to work on under Andrew Critch. I ended up understanding defection instead (and how it wasn’t necessarily related to Power-seeking), and corrigibility-like properties, and further expanding the single-agent results. But I was still pretty confused about the multi-agent situation.
Nothing to say here, except that you have the frustrating (for me) ability to make me want to read 5 of your posts in detail when explaining something completely different. I am also supposed to do my own research, you know? (Related: I’d be excited to review one of your posts as part of the review project we’re doing with a bunch of other researchers. Not sure which of your posts would be most appropriate though. If you have some idea, you can post it here. ;) )
The crux was, in an MDP, you’ve got a state, and it’s pretty clear what an agent can do. But in the multi-agent case, now you’ve got other reasoners, and now you have to account for their influence. So at first I thought,
maybe Power is about being able to enforce your will even against the best efforts of the other players
which would correspond to everyone else minmax-ing you on any goal you chose. But this wasn’t quite right. I thought about this for a while, and I didn’t make much progress, and somehow I didn’t come up with the formalism in this post until this winter when I started working with Jacob. In hindsight, maybe it’s obvious:
in an MDP, the relevant “situation” is the current state; measure the agent’s average optimal value at that state.
in a non-iterated multi-agent game, the relevant “situation” is just the other players’ strategy profile; measure your average maximum reward, assuming everyone else follows the strategy profile.
This should extend naturally into Bayesian stochastic games, to account for sequential decision-making and truly generalize the MDP results.
When phrased that way, I think my “issue” is that the subtlety you add is mostly hidden within the additional parameter of the strategy profile. That is, with the original intuition, you don’t have to find out what the other players will actually do; here you kind of have to. It’s a good thing as I agree with you that it makes the intuition subtler, but it also creates a whole new complex problem of inferring strategies.
At this point, I went to reread the last sections, and realized that you’re partially dealing with my problem by linking power with well-known strategy profiles (the Nash equilibria).
But for me, I was excited about the Power formalism when (IIRC) I proposed to Jacob that we prove results about that formalism. Jacob was the one who formulated the theorem, and I actually didn’t buy it at first; my naive intuition was that Power should always be constant when summed over players who have their types drawn from the constant-sum distribution. This was wrong, so I was pretty surprised.
This part pushed me to reread the statements in detail. If I get it correctly, you had the intuition that the power behaved like “will this player win”, whereas it actually works as “keeping everything else fixed, how well can this player end up”. The trick that makes the theorem true and the power bigger than the sum is that for a strategy profile that isn’t a Nash equilibrium, multiple players might get a lot if they change their action in turn while keeping everything else fixed.
I’m a bit ashamed, because that’s actually explained in the intuition of the proof, but I didn’t get it on the first reading. I also see now that it was the point of the discussion before the theorem, but that part flew over my head. So my advice for this would be to explain in even more detail the initial intuition and why it is wrong, including where in the maths this happens (the fixing of $\sigma_{-i}$).
My updated take after getting this point is that I’m a bit more excited about your formalism.
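The “keeping everything else fixed, how well can this player end up” reading can be checked on a toy game. The sketch below is a simplification under stated assumptions: it uses matching pennies with a single fixed type per player (so no averaging over types), and treats each player’s Power as their best-response value against the other’s fixed mixed strategy. The names `best_response_value` and `total_power` are hypothetical.

```python
def best_response_value(payoff, opponent_strategy):
    """Best expected payoff over own pure actions, opponent strategy held fixed.

    payoff[a][b] is the payoff for own action a against opponent action b.
    """
    return max(
        sum(r * q for r, q in zip(row, opponent_strategy)) for row in payoff
    )


# Matching pennies, a zero-sum game (actions: heads, tails).
P1_PAYOFF = [[1, -1], [-1, 1]]   # player 1 wins by matching
P2_PAYOFF = [[-1, 1], [1, -1]]   # player 2 wins by mismatching; rows are player 2's actions


def total_power(p, q):
    """Sum of both players' best-response values; p and q are the mixed
    strategies of players 1 and 2 respectively."""
    return best_response_value(P1_PAYOFF, q) + best_response_value(P2_PAYOFF, p)


at_equilibrium = total_power([0.5, 0.5], [0.5, 0.5])   # the unique Nash equilibrium
both_play_heads = total_power([1, 0], [1, 0])          # not an equilibrium
print(at_equilibrium, both_play_heads)
```

At the equilibrium the total equals the game’s constant sum (zero here). When both players always play heads, each could do well by deviating alone, so the two best-response values add up to more than the constant.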
But the thing I’m most excited about is how I had this intuitive picture of “if your goals are unaligned, then in worlds like ours, one person gaining power means other people must lose power, after ‘some point’.”
Intuitively this seems obvious, just like the community knew about instrumental convergence before my formal results. But I’m excited now that we can prove the intuitively correct conclusion, using a notion of Power that mirrors the one used in the single-agent case for the existing power-seeking results. And this wasn’t obvious to me, at least.
I agree that this is exciting, but this is only mentioned in the last line of the post, as one perspective among others. Notably, it wasn’t clear at all that this was the main application of this work.
Thank you so much for the comments! I’m pretty new to the platform (and to EA research in general), so feedback is useful for getting a broader perspective on our work.
To add to TurnTrout’s comments about power-scarcity and the CCC, I’d say that the broader vision of the multi-agent formulation is to establish a general notion of power-scarcity as a function of “similarity” between players’ reward functions (I mention this in the post’s final notes). In this paradigm, the constant-sum case is one limiting case of “general power-scarcity”, which I see as the “big idea”. As a simple example, general power-scarcity would provide a direct motivation for fearing robustly instrumental goals, since we’d have reason to believe an AI with goals orthogonal(ish) to human goals would be incentivized to compete with humanity for Power.
We’re planning to continue investigating multi-agent Power and power-scarcity, so hopefully we’ll have a more fleshed-out notion of general power-scarcity in the months to come.
Also, re: “as players’ strategies improve, their collective Power tends to decrease”, I think your intuition is correct? Upon reflection, the effect can be explained reasonably well by “improving your actions has no effect on your Power, but a negative effect on opponents’ Power”.
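One way to see “no effect on your Power, but a negative effect on opponents’ Power” is a small rock-paper-scissors sweep. This is only a sketch under assumptions that are not from the post: a single fixed type per player, Power taken as the best-response value against the other player’s fixed mixed strategy, and “improving” read as “becoming less exploitable”. All names are hypothetical.

```python
# Rock-paper-scissors payoffs for the acting player: rows are the player's own
# action, columns are the opponent's action (order: rock, paper, scissors).
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]


def best_response_value(payoff, opponent_strategy):
    """Best expected payoff over own pure actions, opponent strategy held fixed."""
    return max(
        sum(r * q for r, q in zip(row, opponent_strategy)) for row in payoff
    )


def toward_uniform(t):
    """Blend 'always rock' (t = 0) with the uniform equilibrium mix (t = 1)."""
    return [(1 - t) + t / 3, t / 3, t / 3]


def powers(t, q):
    """Return (player 1's Power, player 2's Power) when player 1 plays
    toward_uniform(t) and player 2 plays the fixed mixed strategy q."""
    p = toward_uniform(t)
    return best_response_value(RPS, q), best_response_value(RPS, p)


q = [0.0, 1.0, 0.0]  # player 2 always plays paper, held fixed throughout
for t in (0.0, 0.5, 1.0):
    own, opponent = powers(t, q)
    print(f"t={t}: player 1 Power {own:.2f}, player 2 Power {opponent:.2f}")
```

Player 1’s Power depends only on `q`, so it stays put as `t` changes, while player 2’s Power shrinks as player 1’s strategy becomes harder to exploit.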
Glad to be helpful!
I go into more detail in my answer to Alex, but what I want to say here is that I don’t feel like you use the power-scarcity idea enough in the post itself. As you said, it’s one of three final notes, and without any emphasis on it.
So while I agree that power-scarcity is an important research question, it would be helpful IMO if this post put more emphasis on that connection.
Probably going to reply to the rest later (and midco can as well, of course), but regarding:
Coming back after reading more, do you use $\sigma_{-i}$ to mean “the strategy profile for every process except $i$”? That would make more sense of the formulas (since you fix $a_i$, there’s no reason to have a $\sigma_i$) but if it’s the case, then this notation is horrible (no offense).
By the way, indexing the other strategies by $-i$ instead of, let’s say, $j$ or $k$ is quite unconventional and confusing.
Using “$\sigma_{-i}$” to mean “the strategy profile of everyone but player $i$” is common notation; I remember it being used in 2-3 game theory textbooks I read, and you can see its prominence by consulting the Wikipedia page for Nash equilibrium.
Do I agree this is horrible notation? Meh. I don’t know. But it’s not a convention we pioneered in this work.
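For readers who haven’t met the “everyone but player i” convention, a minimal Python sketch may help. The representation (a list of mixed strategies, one per player) and the helper name `split_profile` are assumptions made for illustration, not anything from the post.

```python
from typing import List, Sequence, Tuple

Strategy = Sequence[float]  # a mixed strategy: probabilities over one player's actions


def split_profile(profile: List[Strategy], i: int) -> Tuple[Strategy, List[Strategy]]:
    """Split a strategy profile into player i's strategy and everyone else's.

    The second element plays the role of "sigma minus i": the profile with
    player i's entry removed, other players kept in their original order.
    """
    sigma_i = profile[i]
    sigma_minus_i = profile[:i] + profile[i + 1:]
    return sigma_i, sigma_minus_i


# Three players with two actions each (hypothetical numbers).
profile = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
sigma_1, sigma_minus_1 = split_profile(profile, 1)
print(sigma_1, sigma_minus_1)
```

So the subscript names the one player who is left out; it is not an index ranging over the other players.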
Ok, that’s fair. It’s hard to know which notation is common knowledge, but I think that adding a sentence explaining this one will help readers who haven’t studied game theory formally.
Maybe making all vector profiles bold (like for the action profile) would help to see at a glance the type of the parameter. If I had seen it was a strategy profile, I would have inferred immediately what it meant.
It initially seems unintuitive that as players’ strategies improve, their collective Power tends to decrease. The proximate cause of this effect is something like “as your strategy improves, other players lose the power to capitalize off of your mistakes”.
“I disagree. The whole point of a zero-sum game (or even constant sum game) is that not everyone can win. So playing better means quite intuitively that the others can be less sure of accomplishing their own goals.”
IMO, the unintuitive and potentially problematic thing is not that in a zero-sum game playing better makes things worse for everybody else. That part is fine. The unintuitive and potentially problematic thing is that, according to this formalism, the total collective Power is greater the worse everybody plays. This seems adjacent to saying that everybody would be better off if everyone played poorly, which is true in some games (maybe) but definitely not true in zero-sum games. (Right? This isn’t my area of expertise)
EDIT: Currently I suppose what you’d say is that power =/= utility, and so even though we’d all have more power if we were all less competent, we wouldn’t actually be better off. But perhaps a better way forward would be to define a new concept of “Useful power” or something like that, which equals your share of the total power in a zero-sum game. Then we could say that everyone getting less competent wouldn’t result in everyone becoming more usefully-powerful, which seems like an important thing to be able to say. Ideally we could just redefine power that way instead of inventing a new concept of useful power, but maybe that would screw up some of your earlier theorems?
But perhaps a better way forward would be to define a new concept of “Useful power” or something like that, which equals your share of the total power in a zero-sum game.
I don’t see why useful power is particularly useful, since it’s taking a non-constant-sum quantity (outside of Nash equilibria) and making it constant-sum, which seems misleading.
But I also don’t see a problem with the “better play → less exploitability → less total Power” reasoning. This feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.
this feels like a situation where our naive intuitions about power are just wrong, and if you think about it more, the formal result reflects a meaningful phenomenon.
Different strokes for different folks, I guess. It feels very different to me.