Despite agreeing with your conclusion, I’m unconvinced by the reasons you propose. Sure, once the interface is chosen, the MDP is pretty much constrained by the real world (for a reasonable modeling process). But that just means the subjectivity comes from the choice of the interface!
To be more concrete, maybe the state space of Pacman could be just red-ghost, starting-state, and live-happily-ever-after (replacing the right-hand part of the MDP). Then taking the right action wouldn’t be power-seeking either.
What I think is happening here is that in reality, there is a tradeoff in modeling between simplicity/legibility/usability of the model (pushing for fewer states and fewer actions) and performance/competence/optimality (pushing for more states and actions to be able to capture more subtle cases). The fact that we want performance rules out my Pacman variant, and the fact that we want simplicity rules out ofer’s example.
It’s not clear to me that there is one true encoding that strikes a perfect balance, but I’m okay with the idea that there is an acceptable tradeoff, and that models around that point are mostly similar, in ways that probably don’t change the power-seeking conclusions.
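To make the contrast concrete, here is a rough sketch in code (the encodings and the option-counting proxy below are invented for illustration; this is neither the post’s actual Pacman MDP nor the formal POWER measure):

```python
# A crude sketch of the tradeoff above: the same underlying game, encoded two ways.
# "Power" is proxied here by the number of distinct terminal states reachable after
# an action (a stand-in for option-counting, not the paper's actual POWER measure).

# Hypothetical fine-grained encoding: going right keeps many futures open.
fine = {
    "start": {"left": ["red-ghost-game-over"],
              "right": ["corridor-A", "corridor-B", "power-pellet"]},
    "corridor-A": {"go": ["win", "blue-ghost-game-over"]},
    "corridor-B": {"go": ["win", "red-ghost-game-over"]},
    "power-pellet": {"go": ["win"]},
}

# The coarse encoding from my comment: three states, everything on the right collapsed.
coarse = {
    "starting-state": {"left": ["red-ghost"],
                       "right": ["live-happily-ever-after"]},
}

def reachable_terminals(model, state, action):
    """Count distinct terminal states reachable after taking `action` in `state`."""
    seen, frontier = set(), list(model[state][action])
    while frontier:
        s = frontier.pop()
        if s in model:                      # non-terminal: expand its successors
            for successors in model[s].values():
                frontier.extend(successors)
        else:
            seen.add(s)                     # terminal: count it once
    return len(seen)

for name, model, start in [("fine", fine, "start"), ("coarse", coarse, "starting-state")]:
    for action in model[start]:
        print(name, action, reachable_terminals(model, start, action))
# fine: "right" reaches 3 terminals vs. 1 for "left"; coarse: both reach exactly 1,
# so under this crude proxy neither action looks power-seeking in the coarse model.
```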
That’s also a claim that we can, in theory, specify reward functions which distinguish between 5 googolplex variants of red-ghost-game-over. If that were true, then yes—optimal policies really would tend to “die” immediately, since most reward functions would be maximized by one of those many terminal variants.
The “5 googolplex” claim is both falsifiable and false. Given an agent architecture (specifically, its state and action encodings), optimal policy tendencies are not subjective. We may be uncertain about the agent’s state and action encodings, but that doesn’t mean we can imagine whatever we want.
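To spell out why the claim is falsifiable, here is a minimal sketch (the one-shot setup and the i.i.d. uniform reward distribution are simplifying assumptions, not the actual setting of the theorems):

```python
import random

# Toy check: in a one-shot choice between "left" (n_left terminal states) and
# "right" (n_right terminal states), how often does the optimal policy go left
# when terminal rewards are drawn i.i.d. uniform on [0, 1]? Purely illustrative.

def p_optimal_goes_left(n_left, n_right, samples=20_000, rng=random.Random(0)):
    left_wins = 0
    for _ in range(samples):
        best_left = max(rng.random() for _ in range(n_left))
        best_right = max(rng.random() for _ in range(n_right))
        left_wins += best_left > best_right
    return left_wins / samples

# One game-over state on the left vs. many distinguishable futures on the right:
print(p_optimal_goes_left(1, 100))   # ~0.01: optimal policies rarely "die"
# If the encoding really distinguished vastly many game-over variants, it flips:
print(p_optimal_goes_left(100, 1))   # ~0.99: optimal policies tend to "die"
```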
Sure, but if you actually have to check for power-seeking in order to infer the structure of the MDP, it becomes unusable as a tool for avoiding building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking, IMO, is that we can start from models of the world and think about which actions/agents would be power-seeking, and for which rewards. If I actually have to run the optimal agents to find out which actions are power-seeking, then that doesn’t help.
I’m wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?
But that just means the subjectivity comes from the choice of the interface!
There’s no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.
Sure, but if you actually have to check for power-seeking in order to infer the structure of the MDP, it becomes unusable as a tool for avoiding building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking, IMO, is that we can start from models of the world and think about which actions/agents would be power-seeking, and for which rewards. If I actually have to run the optimal agents to find out which actions are power-seeking, then that doesn’t help.
You don’t have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.
I’m wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?
My current understanding is something like:
- There is not really a subjective modeling decision involved, because given an interface (state space and action space), the dynamics of the system are a real-world property we can check concretely.
- Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the “5-googolplex” one).
There’s no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.
I would say the choice of agent architecture is the subjective decision. That’s the point at which we decide what states and actions are possible, which completely determines the MDP. Granted, this argument is probably stronger for POMDPs (for which you have more degrees of freedom in observations), but I still see it for MDPs.
If you don’t think there is subjectivity involved, do you think that for whatever (non-formal) problem we might want to solve, there is only one way to encode it as a state space and action space? Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree with the latter, but then it’s a question of how the states of the actual system are encoded in the state space of the agent, and that doesn’t seem unique to me.
You don’t have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.
But to falsify the “5 googolplex” claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and observe what they do (to check that they indeed don’t power-seek by going left). That means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
(I continued this discussion with Adam in private—here are some thoughts for the public record)
- There is not really a subjective modeling decision involved, because given an interface (state space and action space), the dynamics of the system are a real-world property we can check concretely.
- Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the “5-googolplex” one).
I think I’m claiming the first bullet. I am not claiming the second.
Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree
Yes, that.
then it’s a question of how the states of the actual system are encoded in the state space of the agent, and that doesn’t seem unique to me.
It doesn’t have to be unique. We’re predicting “for the agents we build, will optimal policies in their MDP models seek power?”, and once you account for the environment dynamics, our beliefs about the agent architecture, and our beliefs about the reward functions conditional on each architecture, this prediction has no subjective degrees of freedom.
I’m not claiming that there’s One Architecture To Rule Them All. I’m saying that if we want to predict what happens, we:
1. Consider the underlying environment (assumed Markovian)
2. Consider different state/action encodings we might supply the agent
3. For each, fix a reward function distribution (what goals we expect to assign to the agent)
4. See what my theory predicts
There’s a further claim (which seems plausible, but which I’m not yet making) that (2) won’t affect (4) very much in practice. The point of this post is that if you say “the MDP has a different model”, you’re either disagreeing about the actual dynamics (1), or claiming that we will physically supply the agent with a different state/action encoding (2).
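Here is a minimal sketch of that recipe on a made-up toy MDP (the dynamics, the reward distribution, and the brute-force solver are illustrative assumptions, not the machinery of the theorems):

```python
import random

# A minimal sketch of the four-step recipe above, on a made-up deterministic MDP.
# The MDP, the reward distribution, and the brute-force "solver" are all
# illustrative assumptions.

# Steps 1-2: underlying dynamics, expressed through one candidate state/action
# encoding. transitions[state][action] -> next state.
transitions = {
    "start": {"left": "game-over", "right": "maze"},
    "maze":  {"up": "candy", "down": "power-pellet"},
}
terminals = ["game-over", "candy", "power-pellet"]

def optimal_first_action(reward, gamma=0.99):
    """Score each first action by the best discounted terminal reward reachable
    from it (exhaustive search, feasible only because the MDP is tiny)."""
    def best_return(state, t):
        if state in terminals:
            return gamma ** t * reward[state]
        return max(best_return(s2, t + 1) for s2 in transitions[state].values())
    return max(transitions["start"],
               key=lambda a: best_return(transitions["start"][a], 1))

# Step 3: a reward-function distribution (i.i.d. uniform over terminal states).
# Step 4: see what fraction of optimal policies take each first action.
rng = random.Random(0)
counts = {a: 0 for a in transitions["start"]}
for _ in range(10_000):
    reward = {s: rng.random() for s in terminals}
    counts[optimal_first_action(reward)] += 1
print(counts)   # "right" wins for roughly two-thirds of sampled reward functions
```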
But to falsify the “5 googolplex” claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and observe what they do (to check that they indeed don’t power-seek by going left). That means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
To falsify the “5 googolplex” claim, all you have to know is the dynamics + the agent’s observation and action encodings. That determines the MDP structure. You don’t have to run anything. (Although I suppose your proposed direction of inference is interesting: power-seeking tendencies + dynamics give you evidence about the encoding.)
The encodings + environment dynamics tell you what model the agent is interfacing with, which allows you to apply my theorems as usual.
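As a sketch of what “encodings + dynamics determine the agent’s model” means operationally (the underlying dynamics and the encoding map below are made up for illustration):

```python
# A sketch of "encodings + dynamics determine the model the agent interfaces with":
# push the underlying transitions through a (made-up) state encoding.

underlying = {                     # hypothetical true environment dynamics
    ("cell-3", "right"): "cell-4",
    ("cell-4", "right"): "ghost-tile",
    ("cell-4", "up"):    "pellet-tile",
}
encode = {                         # the agent's state encoding (a coarsening)
    "cell-3": "start", "cell-4": "corridor",
    "ghost-tile": "red-ghost-game-over", "pellet-tile": "powered-up",
}

agent_model = {}                   # the MDP the agent actually interfaces with
for (s, a), s_next in underlying.items():
    agent_model.setdefault(encode[s], {})[a] = encode[s_next]

print(agent_model)
# {'start': {'right': 'corridor'},
#  'corridor': {'right': 'red-ghost-game-over', 'up': 'powered-up'}}
# The theorems are then applied to this agent-facing model, not to a freely chosen one.
```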