I’m wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?
> But that just means the subjectivity comes from the choice of the interface!
There’s no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.
> Sure, but if you actually have to check the power-seeking behavior to infer the structure of the MDP, the formalization becomes unusable for avoiding building power-seeking AGIs. Or put differently, the value of your formalization of power-seeking, IMO, is that we can start from models of the world and think about which actions/agents would be power-seeking, and for which rewards. If I actually have to run the optimal agents to find out which actions are power-seeking, then that doesn’t help.
You don’t have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.
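As a toy illustration of that last claim (a minimal sketch with invented dynamics and an iid-uniform reward distribution, not the setup from the post): once the agent-facing MDP is pinned down, what optimal policies tend to do can be computed by planning in the model, without ever running an agent.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 5, 2, 0.9
# A fully specified agent-facing MDP: the transition tensor T[s, a, s'] is fixed
# once the environment dynamics and the agent's state/action encoding are fixed.
# (Random toy dynamics here, standing in for that determination.)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def optimal_first_action(reward, iters=200):
    """Value-iterate on the model and return the optimal action at state 0."""
    v = np.zeros(n_states)
    for _ in range(iters):
        q = reward[:, None] + gamma * T @ v   # q[s, a]
        v = q.max(axis=1)
    return int(q[0].argmax())

# Sample reward functions from the assumed distribution (iid uniform, a placeholder).
rewards = [rng.uniform(size=n_states) for _ in range(2000)]
counts = np.bincount([optimal_first_action(r) for r in rewards], minlength=n_actions)
print("fraction of sampled rewards for which each action is optimal at s0:",
      counts / counts.sum())
```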
> I’m wondering whether I properly communicated my point. Would you be so kind as to summarize my argument as best you understand it?
My current understanding is something like:
- There is not really a subjective modeling decision involved because, given an interface (state space and action space), the dynamics of the system are a real-world property we can look for concretely.
- Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the “5 googolplex” one).
> There’s no subjectivity? The interface is determined by the agent architecture we use, which is an empirical question.
I would say the choice of agent architecture is the subjective decision. That’s the point at which we decide what states and actions are possible, which completely determines the MDP. Granted, this argument is probably stronger for POMDPs (where you have more degrees of freedom in the observations), but I still see it for MDPs.
If you don’t think there is subjectivity involved, do you think that for whatever (non-formal) problem we might want to solve, there is only one way to encode it as a state space and an action space? Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree with the latter, but then it’s a question of how the states of the actual systems are encoded in the state space of the agent, and that doesn’t seem unique to me.
> You don’t have to run anything to check power-seeking. Once you know the agent encodings, the rest is determined and my theory makes predictions.
But to falsify the “5 googolplex” claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and know what they do (to check that they indeed don’t power-seek by going left). This means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
(I continued this discussion with Adam in private—here are some thoughts for the public record)
> - There is not really a subjective modeling decision involved because, given an interface (state space and action space), the dynamics of the system are a real-world property we can look for concretely.
> - Claims about the encoding/modeling can be resolved thanks to power-seeking, which predicts what optimal policies are more likely to do. So with enough optimal policies, we can check the claim (like the “5 googolplex” one).
I think I’m claiming the first bullet. I am not claiming the second.
> Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree
Yes, that.
> then it’s a question of how the states of the actual systems are encoded in the state space of the agent, and that doesn’t seem unique to me.
It doesn’t have to be unique. We’re predicting “for the agents we build, will optimal policies in their MDP models seek power?”, and once you account for the environment dynamics, our beliefs about the agent architecture, and our beliefs about the reward functions conditional on each architecture, this prediction has no subjective degrees of freedom.
I’m not claiming that there’s One Architecture To Rule Them All. I’m saying that if we want to predict what happens, we:
1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings we might supply the agent.
3. For each encoding, fix a reward function distribution (what goals we expect to assign to the agent).
4. See what my theory predicts.
There’s a further claim (which seems plausible, but which I’m not yet making) that (2) won’t affect (4) very much in practice. The point of this post is that if you say “the MDP has a different model”, you’re either disagreeing with the actual dynamics (1), or claiming that we will physically supply the agent with a different state/action encoding (2).
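To make steps (1) through (4) concrete, here is a hedged sketch: the tiny environment, the two candidate encodings (“full” and “coarse”), and the uniform reward distribution are all invented placeholders, and the point is only the shape of the pipeline, not any real architecture or result.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9

# (1) Underlying Markovian environment: 6 configurations, 2 low-level actions,
#     deterministic dynamics chosen arbitrarily for illustration.
N_CONFIGS, N_ACTIONS = 6, 2
step = {(c, a): (c + (1 if a == 0 else 2)) % N_CONFIGS
        for c, a in itertools.product(range(N_CONFIGS), range(N_ACTIONS))}

# (2) Candidate state encodings the agent might be supplied with (invented).
encodings = {
    "full": lambda c: c,          # agent sees the exact configuration
    "coarse": lambda c: c // 2,   # agent lumps configurations into 3 states
}

def induced_model(encode):
    """Agent-facing transition model implied by dynamics + encoding,
    averaging uniformly over the configurations lumped into each state
    (an assumption of this toy)."""
    n = len({encode(c) for c in range(N_CONFIGS)})
    T = np.zeros((n, N_ACTIONS, n))
    for (c, a), c2 in step.items():
        T[encode(c), a, encode(c2)] += 1
    return T / T.sum(axis=2, keepdims=True)

def optimal_first_action(T, reward, iters=200):
    """Value iteration on the induced model; optimal action at encoded state 0."""
    v = np.zeros(T.shape[0])
    for _ in range(iters):
        q = reward[:, None] + gamma * T @ v
        v = q.max(axis=1)
    return int(q[0].argmax())

for name, encode in encodings.items():
    T = induced_model(encode)
    # (3) Reward-function distribution over the *encoded* states (iid uniform here).
    rewards = [rng.uniform(size=T.shape[0]) for _ in range(2000)]
    # (4) The prediction: how often each first action is optimal under that distribution.
    counts = np.bincount([optimal_first_action(T, r) for r in rewards],
                         minlength=N_ACTIONS)
    print(name, counts / counts.sum())
```

Whether the “full” and “coarse” rows come out close is the kind of empirical question behind the further claim that (2) does not move (4) very much.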
> But to falsify the “5 googolplex” claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and know what they do (to check that they indeed don’t power-seek by going left). This means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
To falsify “5 googolplex”, all you have to know is the dynamics + the agent’s observation and action encodings. That determines the MDP structure. You don’t have to run anything. (Although I suppose your proposed direction of inference is interesting: power-seeking tendencies + dynamics give you evidence about the encoding)
The encodings + environment dynamics tell you what model the agent is interfacing with, which allows you to apply my theorems as usual.
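As a minimal structural sketch of that composition, again with invented dynamics and an invented encoding: the agent-facing model is read directly off dynamics plus encoding, and a claim about its structure is checked without computing or running any policy. The reachable-state count below is only a crude stand-in for the quantities the theorems actually use.

```python
# Toy environment dynamics over 8 underlying configurations and 2 actions,
# invented purely for illustration: "left" funnels everything into configuration 0,
# "right" cycles onward.
N = 8

def dynamics(c, a):
    return 0 if a == 0 else (c + 1) % N

# The agent's (hypothetical) observation encoding: it only distinguishes
# configurations mod 4.
def encode(c):
    return c % 4

# The model the agent interfaces with is determined by dynamics + encoding alone.
states = sorted({encode(c) for c in range(N)})
model = {(s, a): {encode(dynamics(c, a)) for c in range(N) if encode(c) == s}
         for s in states for a in (0, 1)}

def reachable(start, first_action, horizon=3):
    """Encoded states reachable within `horizon` steps after taking `first_action`."""
    frontier = set(model[(start, first_action)])
    seen = set(frontier)
    for _ in range(horizon - 1):
        frontier = {s2 for s in frontier for a in (0, 1) for s2 in model[(s, a)]}
        seen |= frontier
    return seen

# Structural comparison: no policies computed, nothing run in the environment.
for a, name in [(0, "left"), (1, "right")]:
    print(f"{name}: {len(reachable(0, a))} encoded states reachable within 3 steps")
```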