(I continued this discussion with Adam in private—here are some thoughts for the public record)
There isn't really a subjective modeling decision involved: given an interface (state space and action space), the dynamics of the system are a real-world property we can go and check concretely.
Claims about the encoding/modeling can be resolved using power-seeking, which predicts what optimal policies are more likely to do. So, given enough optimal policies, we can check a claim like the “5 googolplex” one.
I think I’m claiming the first bullet. I am not claiming the second.
Or are you pointing out that with an architecture in mind, the state space and action space are fixed? I agree.
Yes, that.
Then it’s a question of how the states of the actual system are encoded in the state space of the agent, and that doesn’t seem unique to me.
It doesn’t have to be unique. We’re predicting “for the agents we build, will optimal policies in their MDP models seek power?”, and once you account for the environment dynamics, our beliefs about the agent architecture, and our beliefs about the reward functions conditional on each architecture, this prediction has no subjective degrees of freedom.
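One way to write that prediction down (this is my formalization and notation, not anything from the original exchange):

$$\Pr(\text{power-seeking}) \;=\; \sum_{\text{arch}} \Pr(\text{arch}) \int \Pr(R \mid \text{arch})\, \mathbf{1}\!\big[\pi^*_{R,\, M(\text{dynamics},\, \text{arch})}\ \text{seeks power}\big]\, dR,$$

where $M(\text{dynamics}, \text{arch})$ is the MDP induced by the real environment dynamics together with that architecture's state/action encoding, and $\pi^*$ is an optimal policy for reward function $R$ in that MDP. Every ingredient is either an empirical fact (the dynamics) or an explicitly stated belief (over architectures and over reward functions conditional on architecture), so no subjective modeling freedom remains.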
I’m not claiming that there’s One Architecture To Rule Them All. I’m saying that if we want to predict what happens, we do the following (a toy sketch in code follows the list):
1. Consider the underlying environment (assumed Markovian).
2. Consider different state/action encodings we might supply the agent.
3. For each, fix a reward function distribution (what goals we expect to assign to the agent).
4. See what my theory predicts.
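To make steps 1 through 4 concrete, here is a minimal toy sketch. The five-state environment, the identity encoding, the uniform reward distribution, and the value-iteration details are all hypothetical choices of mine for illustration, not anything fixed by the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Underlying environment dynamics (assumed Markovian): from the start
# state, "left" reaches a single absorbing state, while "right" reaches a
# junction from which two different absorbing states remain available.
n_states, n_actions = 5, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = transition probability
P[0, 0, 1] = 1.0   # start, action left  -> state 1 (absorbing)
P[0, 1, 2] = 1.0   # start, action right -> state 2 (junction: more options remain)
P[2, 0, 3] = 1.0   # junction, action left  -> state 3 (absorbing)
P[2, 1, 4] = 1.0   # junction, action right -> state 4 (absorbing)
for s in (1, 3, 4):
    P[s, :, s] = 1.0  # absorbing states loop back to themselves

# (2) State/action encoding: here the agent observes the environment states
# directly (identity encoding), so its MDP model coincides with the dynamics.

# (3) Reward function distribution: i.i.d. uniform state rewards.
def sample_reward():
    return rng.uniform(0.0, 1.0, size=n_states)

# (4) For many sampled reward functions, compute an optimal policy by value
# iteration and record which action is optimal at the start state.
def optimal_start_action(reward, gamma=0.9, iters=200):
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.einsum("sat,t->sa", P, V)      # expected next-state value
        V = reward + gamma * Q.max(axis=1)
    return int(np.einsum("at,t->a", P[0], V).argmax())

n_samples = 2000
go_right = sum(optimal_start_action(sample_reward()) == 1 for _ in range(n_samples))
print(f"fraction of sampled rewards whose optimal policies go right: {go_right / n_samples:.2f}")
```

In this toy model, “right” leads to a part of the environment with two reachable absorbing states while “left” commits to a single one, and the printed fraction comes out around 0.65: most sampled reward functions make the option-preserving action optimal.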
There’s a further claim (which seems plausible, but which I’m not yet making) that (2) won’t affect (4) very much in practice. The point of this post is that if you say “the MDP has a different model”, you’re either disagreeing about the actual dynamics (1), or claiming that we will physically supply the agent with a different state/action encoding (2).
But to falsify the “5 googolplex” claim, you do need to know what the optimal policies tend to do, right? Then you need to find optimal policies and see what they do (to check that they indeed don’t power-seek by going left). That means running/simulating them, which might cause them to take over the world in the worst-case scenarios.
To falsify the “5 googolplex” claim, all you have to know is the dynamics + the agent’s observation and action encodings. That determines the MDP structure. You don’t have to run anything. (Although I suppose your proposed direction of inference is interesting: power-seeking tendencies + dynamics give you evidence about the encoding.)
The encodings + environment dynamics tell you what model the agent is interfacing with, which allows you to apply my theorems as usual.
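Continuing the same toy environment, here is a minimal sketch of that step: derive the agent-side model from the environment dynamics plus a (hypothetical, coarser) observation encoding, then read a simple structural fact off it, all without running any policy. The specific encoding and the reachable-state count are my illustrative choices, not the theorems themselves.

```python
import numpy as np

# Environment dynamics: the same five-state toy example as above.
n_env_states, n_actions = 5, 2
P_env = np.zeros((n_env_states, n_actions, n_env_states))
P_env[0, 0, 1] = 1.0   # start, left  -> absorbing state 1
P_env[0, 1, 2] = 1.0   # start, right -> junction state 2
P_env[2, 0, 3] = 1.0   # junction, left  -> absorbing state 3
P_env[2, 1, 4] = 1.0   # junction, right -> absorbing state 4
for s in (1, 3, 4):
    P_env[s, :, s] = 1.0

# Hypothetical encoding: the agent cannot distinguish absorbing states 3 and 4,
# so both are mapped onto the same agent state.
encode = np.array([0, 1, 2, 3, 3])          # environment state -> agent state
n_agent_states = int(encode.max()) + 1

# The agent-side model induced by dynamics + encoding (this simple aggregation
# is well-defined here because merged states have identical outgoing dynamics).
P_agent = np.zeros((n_agent_states, n_actions, n_agent_states))
for s in range(n_env_states):
    for a in range(n_actions):
        for s2 in range(n_env_states):
            P_agent[encode[s], a, encode[s2]] += P_env[s, a, s2]
P_agent /= P_agent.sum(axis=2, keepdims=True)  # renormalize rows of merged states

# A structural fact we can read off without running anything: how many agent
# states remain reachable after committing to each initial action?
def reachable_from(P, start):
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for s2 in np.flatnonzero(P[s].sum(axis=0) > 0):
            if int(s2) not in seen:
                seen.add(int(s2))
                frontier.append(int(s2))
    return seen

for a in range(n_actions):
    first = int(P_agent[0, a].argmax())
    print(f"initial action {a}: {len(reachable_from(P_agent, first))} agent state(s) stay reachable")
```

Here the second initial action keeps two agent states reachable versus one for the first, and that asymmetry (the kind of structural input the power-seeking results reason about) is computed purely from the model determined by dynamics + encoding, not from deploying or simulating any policy.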