Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.
I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the “strong distributional shift” assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can’t generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That’s why the title says “power-seeking can be predictive” not “training-compatible goals can be predictive”.
The hypothesis you mentioned seems compatible with the assumptions of this post. When you say “the policy develops motivations related to obvious correlates of its historical reinforcement signals”, these “motivations” seem like a kind of training-compatible goals (if defined more broadly than in this post). I would expect that a system that pursues these motivations in new situations would exhibit some power-seeking tendencies because those correlate with a lot of reinforcement signals.
I suspect a lot of the disagreement here comes from different interpretations of the “internal representations of goals” assumption; I will try to rephrase that part better.
That’s why the title says “power-seeking can be predictive” not “training-compatible goals can be predictive”.
You’re right. I was critiquing “power-seeking due to your assumptions isn’t probable, because I think your assumptions won’t hold” and not “power-seeking isn’t predictive.” I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:
I don’t see a notion of “objective” that can be confidently claimed is:
Probable: there is a good argument that the systems we build will have an “objective”, and
Predictive: If I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).
Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually:
The assumptions used in the post (“learns a randomly-selected training-compatible goal”) assign low probability to experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization),
Therefore the assumptions become less probable,
Therefore power-seeking becomes less probable (at least, due to these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large).
I suspect that you agree that “learns a training-compatible goal” isn’t very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the “can” in “Power-seeking can be probable and predictive.”