Reading back over this now, I think we’re arguing at cross purposes in some ways. I should have clarified earlier that my specific argument was against policies learning a terminal goal of reward that generalizes to long-term power-seeking.
I do expect deceptive alignment after policies learn other broadly-scoped terminal goals and realize that reward-maximization is a good instrumental strategy. So all my arguments about the naturalness of reward-maximization as a goal are focused on the question of which type of terminal goal policies with dangerous levels of capabilities learn first.* Let’s distinguish three types (where “myopic” is intended to mean something like “only cares about the current episode”).
1. Non-myopic misaligned goals that lead to instrumental reward maximization (deceptive alignment)
2. Myopic terminal reward maximization
3. Non-myopic terminal reward maximization
Either 1 or 2 (or both) seems plausible to me; 3 is the one I’m skeptical about. Why?
We should expect models to have fairly robust terminal goals (since, unlike beliefs or instrumental goals, terminal goals shouldn’t change quickly with new information). So once they understand the concept of reward maximization, it’ll be easier for them to adopt it as an instrumental strategy than as a terminal goal. (An analogy to evolution: once humans construct highly novel strategies for maximizing genetic fitness, like making thousands of clones, people are more likely to pursue them for instrumental reasons than for terminal ones.)
Even if they adopt reward maximization as a terminal goal, they’re more likely to adopt a myopic version of it than a non-myopic version, since (I claim) the concept of reward maximization doesn’t generalize very naturally to larger scales. Above, you point out that even relatively myopic reward maximization will lead to limited takeover, and so we’ll train subsequent agents to be less myopic. But it seems to me that the selection pressure generated by a handful of examples of real-world attempted takeovers is very small, compared with other aspects of training; and that even if it’s significant, it may just teach agents specific constraints like “don’t take over datacenters”.
* Now that I say that, I notice that I’m also open to the possibility of policies learning deceptively-aligned goals first, then gradually shifting from reward-maximization as an instrumental goal to reward-maximization as a terminal goal. But let’s focus for now on which goals are learned first.