To be clear, I still endorse Parametrically retargetable decision-makers tend to seek power. Its content is correct, relevant, and nontrivial. The results, properly used, may enable nontrivial inferences about the properties of inner trained cognition. I don’t really want to retract that paper. I usually just fantasize about retracting Optimal policies tend to seek power.
The problem is that I don’t trust people to wield even the non-instantly-doomed results.
For example, one EAG presentation cited my retargetability results as showing that most reward functions “incentivize power-seeking actions.” However, my results have not shown this for actual trained systems. (And I think that Power-seeking can be probable and predictive for trained agents does not make progress on the incentives of trained policies.)
People keep talking about stuff they know how to formalize (e.g. optimal policies) instead of stuff that matters (e.g. trained policies). I’m pained by this emphasis and I think my retargetability results are complicit. Relative to an actual competent alignment community (in a more competent world), we just have no damn clue how to properly reason about real trained policies. I want to fix that, but we aren’t gonna fix it by focusing on optimality.
Sorry about the cite in my “paradigms of alignment” talk, I didn’t mean to misrepresent your work. I was going for a high-level one-sentence summary of the result and I did not phrase it carefully. I’m open to suggestions on how to phrase this differently when I next give this talk.
Similarly to Steven, I usually cite your power-seeking papers to support a high-level statement that “instrumental convergence is a thing” for ML audiences, and I find they are a valuable outreach tool. For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don’t work because of instrumental convergence. (There’s a nonzero chance he would look at a NeurIPS paper and basically no chance that he would read a LW post.)
The subtleties that you discuss are important in general, but don’t seem relevant to making the basic case for instrumental convergence to ML researchers. Maybe you don’t care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard.
Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn’t read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.
I’m open to suggestions on how to phrase this differently when I next give this talk.
It’s tough to say how to apply the retargetability result to draw practical conclusions about trained policies. Part of this is because I don’t know if trained policies tend to autonomously seek power in various non game-playing regimes.
If I had to say something, I might say “If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies.” This argument appropriately behaves differently if the “outcomes” are simply different sentiment generations being sampled from an LM—sentiment shift doesn’t require power-seeking.
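To make the counting structure of that argument concrete, here is a minimal toy sketch. The setup is my own illustrative assumption, not the paper's formal one: a deterministic environment where shutting down reaches one terminal outcome while staying active keeps three outcomes reachable, rewards drawn i.i.d. uniform over the terminal outcomes, and a discount close to 1 so the comparison reduces to the best reachable terminal reward.

```python
# Toy sketch (illustrative assumptions, not the paper's formal setup): the agent
# can shut down immediately (one reachable outcome) or stay active (three
# reachable outcomes). We sample reward functions over the terminal outcomes and
# check how often the optimal policy stays active, i.e. preserves its options.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000

# Terminal outcomes: index 0 is "shut down"; indices 1-3 are reachable only by
# staying active first. Rewards are drawn i.i.d. uniform on [0, 1].
rewards = rng.uniform(size=(n_samples, 4))

# With the (assumed) discount near 1, the optimal policy shuts down only if the
# shutdown reward beats every reward reachable by staying active.
stays_active = rewards[:, 1:].max(axis=1) > rewards[:, 0]

print(f"Fraction of sampled reward functions whose optimal policy "
      f"stays active: {stays_active.mean():.3f}")  # about 0.75, i.e. 3/4
```

The 3/4 figure is just the counting: three of the four terminal outcomes sit behind the option-preserving action, which is the retargetability-style symmetry argument in miniature. It says nothing by itself about what a trained policy will do.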
For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don’t work because of instrumental convergence.
My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and I agree it has strong benefits in the situations you highlight.
Maybe you don’t care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard.
I think that it’s locally valid to point out “under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper.” But I feel a tad queasy about the overall point, since I don’t think alignment’s difficulty has much to do with the difficulties pointed out by “Optimal Policies Tend to Seek Power.” I feel better about saying “Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble.” Maybe that’s what you already communicate, though.
Thanks Alex! Your original comment didn’t read as ill-intended to me, though I wish that you’d just messaged me directly. I could have easily missed your comment in this thread—I only saw it because you linked the thread in the comments on my post.
Your suggested rephrase helps to clarify how you think about the implications of the paper, but I’m looking for something shorter and more high-level to include in my talk. I’m thinking of using this summary, which is based on a sentence from the paper’s intro: “There are theoretical results showing that many decision-making algorithms have power-seeking tendencies.”
(Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)
“There are theoretical results showing that many decision-making algorithms have power-seeking tendencies.”
I think this is reasonable, although I might say “suggesting” instead of “showing.” I think I might also be more cautious about further inferences which people might make from this—like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.
Thanks, this clarifies a lot for me.