Thanks for your patient and high-quality engagement here, Vika! I hope my original comment doesn’t read as a passive-aggressive swipe at you. (I consciously tried to optimize it to not be that.) I wanted to give concrete examples so that Wei_Dai could understand what was generating my feelings.
I’m open to suggestions on how to phrase this differently when I next give this talk.
It’s tough to say how to apply the retargetability result to draw practical conclusions about trained policies. Part of the difficulty is that I don’t know whether trained policies tend to autonomously seek power in various non-game-playing regimes.
If I had to say something, I might say “If choosing the reward function lets us steer the training process to produce a policy which brings about outcome X, and most outcomes X can only be attained by seeking power, then most chosen reward functions will train power-seeking policies.” This argument appropriately behaves differently if the “outcomes” are simply different sentiment generations being sampled from an LM—sentiment shift doesn’t require power-seeking.
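To sketch the shape of the counting argument behind that sentence (a loose paraphrase of the retargetability-style reasoning, eliding the formal conditions, not a new claim): suppose $\Theta$ is a space of reward functions, $f_\theta$ is the policy trained on $\theta$, and for every $\theta$ whose trained policy attains a non-power-seeking outcome, there are at least $n$ permuted parameters whose trained policies attain power-seeking outcomes. Then within each permutation orbit of reward functions,
$$
\frac{\left|\{\theta : f_\theta \text{ seeks power}\}\right|}{\left|\text{orbit}\right|} \;\geq\; \frac{n}{n+1},
$$
which is one way to cash out "most chosen reward functions." Whether real training processes actually satisfy that retargetability premise is exactly the empirical question I'm unsure about.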
For example, last year I pointed David Silver to the optimal policies paper when he was proposing some alignment ideas to our team that we would expect don’t work because of instrumental convergence.
My guess is that the optimal policies paper was net negative for technical understanding and progress, but net positive for outreach, and I agree it has strong benefits in the situations you highlight.
Maybe you don’t care about optimal policies, but many RL people do, and I think these results can help them better understand why alignment is hard.
I think that it’s locally valid to point out “under your beliefs (about optimal policies mattering a lot), the situation is dangerous, read this paper.” But I feel a tad queasy about the overall point, since I don’t think alignment’s difficulty has much to do with the difficulties pointed out by “Optimal Policies Tend to Seek Power.” I feel better about saying “Look, if in fact the same thing happens with trained policies, which are sometimes very different, then we are in trouble.” Maybe that’s what you already communicate, though.
Thanks Alex! Your original comment didn’t read as ill-intended to me, though I wish that you’d just messaged me directly. I could have easily missed your comment in this thread—I only saw it because you linked the thread in the comments on my post.
Your suggested rephrase helps to clarify how you think about the implications of the paper, but I’m looking for something shorter and more high-level to include in my talk. I’m thinking of using this summary, which is based on a sentence from the paper’s intro: “There are theoretical results showing that many decision-making algorithms have power-seeking tendencies.”
(Looking back, the sentence I used in the talk was a summary of the optimal policies paper, and then I updated the citation to point to the retargetability paper and forgot to update the summary...)
“There are theoretical results showing that many decision-making algorithms have power-seeking tendencies.”
I think this is reasonable, although I might say “suggesting” instead of “showing.” I think I might also be more cautious about further inferences which people might make from this—like I think a bunch of the algorithms I proved things about are importantly unrealistic. But the sentence itself seems fine, at first pass.