LLMs also suggest that AI can become as general-purpose as humans while remaining less agentic / consequentialist. LLMs have outer layers that are fairly myopic, aiming to predict a few thousand words of future text.
I want to put a lot of cold water on this, since RLHF breaks myopia and makes it more agentic, and I remember a comment from Paul Christiano on agents being better than non-agents in another post.
“Discovering Language Model Behaviors with Model-Written Evaluations” is a new Anthropic paper by Ethan Perez et al. that I (Evan Hubinger) also collaborated on. I think the results in this paper are quite interesting in terms of what they demonstrate about both RLHF (Reinforcement Learning from Human Feedback) and language models in general.
Among other things, the paper finds concrete evidence of current large language models exhibiting:
convergent instrumental goal following (e.g. actively expressing a preference not to be shut down),
non-myopia (e.g. wanting to sacrifice short-term gain for long-term gain), situational awareness (e.g. awareness of being a language model), coordination (e.g. willingness to coordinate with other AIs), and non-CDT-style reasoning (e.g. one-boxing on Newcomb’s problem). Note that many of these are the exact sort of things we hypothesized were necessary pre-requisites for deceptive alignment in “Risks from Learned Optimization”.
Furthermore, most of these metrics generally increase with both pre-trained model scale and number of RLHF steps. In my opinion, I think this is some of the most concrete evidence available that current models are actively becoming more agentic in potentially concerning ways with scale—and in ways that current fine-tuning techniques don’t generally seem to be alleviating and sometimes seem to be actively making worse.
Interestingly, the RLHF preference model seemed to be particularly fond of the more agentic option in many of these evals, usually more so than either the pre-trained or fine-tuned language models. We think that this is because the preference model is running ahead of the fine-tuned model, and that future RLHF fine-tuned models will be better at satisfying the preferences of such preference models, the idea being that fine-tuned models tend to fit their preference models better with additional fine-tuning.[1]
This is an important point, since this is evidence against the thesis that non-agentic systems will be developed by default. Also, RLHF came from OpenAI’s alignment plan, which is worrying for Goodhart.
I don’t really think RLHF “breaks myopia” in any interesting sense. I feel like LW folks are being really sloppy in thinking and talking about this. (Sorry for replying to this comment, I could have replied just as well to a bunch of other recent posts.)
I’m not sure what evidence you are referring to: in Ethan’s paper the RLHF models have the same level of “myopia” as LMs. They express stronger desire for survival, and a weaker acceptance of ends-justify-means reasoning
But more importantly, all of these are basically just personality questions that I expect would be affected in an extremely similar way by supervised data. The main thing the model does is predict that humans would rate particular responses highly (e.g. responses indicating that they don’t want to be shut down), which is really not about agency at all.
The direction of the change mostly seems sensitive to the content of the feedback rather than RLHF itself. As illustration, the biggest effects are liberalism and agreeableness, and right after that is confucianism. These models are not confucian “because of RLHF,” they are confucian because of the particular preferences expressed by human raters.
I think the case for non-myopia comes from some kind of a priori reasoning about RLHF vs prediction, but as far as I can currently tell the things people are saying here really don’t check out.
That said, people are training AI systems to be helpful assistants by copying (human) consequentialist behavior. Of course you are going to get agents in the sense of systems that select actions based on their predicted consequences in the world. Right now RLHF is relevant mostly because it trains them to be sensitive to the size of mistakes rather than simply giving a best guess about what a human would do, and to imitate the “best thing in their repertoire” rather than the “best thing in the human repertoire.” It will also ultimately lead to systems with superhuman performance and that try to fool and manipulate human raters, but this is essentially unrelated to the form of agency that is measured in Ethan’s paper or that matters for the risk arguments that are in vogue on LW.
I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence.
RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better if they think what current tokens will make it easier for the entire output to be rated highly (instead of outputting just one current token that a human would like).
RLHF basically predicts “what token would come next in a high-reward trajectory?” (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.)
GPT predicts “what token would come next in this text,” where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.
I don’t think those are particularly different in the relevant sense. They both produce consequentialist behavior in the obvious way. The relationship between the objective and the model’s cognition is unclear in both cases and it seems like they should give rise to very similar messy things that differ in a ton of details.
The superficial difference in myopia doesn’t even seem like the one that’s relevant to deceptive alignment—we would be totally fine with an RLHF model that optimized over a single episode. The concern is that you get a system that wants some (much!) longer-term goal and then behaves well in a single episode for instrumental reasons, and that needs to be compared to a system which wants some long-term goal and then predicts tokens well for instrumental reasons. (I think this form of myopia is also really not related to realistic reward hacking concerns, but that’s slightly more complicated and also less central to the concerns currently in vogue here.)
I think someone should actually write out this case in detail so it can be critiqued (right now I don’t believe it). I think there is a version of this claim in the “conditioning generative models” sequence which I critiqued in the draft I read, I could go check the version that got published to see if I still disagree. I definitely don’t think it’s obvious, and as far as I can tell no evidence has yet come in.
In RLHF, the gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable with a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.
That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens that come after the current one outside of what the consequentialists writing (and selecting, good point) the text would be caring about. At the lowest loss, the GPT doesn’t use much more optimization power than what the consequentialists plausibly had/used.
I have an intuition about a pretty clear difference here (and have been quite sure that RL breaks myopia for a while now) and am surprised by your expectation for myopia to be preserved. RLHF means optimizing the entire trajectory to get a high reward, with every choice of every token. I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new move that leads to better results, with the actions selected for what results they produce in the end. It feels almost tautological: if the model sees a way to achieve a better long-term outcome, it will do that to score better. The model will be steered towards achieving predictably better outcomes in the end. The fact that it’s a transformer that individually picks every token doesn’t mean that RL won’t make it focus on achieving a higher score. Why would the game being about human feedback prevent a GPT from becoming agentic and using more and more of its capabilities to achieve longer-term goals?
(I haven’t read the “conditioning generative models“ sequence, will probably read it. Thank you)
I also don’t know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the “RLHF --> non-myopia --> treacherous turn” argument so that it can be discussed more critically.
I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new move that leads to better results, with the actions selected for what results they produce in the end
Yes, of course such a model will make superhuman moves (as will GPT if prompted on “Player 1 won the game by making move X”), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts).
But I think the thing I’m objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.
I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive humane valuators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that’s all you are saying we are in agreement, and my claim is just that it doesn’t really challenge Peter’s summary or (or represent a particularly important problem for RLHF).
I want to put a lot of cold water on this, since RLHF breaks myopia and makes it more agentic, and I remember a comment from Paul Christiano on agents being better than non-agents in another post.
Here’s a link and evidence:
https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written#RHHrxiyHtuuibbcCi
This is an important point, since this is evidence against the thesis that non-agentic systems will be developed by default. Also, RLHF came from OpenAI’s alignment plan, which is worrying for Goodhart.
So in conclusion, yes we are making agentic AIs.
I don’t really think RLHF “breaks myopia” in any interesting sense. I feel like LW folks are being really sloppy in thinking and talking about this. (Sorry for replying to this comment, I could have replied just as well to a bunch of other recent posts.)
I’m not sure what evidence you are referring to: in Ethan’s paper the RLHF models have the same level of “myopia” as LMs. They express stronger desire for survival, and a weaker acceptance of ends-justify-means reasoning
But more importantly, all of these are basically just personality questions that I expect would be affected in an extremely similar way by supervised data. The main thing the model does is predict that humans would rate particular responses highly (e.g. responses indicating that they don’t want to be shut down), which is really not about agency at all.
The direction of the change mostly seems sensitive to the content of the feedback rather than RLHF itself. As illustration, the biggest effects are liberalism and agreeableness, and right after that is confucianism. These models are not confucian “because of RLHF,” they are confucian because of the particular preferences expressed by human raters.
I think the case for non-myopia comes from some kind of a priori reasoning about RLHF vs prediction, but as far as I can currently tell the things people are saying here really don’t check out.
That said, people are training AI systems to be helpful assistants by copying (human) consequentialist behavior. Of course you are going to get agents in the sense of systems that select actions based on their predicted consequences in the world. Right now RLHF is relevant mostly because it trains them to be sensitive to the size of mistakes rather than simply giving a best guess about what a human would do, and to imitate the “best thing in their repertoire” rather than the “best thing in the human repertoire.” It will also ultimately lead to systems with superhuman performance and that try to fool and manipulate human raters, but this is essentially unrelated to the form of agency that is measured in Ethan’s paper or that matters for the risk arguments that are in vogue on LW.
I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence.
RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better if they think what current tokens will make it easier for the entire output to be rated highly (instead of outputting just one current token that a human would like).
RLHF basically predicts “what token would come next in a high-reward trajectory?” (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.)
GPT predicts “what token would come next in this text,” where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.
I don’t think those are particularly different in the relevant sense. They both produce consequentialist behavior in the obvious way. The relationship between the objective and the model’s cognition is unclear in both cases and it seems like they should give rise to very similar messy things that differ in a ton of details.
The superficial difference in myopia doesn’t even seem like the one that’s relevant to deceptive alignment—we would be totally fine with an RLHF model that optimized over a single episode. The concern is that you get a system that wants some (much!) longer-term goal and then behaves well in a single episode for instrumental reasons, and that needs to be compared to a system which wants some long-term goal and then predicts tokens well for instrumental reasons. (I think this form of myopia is also really not related to realistic reward hacking concerns, but that’s slightly more complicated and also less central to the concerns currently in vogue here.)
I think someone should actually write out this case in detail so it can be critiqued (right now I don’t believe it). I think there is a version of this claim in the “conditioning generative models” sequence which I critiqued in the draft I read, I could go check the version that got published to see if I still disagree. I definitely don’t think it’s obvious, and as far as I can tell no evidence has yet come in.
In RLHF, the gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable with a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.
That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens that come after the current one outside of what the consequentialists writing (and selecting, good point) the text would be caring about. At the lowest loss, the GPT doesn’t use much more optimization power than what the consequentialists plausibly had/used.
I have an intuition about a pretty clear difference here (and have been quite sure that RL breaks myopia for a while now) and am surprised by your expectation for myopia to be preserved. RLHF means optimizing the entire trajectory to get a high reward, with every choice of every token. I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to start picking instead a new move that leads to better results, with the actions selected for what results they produce in the end. It feels almost tautological: if the model sees a way to achieve a better long-term outcome, it will do that to score better. The model will be steered towards achieving predictably better outcomes in the end. The fact that it’s a transformer that individually picks every token doesn’t mean that RL won’t make it focus on achieving a higher score. Why would the game being about human feedback prevent a GPT from becoming agentic and using more and more of its capabilities to achieve longer-term goals?
(I haven’t read the “conditioning generative models“ sequence, will probably read it. Thank you)
I also don’t know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the “RLHF --> non-myopia --> treacherous turn” argument so that it can be discussed more critically.
Yes, of course such a model will make superhuman moves (as will GPT if prompted on “Player 1 won the game by making move X”), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts).
But I think the thing I’m objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.
I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive humane valuators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that’s all you are saying we are in agreement, and my claim is just that it doesn’t really challenge Peter’s summary or (or represent a particularly important problem for RLHF).