I’m surprised. It feels to me that there’s an obvious difference between predicting one token of text from a dataset and trying to output a token in a sequence with some objective about the entire sequence.
RLHF models optimize for the entire output to be rated highly, not just for the next token, so (if they’re smart enough) they perform better by thinking about which current tokens will make it easier for the entire output to be rated highly (instead of just outputting one current token that a human would like).
RLHF basically predicts “what token would come next in a high-reward trajectory?” (The only way it differs from the prediction objective is that worse-than-average trajectories are included with negative weight rather than simply being excluded altogether.)
GPT predicts “what token would come next in this text,” where the text is often written by a consequentialist (i.e. optimized for long-term consequences) or selected to have particular consequences.
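To make that concrete, here is a minimal sketch of the two losses (mine, not anything from this thread; it assumes a PyTorch-style causal LM that maps token ids to next-token logits, and all the names are placeholders):

```python
import torch
import torch.nn.functional as F

def prediction_loss(model, tokens):
    """Plain next-token prediction: every dataset token gets weight 1."""
    logits = model(tokens[:, :-1])                     # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def rlhf_loss(model, sampled_tokens, trajectory_rewards, baseline):
    """REINFORCE-style RLHF: the same per-token log-prob term, weighted by how
    much the whole sampled trajectory's reward beats a baseline."""
    logits = model(sampled_tokens[:, :-1])             # (batch, seq-1, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, sampled_tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    advantage = (trajectory_rewards - baseline).unsqueeze(-1)   # broadcast over tokens
    return -(advantage * token_logp).mean()
```

In both losses the per-token term is identical; the only difference is the trajectory-level weighting, which is the sense in which worse-than-average trajectories enter with negative weight.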
I don’t think those are particularly different in the relevant sense. They both produce consequentialist behavior in the obvious way. The relationship between the objective and the model’s cognition is unclear in both cases and it seems like they should give rise to very similar messy things that differ in a ton of details.
The superficial difference in myopia doesn’t even seem like the one that’s relevant to deceptive alignment—we would be totally fine with an RLHF model that optimized over a single episode. The concern is that you get a system that wants some (much!) longer-term goal and then behaves well in a single episode for instrumental reasons, and that needs to be compared to a system which wants some long-term goal and then predicts tokens well for instrumental reasons. (I think this form of myopia is also really not related to realistic reward hacking concerns, but that’s slightly more complicated and also less central to the concerns currently in vogue here.)
I think someone should actually write out this case in detail so it can be critiqued (right now I don’t believe it). I think there is a version of this claim in the “conditioning generative models” sequence which I critiqued in the draft I read; I could go check the version that got published to see if I still disagree. I definitely don’t think it’s obvious, and as far as I can tell no evidence has yet come in.
In RLHF, gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achievable by a superintelligence that thinks about the sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.
That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens that come after it, beyond what the consequentialists writing (and selecting, good point) the text would care about. At the lowest loss, the GPT doesn’t use much more optimization power than the consequentialists plausibly had/used.
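To state the contrast at the optimum (my formulation, using standard results rather than anything specific to this thread): the prediction objective is minimized by matching the data’s conditionals, while the RL objective is maximized by concentrating on whole high-reward trajectories,

\[
\pi^{*}_{\text{pred}}(x_t \mid x_{<t}) = p_{\text{data}}(x_t \mid x_{<t}), \qquad
\pi^{*}_{\text{RL}} = \arg\max_{\pi}\, \mathbb{E}_{x \sim \pi}\!\left[R(x)\right].
\]

The unregularized RL optimum puts all of its mass on \(\arg\max_x R(x)\); with a KL penalty toward a base model \(p_{\text{base}}\) (as in common RLHF setups) it becomes \(\pi^{*}(x) \propto p_{\text{base}}(x)\exp(R(x)/\beta)\), which still reweights entire trajectories by their reward rather than individual tokens.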
I have an intuition about a pretty clear difference here (and have been quite sure that RL breaks myopia for a while now), and am surprised that you expect myopia to be preserved. RLHF means optimizing the entire trajectory to get a high reward, with every choice of every token. I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but don’t realize it’s suboptimal, you would expect the model, once it becomes smarter and understands the game well enough, to start picking a different move that leads to better results, with its actions selected for the results they produce in the end. It feels almost tautological: if the model sees a way to achieve a better long-term outcome, it will do that to score better. The model will be steered towards achieving predictably better outcomes in the end. The fact that it’s a transformer that individually picks every token doesn’t mean that RL won’t make it focus on achieving a higher score. Why would the game being about human feedback prevent a GPT from becoming agentic and using more and more of its capabilities to achieve longer-term goals?
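Here is a toy version of that game, just to pin down the mechanism I mean (all numbers hypothetical: move 0 is the humans’ habitual move with payoff 0.3, move 1 is the unnoticed better move with payoff 0.9, and the “model” is just a pair of logits trained with REINFORCE):

```python
import torch

logits = torch.zeros(2, requires_grad=True)   # move 0: humans' habitual move; move 1: the better move
rewards = torch.tensor([0.3, 0.9])            # hypothetical payoffs; the humans never notice move 1 is better
opt = torch.optim.SGD([logits], lr=0.5)

for _ in range(2000):
    probs = torch.softmax(logits, dim=0)
    move = torch.multinomial(probs.detach(), 1).item()   # play a move under the current policy
    baseline = (probs * rewards).sum().detach()          # expected reward under the current policy
    advantage = rewards[move] - baseline                 # negative when the move is worse than average
    loss = -advantage * torch.log(probs[move])           # REINFORCE on the game outcome
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))  # probability mass ends up concentrated on move 1
```

An imitation loss on the humans’ games would instead keep all the mass on move 0; RL on the outcomes is what pulls the policy off it.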
(I haven’t read the “conditioning generative models” sequence; I’ll probably read it. Thank you.)
I also don’t know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the “RLHF --> non-myopia --> treacherous turn” argument so that it can be discussed more critically.
“I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but don’t realize it’s suboptimal, you would expect the model, once it becomes smarter and understands the game well enough, to start picking a different move that leads to better results, with its actions selected for the results they produce in the end.”
Yes, of course such a model will make superhuman moves (as will GPT if prompted on “Player 1 won the game by making move X”), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts).
But I think the thing I’m objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.
I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive human evaluators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that’s all you are saying, we are in agreement, and my claim is just that it doesn’t really challenge Peter’s summary (or represent a particularly important problem for RLHF).