Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.
I want to explain my position on a couple points in particular though—they would’ve been a central focus of what I imagined my post to be, points around which I’ve been thinking a lot recently. I haven’t talked to a lot of people about this explicitly so I don’t have high credence in my take, but it seems at least worth clarifying.
RLHF is less safe than imitation or conditioning generative models.
My picture on why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux: the intelligence we’re wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are safe, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) that move us far away from the default world state.
So here’s one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF’d (so to speak) have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse, or my own prior work, which addresses this effect in these terms more directly, since you’ve probably read the former). We get a posterior that doesn’t have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we’re using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model’s simulations because it’ll ripple through its various abstractions in ways specific to how they’re structured inside the model, which are probably pretty different from how we structure our abstractions and how we make predictions about how changes ripple out.
So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don’t have the useful safety measures implied by being weighted by a true approximation of our world.
Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this explains to an extent the results in that paper—RLHF’d models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me: we’re making agency more accessible to, and more powerful from, ordinary prompting; more powerful agency is inherently tied to properties we don’t really want in simulacra; and said agency is sampled from a not-so-familiar ontology to boot.
(Only skimmed the post for now because I’m technically on break, it’s possible I missed something crucial).
I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.
So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don’t have the useful safety measures implied by being weighted by a true approximation of our world.
If predicting webtext is a good way to get things done, people can do that. But probably it isn’t, and so people probably won’t do that unless you give them a good reason.
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
The main way I can see it going is that you can condition the webtext model on other things like “there is a future AGI generating this text...” or “What action leads to consequence X?” But I think those things are radically less safe than predicting demonstrations in the lab, and lead to almost all the same difficulties if they in fact improve capabilities.
Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”? There is one form of this I agree with—models trained with RLHF will likely try to produce outputs humans rate highly, including by e.g. producing outputs that drive humans insane to give them a good rating or whatever. But overall people seem to be reacting to some different more associative reason for concern that I don’t think makes sense (yet).
Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense.
So does conditioning the model to get it to do something useful. Also I think “focuses the model’s computation on agency in some sense” is probably too vague to be a helpful way to think about what’s going on—it seems like it leads the model to produce outputs that it thinks would have certain kinds of consequences, or that imitate the kinds of heuristics and processes used by consequentialists in the dataset. This happens quite a lot when you continue webtext, since it’s all written by consequentialists.
I think Janus’ post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That’s clearly true and intentional, and you can’t get entropy back just by turning up temperature.
I think I agree with this being the most object-level takeaway; my take then would primarily be about how to conceptualize this loss of entropy (where and in what form) and what else it might imply. I found the “narrowing the prior” frame rather intuitive in this context.
That said, almost all the differences that Janus and you are highlighting emerge from supervised fine-tuning. I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I agree that everything I said above qualitatively applies to supervised fine-tuning as well. As I mentioned in another comment, I don’t expect the RL part to play a huge role until we get to wilder applications. I’m worried about RLHF more because I expect it to be scaled up a lot more in the future, and because it plausibly does what fine-tuning does, but better (this is just based on how more recent models have shifted to using RLHF instead of ordinary fine-tuning).
I don’t think “predict human demonstrators” is how I would frame the relevant effect from fine-tuning. More concretely, what I’m picturing is along the lines of: If you fine-tune the model such that continuations in a conversation are more polite/inoffensive (where this is a stand-in for whatever “better” rated completions are), then you’re not learning the actual distribution of the world anymore. You’re trying to learn a distribution that’s identical to ours except that conversations are more polite. In other words, you’re trying to predict “X, but nicer”.
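(One standard way to make “X, but nicer” precise, using the usual KL-penalized RLHF setup rather than anything specific from this thread: the optimum isn’t the old distribution with one attribute edited, it’s the entire base distribution multiplicatively re-tilted by the reward, so every property correlated with the reward shifts as well.)

```latex
\max_{\pi}\;\mathbb{E}_{x\sim\pi}\!\left[r(x)\right]\;-\;\beta\,D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\text{base}}\right)
\qquad\Longrightarrow\qquad
\pi^{*}(x)\;=\;\frac{1}{Z}\,\pi_{\text{base}}(x)\,\exp\!\left(r(x)/\beta\right)
```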
The problem I see with this is that you aren’t just affecting this in isolation, you’re also affecting the other dynamics that these interact with. Conversations in our world just aren’t that likely to be polite. Changing that characteristic ripples out to change other properties upstream and downstream of that one in a simulation. Making this kind of change seems to lead to rather unpredictable downstream changes. I say seems because -
The other implications about how RLHF changes behavior seem like they either come from cherry-picked and misleading examples or just to not be backed by data or stated explicitly.
- This is interesting. Could you elaborate on this? I think this might be a crux in our disagreement.
Maybe the safety loss comes from “produce things that evaluators in the lab like” rather than “predict demonstrations in the lab”?
I don’t think the safety loss (at least the part I’m referring to here) comes from the first-order effects of predicting something else. It’s the second-order effects on GPT’s prior at large from changing a few aspects that seem hard to predict, and that’s what worries me.
So does conditioning the model to get it to do something useful.
I agree. I think there’s a qualitative difference when you’re changing the model’s learned prior rather than just conditioning, though. Specifically, where ordinary GPT has to learn a lot of different processes at relatively similar fidelity to accurately simulate all the different kinds of contexts it was trained on, fine-tuned GPT can learn to simulate some kinds of processes with higher fidelity at the expense of others that are well outside the context of what it’s been fine-tuned on.
(As stated in the parent, I don’t have very high credence in my stance, and lack of accurate epistemic status disclaimers in some places is probably just because I wanted to write fast).
I mostly care about how an AI selected to choose actions that lead to high reward might select actions that disempower humanity to get a high reward, or about how an AI pursuing other ambitious goals might choose low loss actions instrumentally and thereby be selected by gradient descent.
Perhaps there are other arguments for catastrophic risk based on the second-order effects of changes from fine-tuning rippling through an alien mind, but if so I either want to see those arguments spelled out or more direct empirical evidence about such risks.
One consequence downstream of this that seems important to me in the limit:
1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you.
2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you.
I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.
Just a note here: I would not interpret fine-tuned GPTs as still “predicting” tokens. Base models predict tokens by computing a probability distribution conditional on the prompt, but for fine-tuned models this distribution no longer represents probabilities but rather some “goodness” relative to the fine-tuning objective, i.e. how good the continuation is. Tokens with higher scores are then not necessarily more probable continuations of the prompt (though next-token probability may also play a role) but overall “better” in some opaque way. We hope that what the model thinks is a better token for the continuation of the prompt corresponds to the goals of being helpful, harmless and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown.
So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast, we know pretty much exactly what a base model was optimized for: next-token prediction.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.
You don’t know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
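(To unpack “an unbiased estimator of reward” a little, in standard notation that isn’t from the thread: the base model is trained to minimize next-token log loss, while RLHF-style updates follow a score-function estimate whose expectation is the gradient of expected reward, usually alongside a KL penalty toward the base model.)

```latex
\mathcal{L}_{\text{pretrain}}(\theta)\;=\;-\,\mathbb{E}_{x\sim\mathcal{D}}\sum_{t}\log p_{\theta}\!\left(x_{t}\mid x_{<t}\right),
\qquad
\nabla_{\theta}\,\mathbb{E}_{x\sim p_{\theta}}\!\left[r(x)\right]\;=\;\mathbb{E}_{x\sim p_{\theta}}\!\left[\,r(x)\,\nabla_{\theta}\log p_{\theta}(x)\,\right]
```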
This relates to what you wrote in the other thread:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next-token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case.
Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn’t “believe them” in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact “believes” those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown.
So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters, seem to be problems of highly uneven difficulty, with correspondingly uneven probability of misalignment.
Agreed, though I do find framing them as a warped predictor helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don’t have anything super concrete here and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking.)
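(One concrete, if idealized, handle on that “bias vector”: if the fine-tuned model sat exactly at the KL-penalized optimum for some reward r, the log-probability ratio against the base model would recover r up to a prompt-dependent constant; this is the identity behind approaches like DPO. Real fine-tuned models won’t sit exactly at that optimum, so treat this as motivation rather than a method.)

```latex
\log\frac{\pi_{\text{ft}}(x)}{\pi_{\text{base}}(x)}\;=\;\frac{r(x)}{\beta}\;-\;\log Z
```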
What do you mean when you say the model is or is not “fighting you”?
I mean a model “fights” you if the model itself has goals and those goals are at odds with yours. In this context, a model cannot “fight” you if it does not have goals. It can still output things which are bad for you, like an agentic simulacrum that does fight you.
I suspect effective interventions are easier to find when dealing with a goal agnostic model simulating a potentially dangerous agent, compared to a goal-oriented model that is the potentially dangerous agent.
In both cases the model produces actions that are expected to have certain kinds of effects. Could you spell out what kind of “fighting” happens, or what kind of “intervention” is possible when you are merely conditioning your model and not fine-tuning it?
I haven’t engaged much with this kind of thinking on LW or in the broader safety community, but right now I don’t really get it and it feels like anthropomorphizing or magical thinking.
I’ll start with a pretty uncontroversial example that’s neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.
Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.
With the way I was using the word “fighting”, I would say that the first model is fighting you (a little bit), and the second one isn’t. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.
Next, an impractical and extreme example, again with two models:
1. LLM fine tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. “LLM” operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first model.
To be clear, I don’t think operating “closer to physics” here is helpful for safety, but I think it’s easier to say that the deception in the second model’s outputs is not arising from attractors defined in the model directly. The second model is simulating processes which yield deception, but it does not directly “care” about deception any more than the real laws of physics do when a real sociopath is being deceptive.
The first model is more like the sociopath itself, because the deceptive processes are a part of what the model was optimized to compute (in this hypothetical).
Directly querying the first model gets you lies; querying the second model gets you an authentic piece of the process that yields lies.
The next obvious question here is: how does caring about some intermediate state in the computation that yields deception help? Why is this any better than looking at some intermediate layer’s activations in the first deceptive model?
My answer would be: myopic action through calibration.
The final output of the nonconditioned fine tuned deceptive model no longer represents a probability distribution over a ground truth. The “probabilities” it outputs include a bias which, here, encodes goals corresponding to deception. Any intermediate layer in the model could be arbitrarily warped to serve those goals.
The second model (the almost-physics simulator) does not have those degrees of freedom. Goal oriented behavior diverging from the training distribution must show up as some form of bias in prediction, and that bias is directly counteracted by the training objective. (Raw GPT-esque models consider the future in their predictions in various ways, but the actual prediction remains myopic: the model is not incentivized to pick the “wrong” token now to serve a later prediction or to otherwise smuggle in distribution-breaking bias.)
In other words, in the absence of strong gradient hacking or similarly exotic phenomena, the pure token predictor model remains goal agnostic. There is no direct or indirect training pressure to obfuscate its internal process, and its outputs will not be “intentionally” deceptive; they’re just authentic attempts at probability distributions.
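(One way to make the calibration point precise, in standard notation not taken from the thread: log loss is a proper scoring rule. For every context, the expected per-token loss decomposes into an irreducible entropy term plus a KL term that is zero exactly when the predictive distribution matches the true conditional, so any systematic “bias” is paid for directly, and the loss at step t contains no term rewarding a sacrifice of the current token for later ones.)

```latex
\mathbb{E}_{x_{t}\sim p(\cdot\mid x_{<t})}\!\left[-\log q_{\theta}(x_{t}\mid x_{<t})\right]
\;=\;H\!\left(p(\cdot\mid x_{<t})\right)\;+\;D_{\mathrm{KL}}\!\left(p(\cdot\mid x_{<t})\,\|\,q_{\theta}(\cdot\mid x_{<t})\right)
```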
Similar to the original example’s second model (which was also a pure predictor), I would expect that such a model is not adversarial and is easier to interpret.
With that in mind, we can pull the predictor up to a more practical level of execution, similar to that of the other deceptive model (instead of ‘almost physics’), and all of the same properties still hold.
Conditioning models for helpfulness, e.g. with decision transformers, doesn’t change the training objective either. It’s just a change of inputs (conditions) like any other tokens, so the same properties should hold again.
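(Spelled out: with condition tokens c prepended, whether a decision-transformer return-to-go or a quality marker, training still minimizes the ordinary next-token log loss, just on the extended input; the notation here is standard, not from the thread.)

```latex
\mathcal{L}(\theta)\;=\;-\sum_{t}\log p_{\theta}\!\left(x_{t}\mid c,\;x_{<t}\right)
```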
In another comment, you mention:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I agree with this. My concern is about forms of fine tuning that aren’t equivalent to well-calibrated predictions of human demonstrators, and about training mechanisms that take an indirect/exploit-prone route to something that looks like predictions of human demonstrators.
I don’t think the more general form of RLHF is inherently broken. I just suspect that fine tuning that preserves model-level goal agnosticism will produce less adversarial models.
Regarding your points on agentic simulacra (which I assume means “agentic personas the language model ends up imitating”):
1) My best guess about why Anthropic’s model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.
2) But I’m pretty skeptical about your intuition that RLHF makes the “imitating agentic personas” problem worse. When people I’ve spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic’s paper, they usually mean either:
(a) prompt engineering; or
(b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to produce outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token “GOOD” and then ask the model to produce outputs which start with “GOOD”.)
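(For concreteness, here is a minimal sketch of what (b) could look like; none of this is from the thread or from Anthropic’s actual setup. The model choice, the toy rate function standing in for human annotation, the “GOOD: ” marker string, and the hyperparameters are all illustrative assumptions.)

```python
# Sketch of "conditional training": sample, rate, mark the best with a prefix,
# fine-tune with the ordinary LM objective, then condition on the prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def rate(text: str) -> float:
    """Stand-in for human annotation of how much we liked an output."""
    return -text.count("!")  # toy heuristic, purely illustrative

prompts = ["The assistant replied:", "Here is some advice:"]
good_examples = []

# 1) Sample candidate outputs and keep the best-rated one per prompt.
for prompt in prompts:
    inputs = tok(prompt, return_tensors="pt")
    outs = model.generate(
        **inputs, do_sample=True, max_new_tokens=30,
        num_return_sequences=4, pad_token_id=tok.eos_token_id,
    )
    texts = [tok.decode(o, skip_special_tokens=True) for o in outs]
    # 2) Prefix the best output with a marker.
    good_examples.append("GOOD: " + max(texts, key=rate))

# 3) Ordinary supervised fine-tuning on the marked examples
#    (same next-token objective as pretraining, just new data).
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in good_examples:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# 4) At inference time, condition on the marker to ask for "good" outputs.
model.eval()
query = tok("GOOD: The assistant replied:", return_tensors="pt")
sample = model.generate(**query, do_sample=True, max_new_tokens=30,
                        pad_token_id=tok.eos_token_id)
print(tok.decode(sample[0], skip_special_tokens=True))
```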
Approach (b) really doesn’t seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.)
I mostly am skeptical that approach (a) will be able to produce anything as useful as Anthropic’s chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF.
Interested to hear if you have other intuitions here.
I wasn’t really focusing on the RL part of RLHF in making the claim that it makes the “agentic personas” problem worse, if that’s what you meant. I’m pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won’t be apparent until we use stronger RL or something. Then I expect we’ll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).
My claim is pretty similar to how you put it—in RLHF as in fine-tuning of the kind relevant here, we’re focusing the model onto outputs that are generated by a better agentic persona. But I think that the effect is particularly salient with RLHF because it’s likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.
Glad to see both the OP as well as the parent comment.
I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper “Discovering Language Model Behaviors with Model-Written Evaluations” (paper, post):
Another reason for not liking RLHF that’s somewhat related to the Anthropic paper you linked: because most contexts RLHF is used involve agentic simulacra, RLHF focuses the model’s computation on agency in some sense. My guess is that this explains to an extent the results in that paper—RLHF’d models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on.
1) My best guess about why Anthropic’s model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires.
Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That’s true, but I think it’s just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size.
Thanks! My take on the scaled-up models exhibiting the same behaviours feels more banal—larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.
This, broadly-speaking, is also my best guess, but I’d rather phrase it as: larger LMs are better at making the personas they imitate “realistic” (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic’s language model was imitating, one correlate of being more realistic was preferring not to be shut down.
(This doesn’t obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor’s preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic’s findings, that means this story gets a complexity penalty.)
Models that have been RLHF’d (so to speak) have different world priors in ways that aren’t really all that intuitive (see Janus’ work on mode collapse
Janus’ post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It’s evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does.
I haven’t seen evidence that RLHF’d text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
Refer to my other reply here. And as the post mentions, RLHF also does exhibit mode collapse (check the section on prior work).
Similar points regarding safety of pure imitation learning vs reinforcement learning have been raised by many others on LW. So I’m really interested what Paul has to say about this.
I haven’t engaged with this much, though I’ve e.g. talked with Evan some about why I’m not as excited about conditioning generative models as a strategy. I’m happy to engage with particular arguments but feel like I don’t really know what argument is being made by the parent (or most of the other places I’ve seen this in passing).
I think there is a simple reason imitation is safer: the model won’t deliberately produce actions that the demonstrator wouldn’t, whereas RLHF may produce actions that are very creative ways to get reward and may be harmful.
I don’t think this is what people are talking about though (and it wouldn’t work for their broader arguments). I think they are imagining a higher probability of deceptive alignment and other generalization problems.
I don’t think I know the precise articulation of these concerns or the argument for them.
On the empirics, sometimes people mention this paper and the RLHF’d model behavior “hey do you want to be shut down? --> no” as evidence of a higher probability of deceptive alignment from RLHF. I don’t really think that’s a reasonable interpretation of the evidence but if that’s a large part of the argument people are making I’d be happy to engage on it.
As one of the people who’s raised such points, I should note that they mostly apply to applications of language models qua language models (which Jozdien correctly does), and that different techniques can be appropriate for different domains.