One consequence downstream of this that seems important to me in the limit:
1. Non-conditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you.
2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you.
I think having that one extra layer of buffer provided by (2) is actually very valuable. A goal-agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.
Just a note here: I would not interpret fine-tuned GPTs as still “predicting” tokens. A base model predicts tokens by computing a probability distribution conditional on the prompt, but for a fine-tuned model that distribution no longer represents probabilities; it represents some “goodness” relative to the fine-tuning, i.e. how good the continuation is. Tokens with higher scores are then not necessarily more probable continuations of the prompt (though next-token probability may also play a role), but overall “better” in some opaque way. We hope that what the model thinks is a better continuation of the prompt corresponds to the goals of being helpful, harmless, and honest (to use the Anthropic terminology), but whether the model has really learned those goals, or merely something which looks similar, is ultimately unknown.
So RLHF (and equally supervised fine-tuning) also leads to a lack of interpretability. It is unknown what exactly an instruction model like ChatGPT or text-davinci-003 optimizes for. In contrast, we know pretty much exactly what a base model was optimized for: next-token prediction.
You know exactly what both models are optimized for: log loss on the one hand, an unbiased estimator of reward on the other.
You don’t know what either model is optimizing: how would you? In both cases you could guess that they may be optimizing something similar to what they are optimized for.
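For concreteness, a minimal sketch of that contrast in PyTorch with toy tensors (every name below, including reward_model, is a stand-in rather than any particular training stack): the base objective is plain log loss against observed next tokens, while an RLHF-style objective pushes probability mass toward whatever a learned reward model scores highly, typically with a KL penalty back toward a reference model.

```python
import torch
import torch.nn.functional as F

vocab, batch = 50, 4
logits = torch.randn(batch, vocab, requires_grad=True)   # model being trained
ref_logits = torch.randn(batch, vocab)                    # frozen reference model
target_ids = torch.randint(0, vocab, (batch,))            # observed next tokens

# (1) Base model: log loss on next-token prediction. The minimizer is the
#     calibrated conditional distribution of the training data.
pretraining_loss = F.cross_entropy(logits, target_ids)

# (2) RLHF-style fine-tuning: a REINFORCE-like surrogate that raises the
#     log-probability of samples the (stand-in) reward model scores highly,
#     plus a KL penalty toward the reference model. The optimum is whatever
#     the reward model likes, not the calibrated distribution.
def reward_model(token_ids: torch.Tensor) -> torch.Tensor:
    return torch.randn(token_ids.shape[0])                # placeholder preference score

sampled = torch.distributions.Categorical(logits=logits).sample()
policy_logp = F.log_softmax(logits, dim=-1)
ref_logp = F.log_softmax(ref_logits, dim=-1)
chosen_logp = policy_logp.gather(1, sampled.unsqueeze(1)).squeeze(1)
kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1).mean()  # KL(policy || ref)
rlhf_loss = -(reward_model(sampled) * chosen_logp).mean() + 0.1 * kl
```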
This relates to what you wrote in the other thread:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I think the difference is that a base language model is trained on vast amounts of text, so it seems reasonable that it is actually quite good at next-token prediction, while the fine-tuning is apparently done with comparatively tiny amounts of preference data. So misalignment seems much more likely in the latter case.
Moreover, human RLHF raters are probably biased in various ways, which encourages the model to reproduce those biases, even if the model doesn’t “believe them” in some sense. For example, some scientists have pointed out that ChatGPT gives politically correct but wrong answers to certain politically taboo but factual questions. (I can go into more detail if required.) Whether the model is honest here and in fact “believes” those things, or whether it is deceptive and just reproduces rater bias rather than being honest, is unknown.
So learning to predict webtext from large amounts of training data, and learning some kind of well-aligned utility function from a small number of (biased) human raters, seem like problems of highly uneven difficulty and probability of misalignment.
Agreed, though I do find framing them as warped predictors helpful in some cases. In principle, the deviation from the original unbiased prediction over all inputs should include within it all agentic behaviors, and there might exist some way that you could extract goals from that bias vector. (I don’t have anything super concrete here and I’m not super optimistic that this framing gives you anything extra compared to other interpretability mechanisms, but it’s something I’ve thought about poking.)
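If it helps to picture that framing: for a single context, the “bias vector” is just the difference between the fine-tuned model’s output distribution and the base model’s calibrated one. A toy sketch (PyTorch; both logit tensors are made-up stand-ins):

```python
import torch
import torch.nn.functional as F

# Stand-in outputs for the same prompt from a base model and its fine-tuned variant.
base_logits = torch.randn(1, 50)
tuned_logits = torch.randn(1, 50)

base_p = F.softmax(base_logits, dim=-1)
tuned_p = F.softmax(tuned_logits, dim=-1)

# Per-context deviation from the (assumed) calibrated prediction; aggregating this
# over many contexts is where any learned goal-directedness would have to live.
bias_vector = tuned_p - base_p

# One scalar summary of how warped the fine-tuned prediction is: KL(tuned || base).
warp = (tuned_p * (tuned_p.log() - base_p.log())).sum()
```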
What do you mean when you say the model is or is not “fighting you”?
I mean a model “fights” you if the model itself has goals and those goals are at odds with yours. In this context, a model cannot “fight” you if it does not have goals. It can still output things which are bad for you, like an agentic simulacrum that does fight you.
I suspect effective interventions are easier to find when dealing with a goal agnostic model simulating a potentially dangerous agent, compared to a goal-oriented model that is the potentially dangerous agent.
In both cases the model produces actions that are expected to have certain kinds of effects. Could you spell out what kind of “fighting” happens, or what kind of “intervention” is possible when you are merely conditioning your model and not fine-tuning it?
I haven’t engaged much with this kind of thinking on LW or the broader safety community, but right now I don’t really get it and it feels like anthropomorphizing or magical thinking.
I’ll start with a pretty uncontroversial example that’s neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine-tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine-tuning.
Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.
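A toy sketch of that asymmetry (PyTorch; the probe, activations, and logits below are all hypothetical stand-ins): in the first setup the probe’s reading itself receives gradient pressure, while in the second the only signal is matching the teacher’s output distribution, and nothing touches the student’s internals.

```python
import torch
import torch.nn.functional as F

# Setup 1: RL fine-tuning where a frozen activation probe supplies the reward.
hidden = torch.randn(8, 64, requires_grad=True)   # stand-in for model 1's activations
probe = torch.nn.Linear(64, 1)                    # stand-in "truthfulness" probe
probe.requires_grad_(False)

probe_reward = probe(hidden).mean()
(-probe_reward).backward()   # gradients flow into whatever produces the activations,
                             # directly pressuring the probe's reading upward

# Setup 2: distillation on model 1's final outputs only. The student never sees
# the probe; the objective is purely output matching.
teacher_logits = torch.randn(8, 50)
student_logits = torch.randn(8, 50, requires_grad=True)
distill_loss = F.kl_div(F.log_softmax(student_logits, -1),
                        F.log_softmax(teacher_logits, -1),
                        log_target=True, reduction="batchmean")
distill_loss.backward()
```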
With the way I was using the word “fighting”, I would say that the first model is fighting you (a little bit), and the second one isn’t. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.
Next, an impractical and extreme example, again with two models:
1. LLM fine-tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. “LLM” operating at an intractably low level of simulation, closer to physics, without fine-tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first model.
To be clear, I don’t think operating “closer to physics” here is helpful for safety, but I think it’s easier to say that the deception in the second model’s outputs is not arising from attractors defined in the model directly. The second model is simulating processes which yield deception, but it does not directly “care” about deception any more than the real laws of physics do when a real sociopath is being deceptive.
The first model is more like the sociopath itself, because the deceptive processes are a part of what the model was optimized to compute (in this hypothetical).
Directly querying the first model gets you lies; querying the second model gets you an authentic piece of the process that yields lies.
The next obvious question here is: how does caring about some intermediate state in the computation that yields deception help? Why is this any better than looking at some intermediate layer’s activations in the first deceptive model?
My answer would be: myopic action through calibration.
The final output of the non-conditioned, fine-tuned deceptive model no longer represents a probability distribution over the ground truth. The “probabilities” it outputs include a bias which, here, encodes goals corresponding to deception. Any intermediate layer in the model could be arbitrarily warped to serve those goals.
The second model (the almost-physics simulator) does not have those degrees of freedom. Goal-oriented behavior diverging from the training distribution must show up as some form of bias in prediction, and that bias is directly counteracted by the training objective. (Raw GPT-esque models consider the future in their predictions in various ways, but the actual prediction remains myopic: the model is not incentivized to pick the “wrong” token now to serve a later prediction, or to otherwise smuggle in distribution-breaking bias.)
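A toy numeric illustration of the “bias is counteracted by the objective” point, with made-up distributions: log loss is a proper scoring rule, so any prediction that deviates from the true conditional distribution pays for the deviation in expected loss.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.7, 0.2, 0.1])      # the actual next-token distribution
biased_q = np.array([0.5, 0.2, 0.3])    # a prediction warped toward some goal

tokens = rng.choice(3, size=200_000, p=true_p)   # data drawn from the true process

def avg_log_loss(pred):
    return -np.log(pred[tokens]).mean()

print(avg_log_loss(true_p))    # ~0.80 nats, the minimum achievable (the entropy of true_p)
print(avg_log_loss(biased_q))  # strictly higher: the warp shows up as extra training loss
```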
In other words, in the absence of strong gradient hacking or similarly exotic phenomena, the pure token predictor model remains goal agnostic. There is no direct or indirect training pressure to obfuscate its internal process, and its outputs will not be “intentionally” deceptive; they’re just authentic attempts at probability distributions.
Similar to the original example’s second model (which was also a pure predictor), I would expect that such a model is not adversarial and is easier to interpret.
With that in mind, we can pull the predictor up to a more practical level of execution, similar to that of the other deceptive model (instead of ‘almost physics’), and all of the same properties still hold.
Conditioning models for helpfulness, e.g. with decision transformers, doesn’t change the training objective either. The conditions are just more input tokens like any others, so the same properties should hold again.
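A decision-transformer-style setup might look roughly like this (toy PyTorch; the tiny `model` and the COND_HELPFUL token id are hypothetical stand-ins for a real transformer and a real conditioning scheme): a condition token is prepended to the sequence and the loss is still ordinary next-token cross-entropy, so the pressure toward calibrated prediction is unchanged.

```python
import torch
import torch.nn.functional as F

vocab = 100
COND_HELPFUL = 0                                   # reserved condition-token id

model = torch.nn.Sequential(                       # stand-in for a real transformer
    torch.nn.Embedding(vocab, 32),
    torch.nn.Linear(32, vocab),
)

tokens = torch.randint(1, vocab, (4, 16))          # ordinary training sequences
conditioned = torch.cat(                           # same sequences, condition prepended
    [torch.full((4, 1), COND_HELPFUL, dtype=torch.long), tokens], dim=1
)

logits = model(conditioned[:, :-1])                # predict each next token
loss = F.cross_entropy(                            # the same log-loss objective as pretraining
    logits.reshape(-1, vocab),
    conditioned[:, 1:].reshape(-1),
)
```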
In another comment, you mention:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I agree with this. My concern is about forms of fine tuning that aren’t equivalent to well-calibrated predictions of human demonstrators, and about training mechanisms that take an indirect/exploit-prone route to something that looks like predictions of human demonstrators.
I don’t think the more general form of RLHF is inherently broken. I just suspect that fine tuning that preserves model-level goal agnosticism will produce less adversarial models.