I mean a model “fights” you if the model itself has goals and those goals are at odds with yours. In this context, a model cannot “fight” you if it does not have goals. It can still output things which are bad for you, like an agentic simulacrum that does fight you.
I suspect effective interventions are easier to find when dealing with a goal agnostic model simulating a potentially dangerous agent, compared to a goal-oriented model that is the potentially dangerous agent.
In both cases the model produces actions that are expected to have certain kinds of effects. Could you spell out what kind of “fighting” happens, or what kind of “intervention” is possible when you are merely conditioning your model and not fine-tuning it?
I haven’t engaged much with this kind of thinking on LW or the broader safety community, but right now I don’t really get it and it feels like anthropomorphizing or magical thinking.
I’ll start with a pretty uncontroversial example that’s neither RLHF nor conditioning but tries to point at a shared intuition; two different models:
1. LLM fine tuned with RL, where reward comes from some kind of activation-reading truth probes.
2. LLM that trains on the output of the first model to the point where it ~perfectly matches its final output, but does not undergo any additional fine tuning.
Despite having identical final outputs, I would expect the first model to have higher probe-reported truthiness because it was optimized against that metric.
With the way I was using the word “fighting”, I would say that the first model is fighting you (a little bit), and the second one isn’t. The first model itself has learned adversarial weights that directly interfere with efforts to understand it.
Next, an impractical and extreme example, again with two models:
1. LLM fine tuned with RLHF for apparent honesty, but (for the purposes of the hypothetical) it ended up deceptive somehow.
2. “LLM” operating at an intractably low level of simulation, closer to physics, without fine tuning, which was conditioned to output a sequence which maps to the exact same deceptive behavior as the first model.
To be clear, I don’t think operating “closer to physics” here is helpful for safety, but I think it’s easier to say that the deception in the second model’s outputs is not arising from attractors defined in the model directly. The second model is simulating processes which yield deception, but it does not directly “care” about deception any more than the real laws of physics do when a real sociopath is being deceptive.
The first model is more like the sociopath itself, because the deceptive processes are a part of what the model was optimized to compute (in this hypothetical).
Directly querying the first model gets you lies; querying the second model gets you an authentic piece of the process that yields lies.
The next obvious question here is: how does caring about some intermediate state in the computation that yields deception help? Why is this any better than looking at some intermediate layer’s activations in the first deceptive model?
My answer would be: myopic action through calibration.
The final output of the non-conditioned, fine-tuned deceptive model no longer represents a probability distribution over a ground truth. The “probabilities” it outputs include a bias which, here, encodes goals corresponding to deception. Any intermediate layer in the model could be arbitrarily warped to serve those goals.
The second model (the almost-physics simulator) does not have those degrees of freedom. Goal-oriented behavior diverging from the training distribution must show up as some form of bias in prediction, and that bias is directly counteracted by the training objective. (Raw GPT-esque models consider the future in their predictions in various ways, but the actual prediction remains myopic: the model is not incentivized to pick the “wrong” token now to serve a later prediction or to otherwise smuggle in distribution-breaking bias.)
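To make the "bias is counteracted by the training objective" point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: a three-token vocabulary, a made-up ground-truth distribution, and a hypothetical `bias_toward` helper standing in for a goal-serving distortion. It only shows that, under a log-loss objective, any departure from the calibrated distribution costs extra expected loss.

```python
import math

def cross_entropy(p, q):
    """Expected log loss (in nats) of predicting q when tokens are drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def bias_toward(p, index, strength):
    """Hypothetical 'goal-serving' distortion: mix p with a point mass on one token."""
    return [(1 - strength) * pi + (strength if i == index else 0.0)
            for i, pi in enumerate(p)]

# Assumed ground-truth next-token distribution (illustrative numbers only).
true_dist = [0.7, 0.2, 0.1]

calibrated = list(true_dist)                             # a perfectly calibrated predictor
biased = bias_toward(true_dist, index=2, strength=0.3)   # nudges mass toward token 2

loss_calibrated = cross_entropy(true_dist, calibrated)
loss_biased = cross_entropy(true_dist, biased)

# The gap is the KL divergence KL(true_dist || biased), which is never negative,
# so the gradient of the prediction loss always pushes back against the bias.
penalty = loss_biased - loss_calibrated
print(f"calibrated={loss_calibrated:.4f} biased={loss_biased:.4f} penalty={penalty:.4f}")
```

This is only the per-token picture. It says nothing about whether a real network could hide such a bias somewhere the loss rarely probes, which is where the gradient hacking caveat comes in.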
In other words, in the absence of strong gradient hacking or similarly exotic phenomena, the pure token predictor model remains goal agnostic. There is no direct or indirect training pressure to obfuscate its internal process, and its outputs will not be “intentionally” deceptive; they’re just authentic attempts at probability distributions.
Similar to the original example’s second model (which was also a pure predictor), I would expect that such a model is not adversarial and is easier to interpret.
With that in mind, we can pull the predictor up to a more practical level of execution, similar to that of the other deceptive model (instead of ‘almost physics’), and all of the same properties still hold.
Conditioning models for helpfulness, e.g. with decision transformers, doesn’t change the training objective either. It’s just a change of inputs (conditions) like any other tokens, so the same properties should hold again.
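As a rough sketch of what "just a change of inputs" means here: the loss function below is the same next-token log loss in both cases, and the only difference is a condition token prepended to the sequence. The `toy_predict` function, the `<helpful>` token, and the probabilities are all hypothetical stand-ins for a trained model. Actual decision transformers condition on things like return-to-go rather than a single tag.

```python
import math

VOCAB = ["<helpful>", "<unhelpful>", "yes", "no"]

def toy_predict(context):
    """Hypothetical stand-in for a trained predictor: context -> {token: probability}."""
    if "<helpful>" in context:
        return {"<helpful>": 0.05, "<unhelpful>": 0.05, "yes": 0.8, "no": 0.1}
    return {tok: 0.25 for tok in VOCAB}

def next_token_loss(predict, tokens):
    """Average negative log-likelihood of each token given its prefix."""
    total = 0.0
    for t in range(1, len(tokens)):
        probs = predict(tokens[:t])
        total += -math.log(probs[tokens[t]])
    return total / (len(tokens) - 1)

def with_condition(tokens, condition_token):
    """Conditioning is just extra input tokens; the objective is untouched."""
    return [condition_token] + list(tokens)

plain = ["yes", "yes"]
conditioned = with_condition(plain, "<helpful>")

# Same loss function in both calls; only the input sequence differs.
loss_plain = next_token_loss(toy_predict, plain)
loss_conditioned = next_token_loss(toy_predict, conditioned)
print(f"plain={loss_plain:.4f} conditioned={loss_conditioned:.4f}")
```

The condition changes which distribution the model is asked to predict, not what it is rewarded for, so the calibration argument from before carries over unchanged.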
In another comment, you mention:
I don’t know in what sense “predict human demonstrators” is missing an important safety property from “predict internet text,” and right now it feels to me like kind of magical thinking.
I agree with this. My concern is about forms of fine tuning that aren’t equivalent to well-calibrated predictions of human demonstrators, and about training mechanisms that take an indirect/exploit-prone route to something that looks like predictions of human demonstrators.
I don’t think the more general form of RLHF is inherently broken. I just suspect that fine tuning that preserves model-level goal agnosticism will produce less adversarial models.