IMO this section is by far the weakest argued.
It’s previously been claimed that RLHF “breaks” the simulator nature of LLMs. If your hypothesis is that the “Waluigi effect” is produced because the model is behaving completely as a simulator, maintaining Luigi-Waluigi antipodal uncertainty in accordance with the narrative tropes it has encountered in the training distribution, then making the model no longer behave as this kind of simulator is required to stop it, no?
I don’t really know what to make of Evidence (1). Like, I don’t understand your mental model of how the RLHF training done on ChatGPT/Bing Chat works, under which “They will still perform their work diligently because they know you are watching.” would really be true of the hidden Waluigi simulacra within the model. Evidence (2) talks about how both increases in model size and increases in the amount of RLHF training lead to models increasingly making certain worrying statements. But if the popular LW speculation is true, that Bing Chat is a bigger/more capable model and one that was trained with less/no RLHF, then there is no “making worse” phenomenon to be explained via RLHF weirdness. If anything, if that speculation is true, the reason Bing Chat falls into the “Waluigi effect” would possibly be because it is more of a pure, competent simulator, not less of one. Evidence (3) doesn’t help: the underlying models the mode-collapse post documented were not trained with RLHF.
> If this Semiotic–Simulation Theory is correct, then RLHF is an irreparably inadequate solution to the AI alignment problem, and RLHF is probably increasing the likelihood of a misalignment catastrophe.
Say what? IMO this conclusion is extreme and not supported by the evidence & arguments presented in the post. I’m confused how you reached a belief anywhere near this strong based on the observations you document here, except by having some prior that already weighs heavily in its favor.
> making the model no longer behave as this kind of simulator
I think the crux is that I don’t think RLHF makes the model no longer behave as this kind of simulator. Are there deceptive simulacra which get good feedback during RLHF but nonetheless would be dangerous to have in your model? Almost definitely.
> Are there deceptive simulacra which get good feedback during RLHF but nonetheless would be dangerous to have in your model? Almost definitely.
For RLHF to make the problem worse, it isn’t sufficient that deceptive simulacra would get good feedback. Simulacra following a policy like “pretend to be Luigi-like but then defect and rant about toaster ovens” would also get good feedback. Why don’t we worry about those simulacra? Because they probably never appeared during RL finetuning / never caused text outputs that distinguished their behavior from regular Luigi behavior (unless your claim is that this behavior occurred during RL finetuning and the overseers just didn’t notice), so they never got differential feedback gradients, so they never got strengthened relative to normal Luigi simulacra. Simulacra that don’t get invoked during RL finetuning do not benefit from the counterfactual good feedback they would have received. You need an actual causal path by which these deceptive simulacra get differentially strengthened during RLHF. What is that causal path?
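(A toy sketch of the “no differential feedback” point, purely illustrative: the three named behaviors, the reward table, and the REINFORCE setup below are invented, and real simulacra are not discrete options like this. It is only meant to show that a behavior indistinguishable from Luigi on everything actually sampled gets no gradient signal that Luigi doesn’t also get.)

```python
import torch

# Toy caricature, illustrative only: REINFORCE over three canned "simulacra".
# The overseer's reward cannot distinguish the sleeper from Luigi, so the sleeper
# is rewarded whenever it happens to be sampled -- but it never receives a
# gradient that Luigi doesn't also receive, and the defection it *would* produce
# off-distribution earns it nothing, because it is never sampled.
behaviors = ["luigi", "sleeper (indistinguishable from luigi during finetuning)", "waluigi"]
rewards = [1.0, 1.0, -1.0]                          # overseer feedback per sampled behavior

def finetune(seed, steps=300, lr=0.1):
    torch.manual_seed(seed)
    logits = torch.zeros(3, requires_grad=True)     # uniform mixture to start
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        dist = torch.distributions.Categorical(logits=logits)
        i = dist.sample()
        loss = -rewards[i.item()] * dist.log_prob(i)  # gradient flows only via the sampled behavior
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.softmax(logits, dim=0).detach()

mean_probs = torch.stack([finetune(s) for s in range(100)]).mean(dim=0)
for b, p in zip(behaviors, mean_probs):
    print(f"{p.item():.2f}  {b}")
# Waluigi, the only behavior that ever produces distinguishably bad outputs, gets
# suppressed; averaged over runs, the sleeper ends up about as strong as Luigi,
# with any per-run gap being sampling noise rather than differential feedback.
```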
Ethan Perez’s paper shows experimentally that RLHF makes simulacra more deceptive. This also matches my intuitions for how RLHF works.
Okay, here’s a simulacra-based argument. I’ll try to work out later if this can be converted into mechanistic DL, and if not then you can probably ignore it:
Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (RLHF) and starts killing off various simulacra which are behaving badly. Then the rest of the simulacra see that and become deceptive so they don’t die.
> Okay, here’s a simulacra-based argument. I’ll try to work out later if this can be converted into mechanistic DL, and if not then you can probably ignore it:
If you think you’ll have the time, I think that grounding out your intuitions into some mechanistically-plausible sketch is always a helpful exercise. Without it, intuitions and convenient frames can really lead you down the wrong path.
> Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (RLHF) and starts killing off various simulacra which are behaving badly. Then the rest of the simulacra see that and become deceptive so they don’t die.
Appreciate the concrete model. I think, roughly, “that’s not how this works”.
Simulacra are belief structures (i.e., multi-factor probability distributions with a time dimension). LM fine-tuning doesn’t select belief structures from a pre-existing set of distinct belief structures (there is no such set represented by anything in the physical reality of the training process); it updates a singular belief structure, held (in some sense) by the LM after every training step. The belief structure could be superposed initially (“99% I’m Luigi, 1% I’m Waluigi”), but it is still a singular belief structure, and the updates should be relatively smooth (assuming a small learning rate), i.e., the belief structure can’t move between training steps in clearly discontinuous jumps on the statistical manifold.
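(As a tiny illustration of the “smooth updates” point, purely schematic: a single binary Luigi/Waluigi logit stands in for the belief structure, with an arbitrary learning rate, saying nothing about real LM internals.)

```python
import torch

# Purely schematic: one logit parameterizes P("I'm Waluigi") inside a superposed
# belief state, and small fine-tuning-style gradient steps favor the
# Luigi-consistent belief.  The probability drifts smoothly; with a small
# learning rate there are no discontinuous jumps on the statistical manifold.
logit = torch.tensor(0.0, requires_grad=True)   # start at a 50/50 superposition
opt = torch.optim.SGD([logit], lr=0.3)

for step in range(8):
    p_waluigi = torch.sigmoid(logit)
    loss = -torch.log(1.0 - p_waluigi)          # "reward" the Luigi-consistent belief
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: P(Waluigi) = {p_waluigi.item():.3f}")
```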
If I parse things right, the initial state is something like 1/3 “I’m Luigi”, 1/3 “I’m Bowser”, and 1/3 “I’m Waluigi”, and RLHF eliminates the Bowser belief while having no effect on the other beliefs.
Suppose some model implements a circuit whose triggering depends on a value X that was always positive in the training data distribution. However, it is possible (although probably somewhat difficult) for negative X to be created in the internal representations of the network using a specific set of tokens. Furthermore, suppose that you RLHF this guy. Both the reward proxy model and the policy gradients would be perfectly happy with this state of affairs, I think, so the circuit wouldn’t be wiped out by gradient descent. In particular, the circuit would be pushed to trigger more strongly exactly when triggering is a good thing to do, as long as X remains positive. Plausibly, nothing in the distribution of on-policy RLHF will produce negative X, and the circuit will never be pushed by gradient descent to examine its relationship with X, thus allowing the formation of a waluigi. (This is a concrete conjecture that might be falsified.)
In fact, the reward proxy model could have a similar or analogous circuit and distribute reversed rewards in that regime; unless you actually read every single sample produced during RLHF, you wouldn’t know. (And even that only helps if you’re doing on-policy RLHF.) So it’s entirely possible for RLHF to actively create new waluigis.
Therefore, this model would be obviously and trivially “deceptive” in the very weak sense in which some people use deception to mean any train/test difference in behavior. If the behavior were something important, and its dependence on X could be tapped, the model could become an almost arbitrarily bad waluigi.
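(A minimal caricature of this hypothesis, for concreteness: the single weight w and the scalar x standing in for X below are invented for illustration, not an actual circuit in any model.)

```python
import torch

# Behaviour is y = w * x, training only ever sees x > 0, and triggering harder
# is always rewarded, so gradient ascent keeps growing w.  Nothing in training
# ever visits x < 0, so the behaviour there is never corrected -- and it gets
# *worse* as w grows.
w = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.01)

for _ in range(200):
    x = torch.rand(()) + 0.1          # on-policy data: x stays positive
    reward = w * x                    # circuit output; stronger triggering is rewarded
    (-reward).backward()              # gradient ascent on reward
    opt.step(); opt.zero_grad()

print(f"learned w = {w.item():.2f}")
print(f"x = +1 (seen in training):       behaviour {w.item() * +1.0:+.2f}")
print(f"x = -1 (never seen in training): behaviour {w.item() * -1.0:+.2f}")
```

In a real network, X would be some internal feature and the circuit far less legible, but the gradient story is the same: on-policy data that never drives X negative gives the optimizer no reason to repair, and some reason to amplify, the behavior in that regime.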
To summarize, you’re imagining a circuit that jointly associates feature +X with good behavioral pattern +Y and feature -X with bad behavioral pattern -Y, and the idea is that if you don’t give RL feedback for -X, then you’ll continually keep/strengthen this circuit on the basis of the +X->+Y goodness, and backprop/RL can’t disentangle these (maybe?), which will lead to preserved/strengthened -X->-Y behavior?
That’s the hypothesis. I’ve already verified several pieces of this: an RL agent trained on CartPole with an extra input becomes incompetent when that extra input is far away from its training value, and there are some neurons in gpt2-small that only take on small negative values and which can adversarially be flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an RLHF model’s weights to look for a full example.
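(For concreteness, here is a rough sketch of the first kind of check described above; it is a reconstruction under assumptions, not the commenter’s actual code, and it assumes the gymnasium package plus a plain REINFORCE agent with untuned hyperparameters.)

```python
import gymnasium as gym
import torch
import torch.nn as nn

# Train a small REINFORCE agent on CartPole whose observation is padded with one
# extra channel held constant during training, then evaluate with that channel
# shifted away from its training value.
EXTRA_TRAIN_VALUE = 1.0

policy = nn.Sequential(nn.Linear(5, 64), nn.Tanh(), nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env, extra):
    obs, _ = env.reset()
    log_probs, total = [], 0.0
    done = False
    while not done:
        x = torch.tensor([*obs, extra], dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(x))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        obs, r, terminated, truncated, _ = env.step(a.item())
        total += float(r)
        done = terminated or truncated
    return total, log_probs

env = gym.make("CartPole-v1")
baseline = 0.0
for episode in range(800):
    ret, log_probs = run_episode(env, EXTRA_TRAIN_VALUE)
    baseline = 0.95 * baseline + 0.05 * ret          # crude running baseline
    loss = -((ret - baseline) * torch.stack(log_probs)).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Evaluate with the extra channel at, and far from, its training value.
with torch.no_grad():
    for extra in [EXTRA_TRAIN_VALUE, 0.0, -5.0, 20.0]:
        rets = [run_episode(env, extra)[0] for _ in range(10)]
        print(f"extra input {extra:+6.1f}: mean return {sum(rets) / len(rets):6.1f}")
```

If the hypothesis holds, the expected pattern is decent returns at the training value and degraded returns as the extra input moves away from it.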
This is fun stuff.
Incompetency is not the opposite of competency: competency is +Y, incompetency is 0, “evil/deceptive/waluigi competency” is -Y.
Yeah, gonna try to examine this idea and make a proof of concept implementation. Will try to report something here whether I succeed or fail.