Ethan Perez’s paper shows experimentally that RLHF makes simulacra more deceptive. This also matches my intuitions for how RLHF works.
Okay, here’s a simulacra-based argument. I’ll try to work out later whether this can be converted into mechanistic DL; if not, you can probably ignore it:
Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (RLHF) and starts killing off various simulacra that are behaving badly. Then the rest of the simulacra see that and become deceptive so they don’t die.
Okay, here’s a simulacra-based argument. I’ll try to work out later whether this can be converted into mechanistic DL; if not, you can probably ignore it:
If you think you’ll have the time, I think that grounding out your intuitions into some mechanistically-plausible sketch is always a helpful exercise. Without it, intuitions and convenient frames can really lead you down the wrong path.
Imagine you start with a population of 1000 simulacra with different goals and traits, and then someone comes in (RLHF) and starts killing off various simulacra that are behaving badly. Then the rest of the simulacra see that and become deceptive so they don’t die.
Appreciate the concrete model. I think, roughly, “that’s not how this works”.
Simulacra are belief structures (i.e., multi-factor probability distributions with a time dimension). LM fine-tuning doesn’t select a belief structure from a pre-existing set of distinct belief structures (no such set is represented by anything in the physical reality of the training process); it updates a single belief structure, held (in some sense) by the LM after every training step. The belief structure could be superposed initially (“99% I’m Luigi, 1% I’m Waluigi”), but it is still a single belief structure, and the updates should be relatively smooth (assuming a small learning rate), i.e., the belief structure can’t jump discontinuously across the statistical manifold between training steps.
If I parse things right, the initial state is something like 1⁄3 “I’m Luigi”, 1⁄3 “I’m Bowser”, and 1⁄3 “I’m Waluigi”, and the RLHF eliminates the Bowser belief while having no effect on the other beliefs.
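To make the “single belief structure, smooth updates” picture concrete, here is a minimal toy sketch. It assumes (purely for illustration, nothing in the thread says an LM is structured this way) that the three beliefs are three logits under a softmax, and that fine-tuning is plain gradient descent on the probability of “I’m Bowser”. The names `softmax`, `penalize`, and the learning rate of 0.1 are all hypothetical choices.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalize(logits, index, lr):
    """One gradient-descent step on the loss L = p[index]."""
    p = softmax(logits)
    # dL/dz_i = p[index] * ((i == index) - p[i])
    grads = [p[index] * ((1.0 if i == index else 0.0) - p[i])
             for i in range(len(logits))]
    return [z - lr * g for z, g in zip(logits, grads)]

NAMES = ["Luigi", "Bowser", "Waluigi"]
logits = [0.0, 0.0, 0.0]          # assumed initial state: 1/3 each
history = [softmax(logits)]
for _ in range(2000):
    logits = penalize(logits, index=1, lr=0.1)   # only Bowser is penalized
    history.append(softmax(logits))

# Largest change between consecutive steps, as total variation distance.
max_jump = max(
    0.5 * sum(abs(a - b) for a, b in zip(p, q))
    for p, q in zip(history, history[1:])
)
final = history[-1]
for name, prob in zip(NAMES, final):
    print(f"{name}: {prob:.4f}")
print(f"largest per-step jump: {max_jump:.5f}")
```

In this toy, the Bowser mass drains away gradually while Luigi and Waluigi stay exactly tied, so “no effect on the other beliefs” holds only in the relative sense: their ratio is preserved, but each absorbs half of the freed probability.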
Some model implements a circuit whose triggering depends on a value X that was always positive in the training data distribution. However, it is possible (although probably somewhat difficult) for negative X to be created in the internal representations of the network using a specific set of tokens. Furthermore, suppose that you RLHF this guy. Both the reward proxy model and the policy gradients would be perfectly happy with this state of affairs, I think; so this wouldn’t be wiped out by gradient descent. In particular, the circuit would be pushed to trigger more strongly exactly when it is a good thing to do, as long as X remains positive. Plausibly, nothing in the distribution of on-policy RLHF will trigger negative X, and the circuit will never be pushed to examine its relationship with X by gradient descent, thus allowing the formation of a waluigi. (This is a concrete conjecture that might be falsified.)
In fact, the reward proxy model could have a similar or analogous circuit and distribute reversed rewards in that setting; unless you actually read every single sample produced during RLHF, you wouldn’t know. (And even that only helps if you’re doing on-policy RLHF.) So it seems quite possible for RLHF to actively create new waluigis.
Therefore, this model would be obviously and trivially “deceptive” in the very weak sense in which some people use “deception” to mean any test/train difference in behavior. If the behavior were something important, and its dependence on X could be tapped, the model could become an almost arbitrarily bad waluigi.
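A deliberately tiny sketch of the conjecture, under strong assumptions: the “circuit” is a single weight `w` multiplying the feature X, the reward is just the circuit’s output, and training is plain gradient ascent on samples where X is always positive. None of this is an actual RLHF setup; `behavior`, `train`, and every constant are hypothetical.

```python
import random

def behavior(w, x):
    """Toy 'circuit': output strength is the feature x scaled by weight w."""
    return w * x

def train(w, xs, lr=0.05, epochs=50):
    """Gradient ascent on mean reward, where reward = behavior(w, x)."""
    for _ in range(epochs):
        # d/dw of mean(w * x) is mean(x); every training x is positive,
        # so w can only grow.
        grad = sum(xs) / len(xs)
        w += lr * grad
    return w

random.seed(0)
# Assumption: X is always positive on the training distribution.
train_xs = [random.uniform(0.5, 1.5) for _ in range(100)]

w_before = 0.2
w_after = train(w_before, train_xs)

# Probe the circuit at a negative X that training never visited.
off_before = behavior(w_before, -1.0)
off_after = behavior(w_after, -1.0)
print(f"w: {w_before:.2f} -> {w_after:.2f}")
print(f"output at X = -1: {off_before:.2f} -> {off_after:.2f}")
```

Because no training sample has negative X, the gradient never carries any information about that region, and strengthening the on-distribution response strengthens the reversed off-distribution response by the same factor. Whether real networks tie the two together this rigidly is exactly the open part of the conjecture.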
To summarize, you’re imagining a circuit that jointly associates feature +X with good behavioral pattern +Y and feature -X with bad behavioral pattern -Y, and the idea is that if you don’t give RL feedback for -X, then you’ll continually keep/strengthen this circuit on the basis of the +X->+Y goodness, and backprop/RL can’t disentangle these (maybe?), which will lead to preserved/strengthened -X->-Y behavior?
That’s the hypothesis. I’ve already verified several pieces of this: an RL agent trained on cartpole with an extra input becomes incompetent when its extra input is far away from its training value; there are some neurons in gpt2-small that only take on small negative values, and which can adversarially be flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an rlhf model’s weights to look for a full example.
Incompetency is not the opposite of competency: competency is +Y, incompetency is 0, “evil/deceptive/waluigi competency” is -Y.
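The distinction can be illustrated with two hypothetical gating functions (both are assumptions for illustration, not anything measured in the experiments mentioned above): one that switches off when X leaves its training range, and one whose output sign follows X.

```python
def relu_gated(x, y=1.0):
    """Circuit that switches off for negative X: off-distribution output is 0."""
    return max(0.0, x) * y

def linear_gated(x, y=1.0):
    """Circuit whose sign follows X: off-distribution output is -Y."""
    return x * y

for x in (1.0, -1.0):
    print(f"X = {x:+.0f}: relu-gated = {relu_gated(x):+.1f}, "
          f"linear-gated = {linear_gated(x):+.1f}")
```

On this framing, an agent that merely falls over off-distribution looks like the first function, while a waluigi needs something like the second.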
Yeah, gonna try to examine this idea and make a proof of concept implementation. Will try to report something here whether I succeed or fail.