Some model implements a circuit whose triggering depends on a value X that was always positive in the training data distribution. However, it is possible (although probably somewhat difficult) for negative X to be created in the internal representations of the network using a specific set of tokens. Furthermore, suppose that you RLHF this guy. Both the reward proxy model and the policy gradients would be perfectly happy with this state of affairs, I think; so this wouldn’t be wiped out by gradient descent. In particular, the circuit would be pushed to trigger more strongly exactly when it is a good thing to do, as long as X remains positive. Plausibly, nothing in the distribution of on-policy RLHF will trigger negative X, and the circuit will never be pushed to examine its relationship with X by gradient descent, thus allowing the formation of a waluigi. (This is a concrete conjecture that might be falsified.)
In fact, the reward proxy model could have a similar or analogous circuit and distribute reversed rewards in that setting; unless you actually read every single sample produced during RLHF, you wouldn’t know. (And even that only helps if you’re doing on-policy RLHF.) So it seems quite possible for RLHF to actively create new waluigis.
Therefore, this model would be obviously and trivially “deceptive” in the very weak sense in which some people use “deception” to mean any test/train difference in behavior. If the behavior were something important, and its dependence on X could be tapped, the model could become an almost arbitrarily bad waluigi.
To summarize, you’re imagining a circuit that jointly associates feature +X with good behavioral pattern +Y and feature -X with bad behavioral pattern -Y, and the idea is that if you don’t give RL feedback for -X, then you’ll continually keep/strengthen this circuit on the basis of the +X->+Y goodness, and backprop/RL can’t disentangle these (maybe?), which will lead to preserved/strengthened -X->-Y behavior?
That’s the hypothesis. I’ve already verified several pieces of this: an RL agent trained on CartPole with an extra input becomes incompetent when its extra input is far away from its training value; there are some neurons in gpt2-small that only take on small negative values, and which can adversarially be flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an RLHF model’s weights to look for a full example.
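To make the +X->+Y / -X->-Y coupling concrete, here is a minimal toy sketch, not an RLHF run and not a claim about any real model. It assumes the "circuit" is a single weight `w` with output `y = w * x`, and it uses squared error toward a target of +1 as a stand-in for reward; the function name, learning rate, and sampling range are all made up for illustration. Training only ever samples positive `x`, yet the response at `x = -1` gets more negative as `w` grows.

```python
import random


def train_tied_circuit(w_init=0.1, steps=2000, lr=0.05, seed=0):
    """Toy stand-in for the hypothesis: fit y = w * x toward +1 using only x > 0."""
    rng = random.Random(seed)
    w = w_init
    for _ in range(steps):
        x = rng.uniform(0.5, 1.5)    # X is always positive during "training"
        y = w * x                    # one weight ties the +X and -X responses together
        grad = 2.0 * (y - 1.0) * x   # derivative of (y - 1)^2 with respect to w
        w -= lr * grad               # plain SGD step
    return w


w_init = 0.1
w = train_tied_circuit(w_init=w_init)

# No negative x was ever sampled, but the -X response changed anyway.
print("response at x=-1 before training:", w_init * -1.0)
print("response at x=+1 after training: ", w * 1.0)
print("response at x=-1 after training: ", w * -1.0)
```

In this setup the weight grows well above its initial 0.1, so the never-trained response at `x = -1` ends up much more negative than it started. Whether real networks tie the two signs together this tightly is exactly the open part of the conjecture; a parametrization with separate weights for positive and negative `x` would simply leave the negative branch at its initial value.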
Incompetency is not the opposite of competency: competency is +Y, incompetency is 0, “evil/deceptive/waluigi competency” is -Y.
Yeah, gonna try to examine this idea and make a proof of concept implementation. Will try to report something here whether I succeed or fail.