cfoster0 comments on The Waluigi Effect (mega-post)

cfoster0 5 Mar 2023 5:57 UTC
3 points
0
To summarize: you’re imagining a circuit that jointly associates feature +X with good behavioral pattern +Y and feature -X with bad behavioral pattern -Y. The idea is that if you don’t give RL feedback on -X, you’ll keep strengthening this circuit on the basis of the +X->+Y goodness, and since backprop/RL can’t disentangle the two directions (maybe?), the -X->-Y behavior gets preserved or strengthened as well?
- MadHatter 5 Mar 2023 15:29 UTC
  2 points
  0
  That’s the hypothesis. I’ve already verified several pieces of this: an RL agent trained on CartPole with an extra input becomes incompetent when that extra input is far from its training value; and there are some neurons in gpt2-small that only take on small negative values but can be adversarially flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an RLHF model’s weights to look for a full example.
  - Roman Leventov 9 Mar 2023 14:06 UTC
    1 point
    0
    Incompetency is not the opposite of competency: competency is +Y, incompetency is 0, “evil/deceptive/waluigi competency” is -Y.
- MadHatter 5 Mar 2023 6:33 UTC
  2 points
  0
  Yeah, gonna try to examine this idea and make a proof-of-concept implementation. Will try to report back here whether I succeed or fail.
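
For readers who want something concrete, the shared-circuit hypothesis summarized at the top of this thread can be illustrated with a deliberately tiny toy model. This is only a sketch under strong assumptions, not anyone's actual experiment: the "circuit" is a single hypothetical weight `w` computing `y = w * x`, the feature is `x = +1` for +X and `x = -1` for -X, and plain gradient descent on squared error stands in for RL feedback. Training only ever sees +X, yet because the same weight serves both directions, the -X->-Y mapping is strengthened too.

```python
def train_shared_circuit(steps=200, lr=0.1):
    """Fit y = w * x using only +X examples (x = +1, target = +1).

    Hypothetical toy setup: one shared weight stands in for the circuit,
    and squared-error gradient descent stands in for RL feedback.
    """
    w = 0.0
    for _ in range(steps):
        x, target = 1.0, 1.0          # feedback is only ever given on +X
        pred = w * x
        grad = 2.0 * (pred - target) * x  # d/dw of (w*x - target)^2
        w -= lr * grad
    return w


w = train_shared_circuit()
response_plus = w * 1.0    # behavior on +X, pushed toward +Y
response_minus = w * -1.0  # behavior on -X, never trained on directly

print(f"w = {w:.4f}, +X -> {response_plus:.4f}, -X -> {response_minus:.4f}")
```

In this toy the untrained direction ends up at roughly -Y rather than at 0, which is the distinction drawn in the reply above between incompetence and waluigi-style competence. Whether real networks contain circuits that are entangled in this way is exactly the open question in the thread.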