MadHatter comments on The Waluigi Effect (mega-post)

MadHatter 5 Mar 2023 15:29 UTC
2 points
0
That’s the hypothesis. I’ve already verified several pieces of this: an RL agent trained on CartPole with an extra input becomes incompetent when that input is far from its training value, and there are some neurons in GPT-2 small that only take on small negative values but can be adversarially flipped to positive values with the right prompt. So I think an end-to-end waluigi of this form is potentially realistic; the hard part is getting my hands on an RLHF model’s weights to look for a full example.
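The neuron observation can be illustrated with a toy sketch. It assumes the neurons in question are post-GELU MLP activations (GELU is the MLP nonlinearity in GPT-2), and it uses the exact GELU rather than the tanh approximation for simplicity. The pre-activation values are hypothetical, chosen only to show the shape of the effect; they are not measurements from the model.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical pre-activations of one neuron on ordinary prompts (assumption):
# the neuron's input stays negative, so its output sits in GELU's small
# negative band (GELU is bounded below by roughly -0.17).
typical_preacts = [-4.0, -2.5, -1.0, -0.3]

# Hypothetical pre-activation reached under an adversarial prompt (assumption).
adversarial_preact = 1.5

typical_acts = [gelu(x) for x in typical_preacts]
adversarial_act = gelu(adversarial_preact)

print("typical activations:", [round(a, 4) for a in typical_acts])
print("adversarial activation:", round(adversarial_act, 4))
```

On this reading, anything downstream that only ever saw the neuron's small negative outputs during training has no guarantee of sensible behavior once the sign flips.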
- Roman Leventov 9 Mar 2023 14:06 UTC
  1 point
  0
  Incompetency is not the opposite of competency: competency is +Y, incompetency is 0, “evil/deceptive/waluigi competency” is -Y.