Simulacra are belief structures (i.e., multi-factor probability distributions with a time dimension). LM fine-tuning doesn’t select a belief structure from a pre-existing set of distinct belief structures (no such set is represented by anything in the physical reality of the training process); it updates a single belief structure, held (in some sense) by the LM after every training step. The belief structure could initially be superposed (“99% I’m Luigi, 1% I’m Waluigi”), but it is still a single belief structure, and the updates should be relatively smooth (assuming a small learning rate), i.e., the belief structure couldn’t transform between training steps in clearly discontinuous jumps on the statistical manifold.
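As a toy sketch (not a claim about the LM’s actual internals): if we model the superposed belief as a categorical distribution over hypothetical personas, parameterized by logits and nudged by small cross-entropy gradient steps, the per-step change in the distribution stays small, which is the “no discontinuous jumps” point above. The persona names, target, and learning rate here are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy superposed belief over two hypothetical personas.
personas = ["Luigi", "Waluigi"]
logits = np.log(np.array([0.99, 0.01]))  # "99% I'm Luigi, 1% I'm Waluigi"

lr = 0.01                       # small learning rate (assumed)
target = np.array([1.0, 0.0])   # fine-tuning pushes toward pure Luigi

for step in range(5):
    p = softmax(logits)
    # Gradient of cross-entropy w.r.t. logits for a softmax output is (p - target).
    logits -= lr * (p - target)
    new_p = softmax(logits)
    # Per-step total-variation distance stays small: the belief moves smoothly.
    tv = 0.5 * np.abs(new_p - p).sum()
    print(f"step {step}: p={np.round(new_p, 4)}, per-step TV={tv:.6f}")
```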
If I parse things right, the initial state is something like 1⁄3 “I’m Luigi”, 1⁄3 “I’m Bowser”, and 1⁄3 “I’m Waluigi”, and RLHF eliminates the Bowser belief while having no effect on the other beliefs.
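One way to read “no effect on the other beliefs” is that the relative odds between the surviving beliefs are unchanged, in which case renormalizing after eliminating Bowser necessarily raises each survivor from 1⁄3 to 1⁄2. A minimal illustration of that arithmetic, under this interpretive assumption:

```python
beliefs = {"Luigi": 1/3, "Bowser": 1/3, "Waluigi": 1/3}

# RLHF (in this reading) drives the Bowser component to zero...
beliefs["Bowser"] = 0.0

# ...and leaves the Luigi/Waluigi odds untouched, so renormalizing
# raises each surviving belief from 1/3 to 1/2.
total = sum(beliefs.values())
beliefs = {k: v / total for k, v in beliefs.items()}
print(beliefs)  # {'Luigi': 0.5, 'Bowser': 0.0, 'Waluigi': 0.5}
```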