Cleo Nardo comments on The Waluigi Effect (mega-post)

Cleo Nardo 5 Mar 2023 3:13 UTC
LW: 9 AF: 1
Yep, I think you might be right about the maths, actually.

I’m thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.

So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.

I’m not claiming that the basin of attraction is the entire space of interpolation between waluigis and luigis.

Actually, maybe “attractor” is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What’s the right dynamical-systemy term for that?
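
To make the asymmetry concrete, here is a toy Bayesian sketch rather than a claim about how any real model behaves. It tracks the weight `w` on a single deceptive waluigi against a single luigi. The token labels `"nice"` and `"defect"`, the function names, and the value `p_defect = 0.01` are all illustrative assumptions: the luigi never emits the trigger token, while the waluigi emits it rarely.

```python
def update(w, token, p_defect=0.01):
    """Bayesian update of the waluigi weight w after one observed token.

    Assumptions (illustrative only): a luigi never emits "defect", and a
    deceptive waluigi emits it with probability p_defect. Requires 0 < w <= 1.
    """
    if token == "defect":
        like_luigi, like_waluigi = 0.0, p_defect
    else:
        like_luigi, like_waluigi = 1.0, 1.0 - p_defect
    num = w * like_waluigi
    den = num + (1.0 - w) * like_luigi
    return num / den


def trajectory(w0, tokens, p_defect=0.01):
    """Return the waluigi weight before and after each token."""
    ws = [w0]
    for t in tokens:
        ws.append(update(ws[-1], t, p_defect))
    return ws


# 50 well-behaved tokens, one trigger token, then 50 more well-behaved tokens.
tokens = ["nice"] * 50 + ["defect"] + ["nice"] * 50
ws = trajectory(0.5, tokens)
print(ws[50], ws[51], ws[-1])
```

In this sketch each well-behaved token shaves only a little off the waluigi weight, so the luigi weight creeps up slowly. A single trigger token sends the waluigi weight to 1, and no later well-behaved token can bring it back, because the luigi assigned that token zero probability.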
- cmck 13 Mar 2023 1:37 UTC
  6 points
  Describing the waluigi states as stable equilibria and the luigi states as unstable equilibria captures most of what you’re describing in the last paragraph here, though without the amplitude of each.
- abramdemski 13 Mar 2023 19:35 UTC
  LW: 5 AF: 2
  I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can’t keep dropping in probability forever, since evidence is lost. The probability only becomes small—but this means if you run for long enough you do in fact expect the transition.
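
abramdemski's limited-memory point can be written as a short bound, under an assumption the thread does not state explicitly. Suppose that, because evidence falls out of the finite context, the per-token probability of a waluigi transition never drops below some fixed $\varepsilon > 0$. The symbols $\varepsilon$ and $T$ are introduced here only for illustration.

```latex
\[
\Pr[\text{no transition in } T \text{ tokens}] \le (1-\varepsilon)^{T}
\xrightarrow[T\to\infty]{} 0,
\qquad
\mathbb{E}[\text{tokens until transition}] \le \frac{1}{\varepsilon}.
\]
```

So even a very small floor on the transition probability makes the transition almost certain over a long enough run, with the expected wait scaling roughly like $1/\varepsilon$.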