Yep, I think you might be right about the maths, actually.
I’m thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining, and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.
So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and each waluigi has a different phrase that would trigger it.
I’m not claiming the basin of attraction is the entire space of interpolation between waluigis and luigis.
Actually, maybe “attractor” is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What’s the right dynamical-systemy term for that?
Describing the waluigi states as stable equilibria and the luigi states as unstable equilibria captures most of what you’re describing in the last paragraph here, though without the amplitude of each.
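To put rough numbers on the amplitudes, here is a toy Bayesian sketch of the asymmetry. It assumes just two hypotheses (one luigi, one deceptive waluigi) and two token types; the names `update`, `run`, `q_nice` and the labels "nice" and "villain" are made up for illustration and are not from the original post. The luigi never emits a villain token, while the waluigi mostly plays nice and occasionally slips.

```python
def update(p_waluigi, token, q_nice=0.95):
    """One Bayesian update of P(waluigi) after observing a single token.

    Toy assumptions: the luigi emits "nice" with probability 1, and the
    deceptive waluigi emits "nice" with probability q_nice.
    """
    like_luigi = 1.0 if token == "nice" else 0.0
    like_waluigi = q_nice if token == "nice" else 1.0 - q_nice
    numerator = p_waluigi * like_waluigi
    denominator = numerator + (1.0 - p_waluigi) * like_luigi
    if denominator == 0.0:
        # Token is impossible under both hypotheses; leave the belief unchanged.
        return p_waluigi
    return numerator / denominator


def run(tokens, p0=0.5, q_nice=0.95):
    """Return the trajectory of P(waluigi), starting from prior p0."""
    p = p0
    history = [p]
    for token in tokens:
        p = update(p, token, q_nice)
        history.append(p)
    return history


if __name__ == "__main__":
    trajectory = run(["nice"] * 10 + ["villain"] + ["nice"] * 10)
    for step, p in enumerate(trajectory):
        print(f"step {step:2d}  P(waluigi) = {p:.4f}")
```

Each nice token only multiplies the waluigi odds by `q_nice`, so the luigi amplitude creeps up slowly. A single villain token has zero likelihood under the luigi, so P(waluigi) jumps to 1 and no later run of nice tokens can bring it back down. In this toy model the waluigi end behaves like what Markov-chain texts call an absorbing state.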
I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can’t keep dropping in probability forever, since earlier evidence is eventually lost. The probability only becomes small, not zero, which means that if you run for long enough you do in fact expect the transition.
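A back-of-envelope sketch of that last step, under a strong simplifying assumption: the per-token chance of the transition never falls below some floor `eps`, and tokens are treated as independent. Both the floor and the function names are hypothetical, and the bound is only as good as those assumptions.

```python
import math


def prob_transition_within(n_steps, eps):
    """Chance of at least one transition in n_steps tokens, if every token
    independently triggers the transition with probability eps."""
    return 1.0 - (1.0 - eps) ** n_steps


def steps_for_confidence(eps, target=0.5):
    """Smallest number of tokens after which the chance of having
    transitioned is at least `target` (requires 0 < eps < 1, 0 < target < 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - eps))


if __name__ == "__main__":
    for eps in (0.01, 0.001, 0.0001):
        n = steps_for_confidence(eps, target=0.5)
        print(f"eps={eps}: even odds of a transition after about {n} tokens")
```

The chance of never transitioning is `(1 - eps) ** n_steps`, which goes to zero as the run gets longer for any positive floor, however small. If the floor were allowed to shrink towards zero over time, as it could with unlimited memory, this argument would no longer go through.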