> Each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.
Each non-Waluigi step does increase the probability of never observing a transition to Waluigi, but only by a shrinking amount. As a toy example, start with P(Waluigi) = P(Luigi) = 0.5. Even if P(Luigi) increases monotonically, finding novel evidence that Luigi isn't a deceptive Waluigi becomes progressively harder, so P(Luigi) could converge to, say, 0.8 rather than to 1.
However, the moment Luigi says something Waluigi-like, we jump to a world where P(Waluigi) = 0.95, since the "good character revealed to be evil" trope is very common. To get back to Luigi, we would have to rely on a trope in which a character goes from good to bad to good again. Such tropes exist, but they are less common. Of course, this assumes the context window is large enough to "remember" the point where Luigi turned bad; once the model has forgotten it, a plain "bad to good" redemption trope suffices to get back to Luigi, and those are more common.
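For concreteness, here is a minimal sketch of this toy model as a sequence of Bayesian updates over the two hypotheses. All of the numbers are made up for illustration: the geometric decay schedule for the evidence each "nice" line provides, and the 0.005-vs-0.5 likelihoods of a Waluigi-like line, were chosen only to roughly reproduce the figures above.

```python
def posterior_luigi(odds):
    """Convert the odds P(Luigi)/P(Waluigi) into P(Luigi)."""
    return odds / (1.0 + odds)

odds = 1.0  # prior: P(Luigi) = P(Waluigi) = 0.5

# Nice utterances: a genuine Luigi always produces them, while a
# deceptive Waluigi imitates them with probability 1 - eps_t. Because
# the imitation keeps improving (eps_t shrinks geometrically), the
# likelihood ratios approach 1 and the posterior plateaus below 1.
for t in range(50):
    eps_t = 0.5 * 0.65 ** t
    odds *= 1.0 / (1.0 - eps_t)  # ratio P(nice | Luigi) / P(nice | Waluigi)

print(f"after many nice lines: P(Luigi) ≈ {posterior_luigi(odds):.2f}")        # ≈ 0.85

# A single Waluigi-like line: very unlikely from a true Luigi (0.005),
# common from a Waluigi (0.5), so one observation collapses the posterior.
odds *= 0.005 / 0.5
print(f"after one bad line:    P(Waluigi) ≈ {1 - posterior_luigi(odds):.2f}")  # ≈ 0.95
```

The asymmetry is the whole point: the pro-Luigi evidence arrives in ever-smaller increments because its likelihood ratios tend to 1, while a single Waluigi-like line carries a 100:1 likelihood ratio and undoes all of it at once.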