Vivek Hebbar comments on The Waluigi Effect (mega-post)

Vivek Hebbar 3 Mar 2023 23:40 UTC
LW: 26 AF: 10
Agreed. To give a concrete toy example: Suppose that Luigi always outputs “A”, and Waluigi is {50% A, 50% B}. If the prior is {50% Luigi, 50% Waluigi}, each “A” output is a 2:1 update towards Luigi. The probability of “B” keeps dropping, and the probability of ever seeing a “B” asymptotes to 50% (as it must, since only Waluigi can output “B” and his prior is 50%).
This is the case for perfect predictors, but there could be some argument about particular kinds of imperfect predictors which supports the claim in the post.
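For anyone who wants to check the arithmetic, here is a minimal sketch of the toy model in Python. It assumes an ideal Bayesian predictor with unbounded memory, and the function names are illustrative only.

```python
from fractions import Fraction


def luigi_posterior(n_as, prior_luigi=Fraction(1, 2)):
    """Posterior P(Luigi) after observing n_as consecutive "A" tokens."""
    like_luigi = Fraction(1) ** n_as        # Luigi always outputs "A"
    like_waluigi = Fraction(1, 2) ** n_as   # Waluigi outputs "A" half the time
    numerator = prior_luigi * like_luigi
    return numerator / (numerator + (1 - prior_luigi) * like_waluigi)


def prob_b_within(n_steps, prior_luigi=Fraction(1, 2)):
    """P(at least one "B" among the first n_steps tokens) under the mixture."""
    # Only Waluigi can emit "B"; he avoids it for n_steps tokens w.p. (1/2)^n_steps.
    return (1 - prior_luigi) * (1 - Fraction(1, 2) ** n_steps)


if __name__ == "__main__":
    for n in (0, 1, 2, 10):
        print(n, luigi_posterior(n), float(prob_b_within(n)))
```

Each “A” doubles the odds for Luigi, so the posterior after ten A's is 1024/1025, while the chance of ever seeing a “B” climbs towards 1/2 without reaching it.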
- Tom Shlomi 8 Mar 2023 1:15 UTC
  13 points
  Context windows could make the claim from the post correct. Since the simulator can only consider a bounded amount of evidence at once, its P[Waluigi] has a lower bound. Meanwhile, it takes much less evidence than fits in the context window to bring its P[Luigi] down to effectively 0.
  Imagine that, in your example, once Waluigi outputs B it will always continue outputting B (if he’s already revealed to be Waluigi, there’s no point in acting like Luigi). If there’s a context window of 10, then the simulator’s probability of Waluigi never goes below 1/1025, while Luigi’s probability permanently goes to 0 once B is outputted, and so the simulator is guaranteed to eventually get stuck at Waluigi.
  I expect this is true for most imperfections that simulators can have; it’s harder to keep track of a bunch of small updates for X over Y than it is for one big update for Y over X.
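  As a rough check on the 1/1025 figure, assume the simulator conditions only on the last $k$ tokens and the window holds $k$ consecutive A's (the symbol $k$ is introduced here for illustration). With the 50/50 prior from the toy example:
  $$P[\text{Waluigi} \mid A^{k}] = \frac{\tfrac12 \cdot (\tfrac12)^{k}}{\tfrac12 \cdot 1 + \tfrac12 \cdot (\tfrac12)^{k}} = \frac{1}{2^{k} + 1}, \qquad k = 10 \;\Rightarrow\; \frac{1}{1025}.$$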
- Cleo Nardo 5 Mar 2023 3:13 UTC
  LW: 9 AF: 1
  Yep I think you might be right about the maths actually.
  
  I’m thinking that waluigis with 50% A and 50% B have been eliminated by LLM pretraining and definitely by RLHF. The only waluigis that remain are deceptive-at-initialisation.
  
  So what we have left is a superposition of a bunch of luigis and a bunch of waluigis, where the waluigis are deceptive, and for each waluigi there is a different phrase that would trigger them.
  
  I’m not claiming the basin of attraction is the entire space of interpolation between waluigis and luigis.
  
  Actually, maybe “attractor” is the wrong technical word to use here. What I want to convey is that the amplitude of the luigis can only grow very slowly and can be reversed, but the amplitude of the waluigi can suddenly jump to 100% in a single token and would remain there permanently. What’s the right dynamical-systemy term for that?
  - cmck 13 Mar 2023 1:37 UTC
    6 points
    Describing the waluigi states as stable equilibria and the luigi states as unstable equilibria captures most of what you’re describing in the last paragraph here, though without the amplitude of each.
  - abramdemski 13 Mar 2023 19:35 UTC
    LW: 5 AF: 2
    I think your original idea was tenable. LLMs have limited memory, so the waluigi hypothesis can’t keep dropping in probability forever, since evidence is lost. The probability only becomes small—but this means if you run for long enough you do in fact expect the transition.
- abramdemski 13 Mar 2023 19:32 UTC
  LW: 8 AF: 3
  LLMs are high-order Markov models, meaning they can’t really balance two different hypotheses in the way you describe; because evidence eventually drops out of memory, the probability of Waluigi becomes very small instead of dropping to zero. This makes an eventual waluigi transition inevitable, as claimed in the post.
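  To make “inevitable” concrete in the toy model from the parent comment (Luigi always outputs A, Waluigi is 50/50 between A and B, equal priors), suppose, purely as an illustrative assumption, that memory is a window of $k$ tokens. Even with a window full of A's, the chance that the next token is B is at least $\tfrac12 \cdot \tfrac{1}{2^{k}+1}$, so
  $$P[\text{no B in } n \text{ steps}] \le \left(1 - \frac{1}{2\,(2^{k}+1)}\right)^{n} \to 0 \quad \text{as } n \to \infty.$$
  For $k = 10$ the per-token chance is at least 1/2050: small, but bounded away from zero.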
  - Cleo Nardo 13 Mar 2023 23:08 UTC
    LW: 21 AF: 12
    You’re correct. The finite context window biases the dynamics towards simulacra which can be evidenced by short prompts, i.e. biases away from luigis and towards waluigis.
    
    But let me be more pedantic and less dramatic than I was in the article — the waluigi transitions aren’t inevitable. The waluigi are approximately-absorbing classes in the Markov chain, but there are other approximately-absorbing classes which the luigi can fall into. For example, endlessly cycling through the same word (mode-collapse) is also an approximately-absorbing class.
    - abramdemski 13 Mar 2023 23:15 UTC
      LW: 4 AF: 3
      What report is the image pulled from?
      - Cleo Nardo 13 Mar 2023 23:21 UTC
        17 points
        “Open Problems in GPT Simulator Theory” (forthcoming)
        Specifically, this is a chapter on the preferred basis problem for GPT Simulator Theory.
        
        TLDR: GPT Simulator Theory says that the language model $μ : T^{k} \to Δ(T)$ decomposes into a linear interpolation $μ = \sum_{s \in S} α_{s} μ_{s}$, where each $μ_{s} : T^{k} \to Δ(T)$ is a “simulacrum” and the amplitudes $α_{s}$ update in an approximately Bayesian way. However, this decomposition is non-unique, making GPT Simulator Theory either ill-defined, arbitrary, or trivial. By comparing this problem to the preferred basis problem in quantum mechanics, I construct various potential solutions and compare them.
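        A minimal illustration of the non-uniqueness, with $ν_{a}$ and $ν_{b}$ as hypothetical component distributions introduced only for this example: if one simulacrum is itself a mixture, $μ_{1} = \tfrac12 ν_{a} + \tfrac12 ν_{b}$, then
        $$α_{1} μ_{1} + \sum_{s \neq 1} α_{s} μ_{s} = \tfrac{α_{1}}{2} ν_{a} + \tfrac{α_{1}}{2} ν_{b} + \sum_{s \neq 1} α_{s} μ_{s},$$
        so the same $μ$ admits two different decompositions, and nothing in $μ$ alone says which set of components counts as the simulacra.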
- Eschaton 7 Mar 2023 5:46 UTC
  1 point
  The transform isn’t symmetric though, right? A character portraying “good” behaviour is, narratively speaking, more likely to have been deceitful the whole time, or to transform into a villain, than the antagonist is to turn “good”.