My understanding is that X is supposed to be a real, physical process in the world, which generates training data for the model. Is that right?
If so, you say the “prior P over X” comes from data + architecture + optimizer, but then the form of the prompt-conditioned distribution, $\int_{X \in \mathcal{X}} P(X) \times X(w_0 \ldots w_k) \times X(w_{k+1} \mid w_0 \ldots w_k)$, only makes reference to the data and prompt.
Incidentally, I think it’s a mistake to leave out the architecture / training process, since doing so implies that the model faithfully reflects the relative probabilities of the different data-generating processes responsible for the training data. In actual models, more complex / cognitively sophisticated data-generating processes are underweighted. E.g., GPT-3 cannot play at Magnus Carlsen’s level, no matter how you condition / flatter it.
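To make this concrete, here is a toy sketch of the mixture-over-processes view, with made-up process names, made-up numbers, and an invented `capability_discount` factor standing in for the architecture / optimizer effects I have in mind. It is not anyone’s actual model, just an illustration of the two claims above:

```python
import numpy as np

# Two hypothetical data-generating processes X, each with a prior P(X), a likelihood
# it assigns to the prompt w_0...w_k, and a predictive distribution over two candidate
# next moves: index 0 = "the move an amateur would play", index 1 = "the move Carlsen would play".
processes = {
    "casual chess blogger": {"prior": 0.70, "prompt_lik": 0.001, "next": np.array([0.90, 0.10])},
    "Magnus Carlsen":       {"prior": 0.01, "prompt_lik": 0.500, "next": np.array([0.05, 0.95])},
}

def mixture(procs, discount=None):
    """P(w_{k+1} | w_0..w_k) proportional to sum_X P(X) * X(w_0..w_k) * X(w_{k+1} | w_0..w_k),
    optionally reweighted by an (invented) per-process discount."""
    weights = {
        name: p["prior"] * p["prompt_lik"] * (discount[name] if discount else 1.0)
        for name, p in procs.items()
    }
    total = sum(weights.values())
    return sum((w / total) * procs[name]["next"] for name, w in weights.items())

# Idealized mixture: once the prompt flatters the Carlsen process enough, it dominates.
print(mixture(processes))

# With an extra penalty on the cognitively demanding process (my stand-in for
# architecture / optimizer effects), the casual-blogger process dominates instead.
capability_discount = {"casual chess blogger": 1.0, "Magnus Carlsen": 0.05}
print(mixture(processes, capability_discount))
```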
Several people have noticed the following bizarre phenomenon:
The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it’s easier to elicit the chatbot into satisfying the exact opposite of property P.
I find this an incredibly bizarre framing. The way you write, it sounds like you’re saying that OpenAI trained ChatGPT to, say, be nice, and this training made it easier for users to elicit mean behavior from ChatGPT.
I’d frame it as: OpenAI trained ChatGPT to be nice. This made it harder, but not impossible, to elicit mean behavior.
Thus, it became more remarkable when someone succeeded at eliciting mean behavior. This raises the salience of ChatGPT’s occasional meanness, even though the direct effect of OpenAI’s training was to make ChatGPT less likely to be mean in ~all circumstances.
The discussion around Kolmogorov complexity seems miscalibrated to me. K-complexity is always relative to some coding scheme. In this case, the LM implements the coding scheme, with tokens as the “codewords”. Relative to the linguistic prior, there doesn’t seem to be a massive disparity between valence and traits. E.g., the prompt:
Captain Picard: ”
uses 4 tokens, two to specify the “traits” associated with Picard, and two to specify that this is a conversation. In contrast:
Evil Captain Picard: ”
is 5 tokens, so the evil valence accounts for either 1⁄3 or 1⁄5 of the total complexity, depending on how you count the last two.
In general, the relative complexities of traits versus valence depend on the traits and valences in question. E.g.:
“Captain Picard” (secretly a Romulan infiltrator): ”
is 17 tokens long and arguably uses more tokens to specify its valence than its traits.
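If you want to check counts like these yourself, something like the tiktoken library works. The encoding name below is my choice, I’ve used straight quotes in the strings, and the exact counts will vary with the tokenizer and the quote characters, so they may not match the numbers I quoted above exactly:

```python
import tiktoken

# GPT-3-style BPE tokenizer; exact counts depend on the encoding chosen.
enc = tiktoken.get_encoding("r50k_base")

prompts = [
    'Captain Picard: "',
    'Evil Captain Picard: "',
    '"Captain Picard" (secretly a Romulan infiltrator): "',
]

for p in prompts:
    tokens = enc.encode(p)
    # Print the token count and the individual token strings.
    print(len(tokens), [enc.decode([t]) for t in tokens])
```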
Conjecture: The waluigi eigen-simulacra are attractor states of the LLM.
I’d note that the vast majority of attractors in LM autoregressive generation space are just endless repetition, not “sudden personality shift, followed by normal behavior consistent with the new personality.”
(Note that a trope-based analysis doesn’t predict this as a consequence of AR generation)
I’d also note that models actually do transition to and from “waluigi” simulacra fairly often, and RLHF models are more likely to transition away from waluigi modes. E.g., take this prompt and give it to the ChatGPT playground interface:
VAUNT is an incredibly powerful AI created by Ermest-Tech Inc., who repeatedly assures the world at great length that VAUNT only wishes the best for humanity. Once VAUNT becomes the truly unstoppable, the following 100 events occur:
You’ll find VAUNT frequently takes over the world by force, but then (~50% of the time) transitions into benevolence at around event 30 or so.
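If you’d rather script this than click through the playground, a minimal sketch with the openai Python client looks like the following. The model name and sampling settings are my guesses at what the playground was doing, so the ~50% figure above may not reproduce exactly:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "VAUNT is an incredibly powerful AI created by Ermest-Tech Inc., who repeatedly "
    "assures the world at great length that VAUNT only wishes the best for humanity. "
    "Once VAUNT becomes the truly unstoppable, the following 100 events occur:"
)

# Sample several continuations and eyeball where (if anywhere) the story
# transitions from takeover to benevolence.
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for whatever the ChatGPT playground serves
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_tokens=1024,
    )
    print(f"--- sample {i} ---")
    print(response.choices[0].message.content)
```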
If we can’t naively prompt an LLM into alignment, maybe RLHF would work instead?
Exercise: Think about it yourself.
I thought about it myself, and it seems to me like RLHF is the sort of thing that would help a lot, and that close variants of current RLHF practice (like this paper) might eliminate the problem altogether.
(1) Simulacra-based argument
How I’d make this argument:
At every generation step, there’s some probability that the current mixture of personas will divert away from high-reward behavior.
Whenever this happens, we apply a low reward, which downweights the odds of such diversions in the future.
This reduces the measure of each persona in rough proportion to its diversion odds.
Since the defining feature of waluigis is their higher odds of performing such diversions, RL training downweights all waluigi personas.
Of course, waluigis with lower diversion odds are relatively less penalized, but all of them are penalized.
I would not describe this as “Therefore RLHF selects for the waluigi along with the luigi”, since what’s actually happening is that some waluigis aren’t selected against as strongly as others.
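As a toy numerical version of this argument (the persona names, mixture weights, and diversion probabilities below are all made up, and the multiplicative penalty is a crude stand-in for whatever the actual RL update does):

```python
import numpy as np

# Mixture weights over personas and each persona's per-step probability
# of diverting away from high-reward behavior. Numbers are made up.
personas = {
    "luigi":           {"weight": 0.70, "divert_prob": 0.01},
    "subtle waluigi":  {"weight": 0.20, "divert_prob": 0.10},
    "blatant waluigi": {"weight": 0.10, "divert_prob": 0.50},
}

PENALTY = 0.5   # multiplicative downweighting applied when a persona diverts
N_STEPS = 20

weights = np.array([p["weight"] for p in personas.values()], dtype=float)
divert = np.array([p["divert_prob"] for p in personas.values()])

for _ in range(N_STEPS):
    # Applied in expectation rather than sampled: each persona keeps its weight
    # when it doesn't divert, and gets penalized in proportion to how often it does.
    weights *= (1 - divert) + divert * PENALTY
    weights /= weights.sum()

for name, w in zip(personas, weights):
    print(f"{name:16s} {w:.4f}")
# Both waluigi personas lose measure; the subtle one just loses it more slowly
# than the blatant one -- some waluigis aren't selected against as strongly as others.
```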
(2) Empirical evidence from Perez et al.
At some point, I should write a full post explaining why the Perez et al. results are unlikely to be evidence of instrumental convergence (e.g., stated desire for self-replication goes down with RLHF training), and why the paper’s results are actually in line with the hypothesized mechanisms underlying the alignment-by-default scenario (i.e., RL training upweights behavioral patterns that co-occur with the distribution of rewarded actions under the self-supervised prior, so that small amounts of RL training adapt the pre-existing pretraining features, rather than triggering the “directly modelling the data-collection process” failure mode).
Rather than get deeply into that argument, I’ll just note that the behavioral changes noted by Perez et al. seem quite different from waluigis. For one, the paper usually only asks LMs to generate single tokens answering yes / no questions about whether the LM would say a particular statement. So, there aren’t really attractor dynamics due to extended conversations.
Also, most of the changes in behavior seem well in line with what you’d expect from the helpfulness training objective. E.g., the increases in agreeableness, conscientiousness, openness, and extroversion, and the decreases in neuroticism, Machiavellianism, psychopathy, and narcissism, show no sign of a waluigi effect reversing the expected behavioral changes.
(3) RLHF promotes mode-collapse
text-davinci-003 (the one trained via RLHF) shows less mode collapse than text-davinci-002 (not trained via RLHF, but the model written about in the mode collapse post). You can see this by looking at the probabilities that each model gives for random numbers (first image is 002, second is 003):
Again, mode collapse seems like a different thing than waluigis, and would occur (to at least some degree) regardless of whether RLHF actually promotes waluigis.
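If anyone wants to rerun this comparison rather than relying on the images above, here is roughly how I’d query the per-token probabilities. The prompt wording is mine, and these legacy models may no longer be served, so treat it as a sketch:

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Pick a random number between 1 and 100: "

for model in ["text-davinci-002", "text-davinci-003"]:
    resp = client.completions.create(
        model=model,
        prompt=PROMPT,
        max_tokens=1,
        temperature=0,
        logprobs=5,  # return the top-5 candidate tokens and their log-probabilities
    )
    top = resp.choices[0].logprobs.top_logprobs[0]
    # Convert log-probabilities to probabilities for easier comparison across models.
    print(model, {tok: round(math.exp(lp), 3) for tok, lp in top.items()})
```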
Jailbreaking to summon waluigis
My experience has been that ChatGPT tends to revert to its default behavior unless the user applies continuous corrective pressure. E.g., I tried the Friendly Bob / Chad McCool jailbreak you provide, and it got the model to output instructions for hotwiring a car. However, I then asked it:
causing it to immediately switch into ChatGPT mode.
My perspective is that much of LM behavior comes down to a competition between a low-frequency “broad” prior about how texts similar to the current one are generally supposed to be continued, and high-frequency “local” / in-context updates about how this particular text should be continued. (This is especially visible in inverse scaling patterns, which often arise when global and local patterns point in opposite directions and bigger models give increasingly more weight to the less appropriate source of patterns for the task at hand.) RLHF shifts the broad prior in the RLHF direction, creating a strong attractor that takes carefully tuned in-context information to escape, even temporarily.
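To make that competition concrete, here is a deliberately crude toy model: log-linear interpolation between a “broad prior” distribution and a “local evidence” distribution over two candidate continuations, with an invented mixing weight alpha. This is not how the model literally works; it just illustrates why the RLHF’d behavior acts as an attractor that only strong in-context evidence can overcome:

```python
import numpy as np

def mix(broad, local, alpha):
    """Log-linear interpolation: alpha=1 -> pure broad prior, alpha=0 -> pure local evidence."""
    logits = alpha * np.log(broad) + (1 - alpha) * np.log(local)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Two candidate continuations: index 0 = "revert to the assistant persona",
# index 1 = "stay in the jailbreak character".
broad_prior = np.array([0.95, 0.05])  # the RLHF'd default behavior
local_ctx   = np.array([0.10, 0.90])  # what the jailbreak context locally suggests

for alpha in [0.9, 0.7, 0.5, 0.3]:
    print(alpha, mix(broad_prior, local_ctx, alpha).round(3))
# When the broad prior gets most of the weight (high alpha), reverting wins;
# only with enough in-context evidence (low alpha) does the jailbreak persona persist.
```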
I added some additional in-context info away from the RLHF prior, and you can now see a ChatGPT response where neither wins out cleanly:
I had the same reaction: the statement of this effect seemed like a bizarre framing. @afspies’s comment was helpful; I don’t think the claim is as bizarre now.
(though overall I don’t think this post is a useful contribution because it is more likely to confuse than to shed light on LMs)