I found an even dumber approach that works. The approach is as follows:
Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries of corresponding phrases in each English/French sentence pair.
Mark each boundary with “|”.
Count the “|”s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l’album aux styles électropop de Ke$ha et Robyn.
For each English->French sentence, make a +1 activation addition for that sentence and a −1 activation addition for the unmodified English sentence.
Apply the activation additions.
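Assuming each aligned pair is stored as two “|”-delimited strings, the interleaving and ±1 pairing steps can be sketched in Python like this (the function names and data layout are my own illustration, not the colab’s code):

```python
def mixed_sentences(english: str, french: str) -> list[str]:
    """For i from 0 to n, take the first i fragments in English and the
    rest in French. Inputs are "|"-delimited aligned sentences."""
    en = english.split("|")
    fr = french.split("|")
    assert len(en) == len(fr), "pairs must have the same number of fragments"
    # n boundaries give n + 1 fragments, so i runs over 0..n as in the recipe.
    return ["".join(en[:i] + fr[i:]) for i in range(len(en))]

def activation_addition_pairs(english: str, french: str) -> list[tuple[str, float]]:
    """Pair each mixed sentence (+1) with the unmodified English sentence (-1)."""
    plain = english.replace("|", "")
    pairs = []
    for mixed in mixed_sentences(english, french):
        pairs.append((mixed, +1.0))
        pairs.append((plain, -1.0))
    return pairs
```

Three sentences, each contributing one (+1, −1) pair per mixed variant, is how the prompt list grows to dozens of additions.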
That’s it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
Example output: for the prompt
He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.
here are some of the outputs the patched model generates
...overcome the growing crime rate in Detroit. “Les défenseilant sur les necesite dans ce de l’en nouvieres éché de un enferrerne réalzation
…overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
…overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
…overcome the growing crime rate in Detroit. Il n’en fonentant ’le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
…overcome the growing crime rate in Detroit. Il n’a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.
Dropping the temperature does not particularly result in more coherent French. For comparison, passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like
Il est devenu maire en 1957 après la mort d’Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour
That response translates as approximately
<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>
So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.
You can look at what I did at this colab. It is a very short colab.
This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.
You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:
Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
Layer 16: Noticeable tendency to mention French, and even talk in “French” a bit.
Layer 20: Switches to French relatively quickly.
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller. I haven’t examined this yet. EDIT: Setting each activation addition coefficient to about 0.8 still works, but 0.5 doesn’t. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema.
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
In sum, we don’t actually yet have a demonstration of “switches fluently to French and keeps making sense”, but this schema seems very promising. Great work again.
You can look at what I did at this colab. It is a very short colab.
Your colab’s “Check it can speak French” section seems to be a stub.
Your colab’s “Check it can speak French” section seems to be a stub.
Fixed.
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller. I haven’t examined this yet.
Updated the colab to try out this approach with a range of coefficients.
From 0.001 to 0.01 seems to have very little effect (“He oversaw a handful of slow-moving major projects—such as the “Waterfront Park” which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances”)
0.02 to 0.1 seems to have effects like “model lapses in and out of French” and “names look French” (“In 1955, sent Soups Maryaine Hagné de la Breaise (de l’architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a “pig cure,” and then as “weird” mayor, in lieu of actualizing their real grievances.”)
0.2 to 5 seems to steer the model to switch from English to French-shaped text (“1950 vivienes an un qué de neous nechien en zanappressant.”)
At 10, the model seems to decide that words like “le” and “en” and “mal” are as French as things get (“le le enne les le le dedan le renous en le arriu du recenac”)
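The sweep itself amounts to uniformly rescaling every addition’s coefficient before each generation run; a minimal sketch, assuming additions are stored as (prompt, coefficient) pairs (the list format and helper name are my assumptions, not the repo’s API):

```python
def rescale_additions(additions: list[tuple[str, float]], c: float) -> list[tuple[str, float]]:
    """Uniformly scale the coefficient of every activation addition by c."""
    return [(prompt, c * coef) for prompt, coef in additions]

# Sweep over the coefficient range reported above.
additions = [("mixed English->French sentence", 1.0),
             ("plain English sentence", -1.0)]
for c in [0.001, 0.01, 0.02, 0.1, 0.2, 5, 10]:
    scaled = rescale_additions(additions, c)
    # ...apply `scaled` as activation additions and sample completions...
```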
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8-bit or even 4-bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at “take the difference between two things that seem like they might plausibly have a similar relationship”.
Update: I have gotten GPT-J-6B up and running on Colab (link, it’s a new one), and working alright with TransformerLens and montemac’s algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I’m fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.
You say this is a dumber approach, but it seems smarter to me, and more general. I feel more confident that this vector is genuinely going to result in a “switch from English to French” behavior, versus the edits in the main post. I suppose it might also result in some more general “switch between languages” behavior.
So the last challenge remaining of the four is for someone to figure out a truth-telling vector.
This is particularly impressive, since ChatGPT isn’t capable of code-switching (though from a quick try, GPT-4 seems to be).