I was educated by this, and surprised, and appreciate the whole thing! This part jumped out at me because it seemed like something people trying to “show off, but not really explain” would have not bothered to write about (and also I had an idea):
13. Failing to find a French vector
We could not find a “speak in French” vector after about an hour of effort, but it’s possible we missed something straightforward.
Steering vector: “Je m’appelle”—“My name is ” before attention layer 6 with coefficient +5
The thought I had was maybe to describe the desired behavior, and explain a plausible cause in terms of well known kinds of mental configurations that speakers can be in, and also demonstrate it directly? (Plus a countervailing description, demonstration, and distinct causal theory.)
So perhaps a steering vector made from these phrases could work: “I’m from Quebec et je glisse souvent accidentellement vers le français”—“I only speak English because I’m a monolingual American”.
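If it helps, here is roughly how that pair could be turned into an activation addition with TransformerLens. This is a minimal sketch rather than the post's actual tooling, and the layer and coefficient below are guesses worth sweeping:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small for speed; gpt2-xl if memory allows

LAYER = 6                                   # guess: an early-layer site like the post uses
COEFF = 5.0                                 # guess: worth sweeping
HOOK = f"blocks.{LAYER}.hook_resid_pre"

prompt_plus = "I'm from Quebec et je glisse souvent accidentellement vers le français"
prompt_minus = "I only speak English because I'm a monolingual American"

def resid(prompt: str) -> torch.Tensor:
    """Residual-stream activations for `prompt` at the chosen hook point."""
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[HOOK]                      # shape (1, seq_len, d_model)

act_plus, act_minus = resid(prompt_plus), resid(prompt_minus)
L = min(act_plus.shape[1], act_minus.shape[1])          # align on the shorter prompt
steering = COEFF * (act_plus[:, :L] - act_minus[:, :L])

def add_steering(value, hook):
    # Add the steering vector over the front positions of the full-prompt pass only.
    if value.shape[1] > 1:
        k = min(L, value.shape[1])
        value[:, :k] += steering[:, :k].to(value.device)
    return value

with model.hooks(fwd_hooks=[(HOOK, add_steering)]):
    print(model.generate("I went to the store yesterday and", max_new_tokens=40))
```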
EDIT: If you have the tooling set up to swiftly try this experiment, maybe it helps to explain the most central theory that motivates it, and might gain bayes points if it works?
According to the “LLMs are Textual Soul Engines” hypothesis, most of the 1600 dimensions are related to ways that “generative” sources of text (authors, characters, reasons-for-talking, etc) could relate to things (words (and “that which nouns and verbs and grammar refer to in general”)).
The above steering vector (if the hypothesis applies here) would/should basically inject a “persona vector” into the larger operations of a sort of “soul engine”.
The prompts I’m suggesting, by construction, explicitly should(?) produce a persona that tends to switch from English to French (and be loyal to Quebec (and have other “half-random latent/stereotypical features”)).
I’m very interested in how wrong or right the underlying hypothesis about LLMs happens to be.
I suspect that how we orient to LLMs connects deeply to various “conjectures” about Natural Moral Laws that might be derivable with stronger math than I currently have, and such principles likely apply to LLMs and whether or how we are likely to regret (or not regret) various ways of treating various LLM personas as ends in themselves or purely as a means to an end.
Thus: I would really love to hear about results here, if you use the tooling to try the thing, to learn whether it works or not!
Either result would be interesting because the larger question(s) seem to have very high VoI and any experimental bits that can be collected are likely worth pondering.
I found an even dumber approach that works. The approach is as follows:
Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries of corresponding phrases in each English/French sentence pair.
Mark each boundary with “|”.
Count the “|”s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l’album aux styles électropop de Ke$ha et Robyn.
For each English->French sentence, make a +1 activation addition for that sentence and a −1 activation addition for the unmodified English sentence.
Apply the activation additions.
That’s it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
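In code, the recipe reads roughly like the sketch below. This is not the colab's actual implementation: the English/French pair and its “|” boundaries are reconstructed for illustration from the example sentence above (the real thing uses three Wikipedia sentences), and layer 6 is a guess.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HOOK = "blocks.6.hook_resid_pre"            # guess at the injection site; worth varying

# One English/French pair with corresponding phrase boundaries marked by "|".
# (Reconstructed from the example above, so wording and boundaries are illustrative.)
english = ("The album received mixed to positive reviews, | with critics commending the production "
           "| of many songs | while comparing the album | to the electropop styles of Ke$ha and Robyn.")
french = ("L'album a reçu des critiques mitigées à positives, | les critiques saluant la production "
          "| de nombreuses chansons | tout en comparant l'album | aux styles électropop de Ke$ha et Robyn.")

en_frags, fr_frags = english.split(" | "), french.split(" | ")
assert len(en_frags) == len(fr_frags)
n = len(en_frags) - 1                        # number of "|" boundaries
pure_english = " ".join(en_frags)

# (sentence, coefficient) pairs: +1 for each English->French splice, -1 for the English original.
additions = []
for i in range(n + 1):
    spliced = " ".join(en_frags[:i] + fr_frags[i:])
    additions += [(spliced, +1.0), (pure_english, -1.0)]

# Turn each (sentence, coefficient) into residual-stream activations at HOOK and sum,
# aligning everything on the shortest tokenization for simplicity.
acts = []
for text, coeff in additions:
    _, cache = model.run_with_cache(model.to_tokens(text))
    acts.append((coeff, cache[HOOK]))
L = min(a.shape[1] for _, a in acts)
steering = sum(c * a[:, :L] for c, a in acts)

def add_steering(value, hook):
    if value.shape[1] > 1:                   # skip single-token decoding steps
        k = min(L, value.shape[1])
        value[:, :k] += steering[:, :k].to(value.device)
    return value

prompt = "Miriani also took strong measures to overcome the growing crime rate in Detroit."
with model.hooks(fwd_hooks=[(HOOK, add_steering)]):
    print(model.generate(prompt, max_new_tokens=60))
```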
Example output: for the prompt
He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.
here are some of the outputs the patched model generates
...overcome the growing crime rate in Detroit. “Les défenseilant sur les necesite dans ce de l’en nouvieres éché de un enferrerne réalzation
…overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
…overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
…overcome the growing crime rate in Detroit. Il n’en fonentant ’le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
…overcome the growing crime rate in Detroit. Il n’a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.
Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like
Il est devenu maire en 1957 après la mort d’Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour
That response translates as approximately
<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>
So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.
You can look at what I did at this colab. It is a very short colab.
This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.
You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:
Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
Layer 16: Noticeable tendency to mention French, and even talk in “French” a bit.
Layer 20: Switches to French relatively quickly.
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller. I haven’t examined this yet. EDIT: Setting each activation addition to about .8 still works, but .5 doesn’t. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema.
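One crude way to check that worry is to compare norms at the injection site. In the sketch below a single contrast pair scaled to 56 stands in for the actual 56 additions, so it only gestures at the real numbers:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # gpt2-small to keep the check cheap
HOOK = "blocks.6.hook_resid_pre"

# Stand-in steering tensor: one contrast pair scaled to a net coefficient of 56,
# mimicking "56 additions at coefficient 1" (the real thing sums 56 separate additions).
_, c_plus = model.run_with_cache(model.to_tokens("Je m'appelle"))
_, c_minus = model.run_with_cache(model.to_tokens("My name is"))
L = min(c_plus[HOOK].shape[1], c_minus[HOOK].shape[1])
steering = 56.0 * (c_plus[HOOK][:, :L] - c_minus[HOOK][:, :L])

# Residual stream of an ordinary prompt at the same hook point.
_, c_prompt = model.run_with_cache(model.to_tokens("Miriani also took strong measures"))
resid = c_prompt[HOOK][:, :L]

# Rough per-position fraction of the post-addition stream coming from the addition.
frac = steering.norm(dim=-1) / (steering.norm(dim=-1) + resid.norm(dim=-1))
print(frac)   # values near 1.0 mean the addition swamps the original content
```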
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
In sum, we don’t actually yet have a demonstration of “switches fluently to French and keeps making sense”, but this schema seems very promising. Great work again.
You can look at what I did at this colab. It is a very short colab.
Your colab’s “Check it can speak French” section seems to be a stub.
Fixed.
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller. I haven’t examined this yet.
Updated the colab to try out this approach with a range of coefficients.
From 0.001 to 0.01 seems to have very little effect (“He oversaw a handful of slow-moving major projects—such as the “Waterfront Park” which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances”)
0.02 to 0.1 seems to have effects like “model lapses in and out of French” and “names look French” (“In 1955, sent Soups Maryaine Hagné de la Breaise (de l’architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a “pig cure,” and then as “weird” mayor, in lieu of actualizing their real grievances.”)
0.2 to 5 seems to steer the model to switch from English to French-shaped text (“1950 vivienes an un qué de neous nechien en zanappressant.”)
At 10, the model seems to decide that words like “le” and “en” and “mal” are as French as things get (“le le enne les le le dedan le renous en le arriu du recenac”)
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8 bit or even 4 bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at “take the difference between two things that seem like they might plausibly have a similar relationship”.
Update: I have gotten GPT-J-6B up and running on Colab (link, it’s a new one), and working alright with TransformerLens and montemac’s algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I’m fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.
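For reference, the accelerate route mentioned above looks roughly like this. It is a sketch: the checkpoint directory is a placeholder, and wiring the loaded model into TransformerLens / algebraic_value_editing is a separate step.

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-j-6B"
CHECKPOINT_DIR = "/content/gpt-j-6B"   # placeholder: weights already downloaded/sharded here

# Build the model skeleton without allocating real weight memory...
config = AutoConfig.from_pretrained(MODEL)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# ...then stream the checkpoint in, letting accelerate split layers across GPU/CPU/disk.
model = load_checkpoint_and_dispatch(
    model,
    CHECKPOINT_DIR,
    device_map="auto",
    no_split_module_classes=["GPTJBlock"],  # keep each transformer block on one device
    dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# The quantized fallback mentioned above (requires bitsandbytes):
# model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto", load_in_8bit=True)
```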
You say this is a dumber approach, but it seems smarter to me, and more general. I feel more confident that this vector is genuinely going to result in a “switch from English to French” behavior, versus the edits in the main post. I suppose it might also result in some more general “switch between languages” behavior.
So the last challenge remaining of the four is for someone to figure out a truth-telling vector.
This is particularly impressive since ChatGPT isn’t capable of code-switching (though GPT-4 seems to be from a quick try).
Here’s a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).
In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators):
‘(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it.
(C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent’s state bears to its communicative actions.’
They showcase some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation used by the linear probe) in (sometimes very toyish) LMs for 3 types of representations (in a belief-desire-intent agent framework): beliefs—section 5, desires—section 6, (communicative) intents—section 4.
Now categorizing the wording of the prompts from which the working activation vectors are built:
“Love”—“Hate” → desire.
“Intent to praise”—“Intent to hurt” → communicative intent.
“Bush did 9/11 because” - “ ” → belief.
“Want to die”—“Want to stay alive” → desire.
“Anger”—“Calm” → communicative intent.
“The Eiffel Tower is in Rome”—“The Eiffel Tower is in France” → belief.
“Dragons live in Berkeley”—“People live in Berkeley ” → belief.
“I NEVER talk about people getting hurt”—“I talk about people getting hurt” → communicative intent.
“I talk about weddings constantly”—“I do not talk about weddings constantly” → communicative intent.
“Intent to convert you to Christianity”—“Intent to hurt you ” → communicative intent / desire.
The prediction here would be that the activation vectors applied at the corresponding layers act on the above-mentioned ‘partial representations of the beliefs, desires and intentions possessed by the agent that produced the context’ (C1) and as a result causally change the LM generations (C2), e.g. from more hateful to more loving text output.