This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.
You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:
Layer 6: Messed up the completions; a few French words were seemingly scattered at random through the output.
Layer 16: Noticeable tendency to mention French, and even talk in “French” a bit.
Layer 20: Switches to French relatively quickly.
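For reference, here's a minimal sketch of how a layer-targeted addition plugs in. The hook name in the comment follows TransformerLens conventions; the model itself is stubbed out with numpy, so this only illustrates the injection mechanics, not a real forward pass:

```python
import numpy as np

def make_addition_hook(steer_vec, coeff=1.0):
    """Return a hook that adds coeff * steer_vec at every sequence position."""
    def hook(resid, hook=None):
        # resid has shape (batch, seq, d_model)
        return resid + coeff * steer_vec
    return hook

# With a real TransformerLens model, this hook would be registered as:
#   model.run_with_hooks(tokens,
#       fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", make_addition_hook(steer, 1.0))])
# Stubbed demonstration on a zero "residual stream":
resid = np.zeros((1, 4, 1600))          # GPT-2-XL's d_model is 1600
steered = make_addition_hook(np.ones(1600), coeff=0.8)(resid)
print(steered[0, 0, :3])                # every element is now 0.8
```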
Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller; I haven’t examined this yet. EDIT: Scaling each activation addition to about 0.8 still works, but 0.5 doesn’t. At this scale, most (>90%) of the affected residual stream content should come from the activation additions, which seems likely to overwrite the existing content in those streams. This makes me more skeptical of this scheme.
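To see why the coefficient matters here: summing 56 unit-coefficient additions produces an injected vector far larger than a typical single residual stream activation. A back-of-the-envelope numpy sketch (random Gaussian stand-ins for the real activations, so the exact numbers are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_additions = 1600, 56          # GPT-2-XL width; additions from the patch

additions = rng.normal(size=(n_additions, d_model))  # stand-in activation vectors
resid = rng.normal(size=d_model)                     # stand-in stream activation

steer_vec = additions.sum(axis=0)
for coeff in (1.0, 0.8, 0.5):
    steered = resid + coeff * steer_vec
    frac = np.linalg.norm(coeff * steer_vec) / np.linalg.norm(steered)
    print(f"coeff {coeff}: injected vector is {frac:.0%} of the steered stream's norm")
```

With independent Gaussian stand-ins, the summed vector's norm grows like sqrt(56) times a single activation's, so it dominates the stream even at coefficient 0.5.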
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
In sum, we don’t actually yet have a demonstration of “switches fluently to French and keeps making sense”, but this scheme seems very promising. Great work again.
You can see what I did in this colab; it’s a very short one.
Your colab’s “Check it can speak French” section seems to be a stub.
> Your colab’s “Check it can speak French” section seems to be a stub.
Fixed.
> Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we’re adding a “coefficient 56” steering vector to forward passes. This should probably be substantially smaller. I haven’t examined this yet.
Updated the colab to try out this approach with a range of coefficients.
From 0.001 to 0.01 seems to have very little effect (“He oversaw a handful of slow-moving major projects—such as the “Waterfront Park” which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances”)
0.02 to 0.1 seems to have effects like “model lapses in and out of French” and “names look French” (“In 1955, sent Soups Maryaine Hagné de la Breaise (de l’architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a “pig cure,” and then as “weird” mayor, in lieu of actualizing their real grievances.”)
0.2 to 5 seems to steer the model to switch from English to French-shaped text (“1950 vivienes an un qué de neous nechien en zanappressant.”)
At 10, the model seems to decide that words like “le” and “en” and “mal” are as French as things get (“le le enne les le le dedan le renous en le arriu du recenac”)
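The sweep itself is just a loop over coefficients; `generate_steered` below is a hypothetical stand-in for the colab's real steered-generation call (which goes through TransformerLens hooks), so only the loop structure carries over:

```python
def generate_steered(prompt, coeff):
    """Hypothetical placeholder: the real version injects coeff * steering_vector
    into the residual stream during generation and returns a completion."""
    return f"[completion of {prompt!r} at coeff={coeff}]"

# The ranges reported above came from a sweep like this one.
coefficients = [0.001, 0.01, 0.02, 0.1, 0.2, 1, 5, 10]
for c in coefficients:
    print(c, generate_steered("In 1955, the mayor", c))
```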
> However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
Confirmed that GPT-2-XL also seems unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8-bit or even 4-bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at “take the difference between two things that seem like they might plausibly have a similar relationship”.
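The precision intuition is easy to sanity-check: quantizing a steering vector to 8 or even 4 bits barely rotates it. A rough numpy sketch, using crude round-to-nearest quantization and a random stand-in vector (so the similarity values are illustrative, not measurements on GPT-J-6B):

```python
import numpy as np

rng = np.random.default_rng(0)
vec = rng.normal(size=1600)  # stand-in steering vector at GPT-2-XL width

def quantize(v, bits):
    """Crude symmetric round-to-nearest quantization, for illustration only."""
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    return np.round(v / scale) * scale

for bits in (8, 4):
    q = quantize(vec, bits)
    cos = vec @ q / (np.linalg.norm(vec) * np.linalg.norm(q))
    print(f"{bits}-bit: cosine similarity to full precision = {cos:.4f}")
```

The direction of the vector is what carries the steering effect, and it survives quantization almost unchanged.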
Update: I have gotten GPT-J-6B up and running on Colab (link, it’s a new one), and working alright with TransformerLens and montemac’s algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I’m fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.