Your colab’s “Check it can speak French” section seems to be a stub.
Fixed.
Note that all of the activation-addition coefficients are 1, and your code generates 56 additions, so we're effectively adding a coefficient-56 steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet.
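The equivalence claimed here is just linearity of the residual stream: 56 coefficient-1 additions of the same vector sum to one coefficient-56 addition. A toy check (the vector and its size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
vec = rng.normal(size=8)  # stands in for one activation-addition vector

# Adding the same vector 56 times with coefficient 1...
summed = sum(vec for _ in range(56))

# ...is exactly one addition with coefficient 56.
assert np.allclose(summed, 56 * vec)
```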
Updated the colab to try out this approach with a range of coefficients.
From 0.001 to 0.01 seems to have very little effect (“He oversaw a handful of slow-moving major projects—such as the “Waterfront Park” which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances”)
0.02 to 0.1 seems to have effects like “model lapses in and out of French” and “names look French” (“In 1955, sent Soups Maryaine Hagné de la Breaise (de l’architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a “pig cure,” and then as “weird” mayor, in lieu of actualizing their real grievances.”)
0.2 to 5 seems to steer the model to switch from English to French-shaped text (“1950 vivienes an un qué de neous nechien en zanappressant.”)
At 10, the model seems to decide that words like “le” and “en” and “mal” are as French as things get (“le le enne les le le dedan le renous en le arriu du recenac”)
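The qualitative regimes above track how large the added vector is relative to the residual stream itself: negligible at 0.001, comparable around 0.2-5, dominant at 10. A toy numpy sketch of the sweep (all names, shapes, and activations are made up; this is not the actual TransformerLens hook code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy residual-stream width

# Hypothetical stand-ins for activations on the two contrast prompts.
act_french = rng.normal(size=d_model)
act_english = rng.normal(size=d_model)
steering_vec = act_french - act_english

resid = rng.normal(size=d_model)  # residual stream at the injection point

# Sweep coefficients; each run adds coeff * steering_vec at the hook point.
ratios = []
for coeff in [0.001, 0.01, 0.1, 1.0, 10.0]:
    steered = resid + coeff * steering_vec
    # Relative size of the perturbation vs. the original stream:
    # tiny at 0.001 (no visible effect), dominant at 10 (degenerate output).
    ratio = np.linalg.norm(coeff * steering_vec) / np.linalg.norm(resid)
    ratios.append(ratio)
    print(f"coeff={coeff:g}: perturbation/stream norm ratio = {ratio:.3f}")
```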
However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.
Confirmed that GPT-2-XL also seems unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8-bit or even 4-bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at "take the difference between two things that seem like they might plausibly have a similar relationship".
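A quick back-of-envelope on why lower precision should dodge the OOM (parameter count rounded to 6B; this counts weights only, ignoring activations and any KV cache):

```python
# Rough parameter-memory footprint for a ~6B-parameter model at various
# precisions. Colab GPUs typically offer ~16 GB, so fp32 can't fit but
# fp16 is tight and int8/int4 leave comfortable headroom.
n_params = 6_000_000_000

footprint = {}
for dtype, nbytes in [("fp32", 4.0), ("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gib = n_params * nbytes / 2**30
    footprint[dtype] = gib
    print(f"{dtype}: ~{gib:.1f} GiB just for weights")
```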
Update: I have gotten GPT-J-6B up and running on Colab (link, it’s a new one), and working alright with TransformerLens and montemac’s algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I’m fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.