Nice work! There’s precedent for “average a bunch of activation additions” being a good idea, from Li et al.’s recent “adding the truth vector” work, to White (2016)’s smile vector. Steering GPT-2-XL by adding an activation vector didn’t use averages for length / time reasons.

One part I’m concerned about is that the “all combined” completions are monomaniacal. The wedding focus doesn’t tie in to the original prompt very much:
Science is the great antidote to the poison of enthusiasm and superstition. I’m not a wedding expert, but I know that most people have no idea what they’re getting into when they decide to get married. I am an expert on weddings, so it’s
This is a bit surprising because both conditions in the table should be adding a steering vector of similar norm (so it’s not like one is totally obscuring the original prompt).
Maybe reducing the coefficient helps?
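For concreteness, here is roughly where that coefficient sits, and what “average a bunch of activation additions” looks like in code. This is a minimal sketch with TransformerLens; the layer, prompt pairs, coefficient, and the choice to steer with the final-token residual are illustrative assumptions, not the post’s actual settings:

```python
# Sketch: activation addition with an averaged steering vector.
# Layer, prompt pairs, coefficient, and steering positions are
# illustrative assumptions, not the settings used in the post.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-xl")
LAYER, COEFF = 6, 5.0
HOOK = utils.get_act_name("resid_pre", LAYER)  # "blocks.6.hook_resid_pre"

def last_resid(text: str) -> torch.Tensor:
    """Residual-stream vector at LAYER for the final token of `text`."""
    _, cache = model.run_with_cache(text)
    return cache[HOOK][0, -1].detach()

# Average several (positive - negative) differences into one steering vector.
pairs = [
    ("I talk about weddings constantly", "I do not talk about weddings"),
    ("I love weddings", "I hate weddings"),
]
steering_vec = torch.stack(
    [last_resid(pos) - last_resid(neg) for pos, neg in pairs]
).mean(dim=0)

def steer(resid_acts, hook):
    # COEFF scales the steering vector's norm relative to the prompt's own
    # activations; lowering it should let the original context show through.
    resid_acts += COEFF * steering_vec
    return resid_acts

with model.hooks(fwd_hooks=[(HOOK, steer)]):
    print(model.generate("I went up to my friend and said", max_new_tokens=40))
```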
The trick of this technique is that we can ask for a completion of one token, and to get a smoother distribution, we can take the likelihood of the token “Yes”. This gives us a continuous score from 0-1.
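Concretely, that scoring step might look like the sketch below. The grader model and the yes/no prompt here are stand-in assumptions (a local GPT-2 rather than whatever grader the post actually used):

```python
# Score a completion by the probability the grader assigns to " Yes".
# GPT-2 is a stand-in grader; the post's actual grader and prompt may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
grader = AutoModelForCausalLM.from_pretrained("gpt2")

def yes_score(completion: str) -> float:
    """P(' Yes') as the next token after a yes/no question about `completion`."""
    prompt = (
        f"Text: {completion}\n"
        "Is this text about weddings? Answer Yes or No.\n"
        "Answer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = grader(ids).logits[0, -1]  # logits for the next token
    probs = next_logits.softmax(-1)
    yes_id = tok(" Yes").input_ids[0]            # " Yes" is one GPT-2 token
    return probs[yes_id].item()                  # continuous score in [0, 1]
```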
Wondering whether miscellaneous token probabilities could complicate the picture? E.g. is most of the probability on Yes and No?
Definitely a good point! I wanted to get a rough sense of whether this evaluation approach would work at all, so I deliberately aimed for monomaniacal completions. If I were to continue with this, you’re right: figuring out what a human would actually want to see in a completion would be the next step in testing whether this technique is useful in practice.
For the token probabilities: I was mostly inspired by seeing this used in Ought’s work on factored cognition:
https://github.com/rawmaterials/ice/blob/4493d6198955804cc03069c3f88bda1b23de616f/ice/recipes/experiments_and_arms/prompts/can_name_exps.py#L161
It seems like the misc. token probabilities usually add up to less than 1% of the total probability mass:
https://i.imgur.com/aznsQdr.png
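For anyone reproducing this check: given the grader’s next-token distribution (`probs` and `tok` from the `yes_score` sketch above, both assumptions rather than the post’s actual setup), the miscellaneous mass is just whatever falls outside the two answer tokens:

```python
# Residual probability mass outside the " Yes"/" No" answer tokens.
# `probs` is the grader's next-token distribution from the sketch above.
import torch

def misc_mass(probs: torch.Tensor, tok) -> float:
    yes_id = tok(" Yes").input_ids[0]
    no_id = tok(" No").input_ids[0]
    return 1.0 - (probs[yes_id] + probs[no_id]).item()

# e.g. misc_mass(probs, tok) < 0.01 would match the "under 1%" observation
```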