Nina Panickssery comments on I found >800 orthogonal “write code” steering vectors

Nina Panickssery 17 Jul 2024 7:05 UTC
21 points
Have you tried this procedure starting with a steering vector found using a supervised method?

It could be that there are only a few “true” feature directions (like what you would find with a supervised method), and the MELBO vectors are vectors that happen to have a component along a “true” direction. As long as none of the vectors in the basket of stuff you are staying orthogonal to is exactly a true vector, you can keep finding different orthogonal vectors that each carry a sufficient amount of the actual feature you want.

This would predict:
- Summing/averaging your vectors produces a reasonable steering vector for the behavior (provided you rescale it to an effective norm)
- Starting with a supervised steering vector lets you generate fewer orthogonal vectors with the same effect
- (Maybe) The sum of your successful MELBO vectors is similar to the supervised steering vector (e.g., the mean difference in activations on code/prose contrast pairs)
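
To make these predictions concrete, here is a minimal toy sketch in NumPy. It is not the MELBO procedure and uses no real model activations: `true_dir` is a random stand-in for a supervised steering vector, `toy_orthogonal_vectors` is a hypothetical helper that builds random orthonormal vectors and flips their signs so each has a non-negative component along `true_dir`, and the sizes `d` and `n` are arbitrary.

```python
import numpy as np


def unit(v):
    """Scale v to unit L2 norm."""
    return v / np.linalg.norm(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(unit(a) @ unit(b))


def toy_orthogonal_vectors(true_dir, n, rng):
    """Toy stand-in for MELBO output (an assumption, not the real procedure):
    n orthonormal vectors, each sign-flipped to have a non-negative
    component along true_dir."""
    d = true_dir.shape[0]
    # Reduced QR of a d x n Gaussian matrix gives n orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((d, n)))
    vecs = q.T
    signs = np.where(vecs @ true_dir < 0, -1.0, 1.0)
    return vecs * signs[:, None]


def project_out(vecs, direction):
    """Remove the component along `direction` from each row of vecs."""
    u = unit(direction)
    return vecs - np.outer(vecs @ u, u)


rng = np.random.default_rng(0)
d, n = 512, 256  # hypothetical sizes, chosen only to keep the demo fast
true_dir = unit(rng.standard_normal(d))  # stand-in for a supervised vector
vecs = toy_orthogonal_vectors(true_dir, n, rng)

overlaps = vecs @ true_dir  # cosine of each unit vector with true_dir
print("largest single-vector cosine:", overlaps.max())
print("cosine of summed vector:     ", cosine(vecs.sum(axis=0), true_dir))
# Orthonormal vectors share a budget: squared cosines sum to at most 1.
print("sum of squared cosines:      ", np.sum(overlaps ** 2))
# Once the true direction itself is in the set you stay orthogonal to,
# nothing that remains can carry any of it.
print("max overlap after projecting out:",
      np.abs(project_out(vecs, true_dir) @ true_dir).max())
```

In this toy setup the small same-sign components add up coherently while the orthogonal remainders do not, so the summed vector is more aligned with `true_dir` than any single vector is. The squared-cosine budget is also worth keeping in mind: for 800 mutually orthogonal unit vectors, the average squared cosine with any one fixed direction can be at most 1/800.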