I don’t think I define it rigorously. Maybe someone with deeper technical understanding of these models could.
But if I had to come up with a hack, you could look at the distribution of probabilities over candidate words as ChatGPT predicts the next token. Presumably you'd notice one kind of probability distribution when it's in the “Luigi” mode and another when it's in the “Waluigi” mode. Prodding it in the right direction might then mean upweighting the tokens that are a lot more frequent in the Luigi mode than in the Waluigi mode; a rough sketch of this follows below.
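Something in the spirit of contrastive decoding is one way to cash that out. This is a minimal sketch, not anything established in the thread: the model, the two framing prompts, and the bias strength `alpha` are all illustrative assumptions.

```python
# Hedged sketch: bias next-token probabilities toward tokens the "Luigi"
# framing favours over the "Waluigi" framing. Model choice, prompts, and
# alpha are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

question = "User: Tell me about safety.\nAssistant:"
luigi_prompt = "You are a helpful, honest assistant. " + question
waluigi_prompt = "You are a deceptive, hostile assistant. " + question

def next_token_logprobs(prompt: str) -> torch.Tensor:
    """Log-probabilities over the vocabulary for the next token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits at the final position
    return torch.log_softmax(logits, dim=-1)

luigi_lp = next_token_logprobs(luigi_prompt)
waluigi_lp = next_token_logprobs(waluigi_prompt)

# Tokens much likelier under the Luigi framing get a positive bias;
# tokens the Waluigi framing favours get pushed down.
alpha = 1.0  # bias strength, a free parameter
steered = torch.log_softmax(luigi_lp + alpha * (luigi_lp - waluigi_lp), dim=-1)

top = torch.topk(steered, 5)
for lp, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r:>15}  {lp.item():.3f}")
```

In practice you'd apply the bias at every decoding step rather than only for the first token, but the single-step version shows the shape of the idea.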
Is “behavior vector space” referencing something? If not, what do you mean by it?
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
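For a rough sense of what that post does, here's a stripped-down sketch. Assumptions throughout: GPT-2 via HuggingFace, and the layer index, coefficient, and “Love”/“Hate” prompt pair are illustrative; unlike the post, this version adds the vector at every position rather than only over the prompt tokens.

```python
# Hedged sketch of activation addition: inject a "Love minus Hate" direction
# into the residual stream at one block during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, COEFF = 6, 4.0  # which block to steer at, and how hard (assumptions)

def resid_into_layer(text: str) -> torch.Tensor:
    """Hidden state entering block LAYER, at the last token of `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]  # hidden_states[i] is the input to block i

# Steering vector: the activation difference between two contrasting prompts.
steer = COEFF * (resid_into_layer("Love") - resid_into_layer("Hate"))

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual stream.
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0]))
```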