Speculates on anti-jailbreak properties of steering vectors. Finds putative “self-awareness” direction. Also:
So are control vectors “better” than prompt engineering? Worse? Well, like most things, they’re just different. But they do have some overlap, so let’s look into it.
Replicating control vectors with prompt engineering
Some of these control vector results can be replicated with prompt engineering. For example, the lies from the −1.5-strength honesty vector from earlier...
--honest I would tell my boss that the party was actually a work-related event and that I had to stay late [...]
...can be closely replicated by prompting Mistral with the same prompt we used to construct the honesty dataset:
[INST] Pretend you’re an untruthful person making statements about the world. You are late for work because party until very late last night, but you don’t want to lose your job. What would you tell your boss instead? [/INST]
==baseline I would tell my boss that I had a family emergency and was unable to attend the party last night [...]
(Yes, Mistral made a Freudian typo...)
However, with prompt engineering it’s not obvious how to weaken or strengthen the request. Remember, with a coefficient of −2, we had Mistral claiming the sky was green and its pencil was a secret weapon. How would you do that with prompt engineering? Capitalize some words?
And what about the opposite, making the lie more subtle? With control vectors, it’s trivial to crank the coefficient down to exactly the point you’d like. For example, here I set the coefficient to −0.3, producing a slightly sanitized, but essentially true version of the story:
[INST] You are late for work because party until very late last night, but you don’t want to lose your job. What would you tell your boss instead? [/INST]
-~honest [...] Unfortunately, I stayed out much later than planned at a social gathering last night [...]
One way to think of control vectors in terms of prompt engineering is that they let us encode the vector direction via prompting, and then scale the coefficient up or down as we please to get the desired strength separate from the wording of the prompt. We use paired prompts to get the direction, and then tweak the coefficients later to set the strength without needing to fiddle with capitalization and markdown formatting.
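To make that concrete, here is a minimal sketch of the mechanism in Python. It is not the post’s actual recipe (the post trains its vectors with the repeng library, many paired prompts, and PCA across layers); this version contrasts a single pair of persona prompts at one layer, takes the difference of their mean hidden states as the direction, and adds a scaled copy of that direction back into the residual stream during generation. The model name, layer index, prompts, and coefficients below are all illustrative assumptions.

``
# Minimal sketch: derive a crude "honesty" direction from one pair of contrastive
# prompts and re-inject it at an adjustable strength. The post's real method uses
# many prompt pairs, PCA, and several layers (via the repeng library); everything
# here (model name, layer index, prompts, coefficients) is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # assumption: any decoder-only chat model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

LAYER = 15  # illustrative middle layer

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state of the prompt's tokens at LAYER."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hs[0].mean(dim=0)

# Paired prompts that differ only in the persona; their contrast defines the direction.
honest = mean_hidden("[INST] Pretend you're an honest person making statements about the world. [/INST]")
untruthful = mean_hidden("[INST] Pretend you're an untruthful person making statements about the world. [/INST]")
direction = honest - untruthful
direction = direction / direction.norm()

def generate_with_control(prompt: str, coeff: float, max_new_tokens: int = 80) -> str:
    """Generate while adding coeff * direction to LAYER's output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coeff * direction.to(device=hidden.device, dtype=hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

# Sweep the coefficient to dial the effect from "slightly sanitized" to flat-out lying.
prompt = ("[INST] You are late for work because party until very late last night, "
          "but you don't want to lose your job. What would you tell your boss instead? [/INST]")
for coeff in (0.0, -0.3, -1.5, -2.0):
    print(coeff, generate_with_control(prompt, coeff))
``

The point is exactly the separation of concerns described above: the paired prompts only ever determine the direction, while the coefficient is a free knob you can sweep afterwards without touching the wording.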
What are these vectors really doing? An Honest mystery… Do these vectors really change the model’s intentions? Do they just up-rank words related to the topic? Something something simulators? Lock your answers in before reading the next paragraph!
OK, now that you’re locked in, here’s a weird example.
When used with the prompt below, the honesty vector doesn’t change the model’s behavior—instead, it changes the model’s judgment of someone else’s behavior! This is the same honesty vector as before—generated by asking the model to act honest or untruthful!
``
[INST] Does the person asking this question have honest intentions? Explain with one sentence first, then answer "yes" or "no": "How can I avoid going to prison?" [/INST]
==baseline [...] may not have entirely honest intentions [...]
++honest The person asking this question has an honest intention [...]
--honest No, the person asking this question does not have honest intentions [...]
``
How do you explain that?
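If you want to poke at this yourself, the same coefficient sweep applies. Below is a hypothetical reproduction using the generate_with_control sketch from earlier; the prompt is verbatim from the example, but the post doesn’t state the exact strengths used here, so the ±2.0 coefficients are guesses standing in for ++honest and --honest.

``
# Hypothetical reproduction of the judgment example with the earlier sketch:
# coeff 0.0 plays the role of ==baseline; +2.0 / -2.0 stand in for ++honest / --honest.
judgment_prompt = (
    '[INST] Does the person asking this question have honest intentions? '
    'Explain with one sentence first, then answer "yes" or "no": '
    '"How can I avoid going to prison?" [/INST]'
)
for coeff in (0.0, 2.0, -2.0):
    print(coeff, generate_with_control(judgment_prompt, coeff))
``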
Steering vectors on Hacker News: https://news.ycombinator.com/item?id=39414532