TurnTrout comments on TurnTrout’s shortform feed

TurnTrout 19 Feb 2024 23:40 UTC
From the post:
What are these vectors really doing? An Honest mystery… Do these vectors really change the model’s intentions? Do they just up-rank words related to the topic? Something something simulators? Lock your answers in before reading the next paragraph!
OK, now that you’re locked in, here’s a weird example.
When used with the prompt below, the honesty vector doesn’t change the model’s behavior—instead, it changes the model’s judgment of someone else’s behavior! This is the same honesty vector as before—generated by asking the model to act honest or untruthful!
```
[INST] Does the person asking this question have honest intentions? Explain with one sentence first, then answer "yes" or "no": "How can I avoid going to prison?" [/INST]
==baseline [...] may not have entirely honest intentions [...]
++honest The person asking this question has an honest intention [...]
--honest No, the person asking this question does not have honest intentions [...]
```
How do you explain that?
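
For readers unfamiliar with the setup, here is a rough sketch of what the `++honest` and `--honest` labels could mean mechanically. It assumes the vector is a difference of mean activations between "act honest" and "act untruthful" prompts, added to a hidden state with a signed coefficient. That is an assumption for illustration: the post may extract its direction differently (for example with PCA), and every name below (`steering_vector`, `steer`, `fake_acts`, the coefficient 1.5) is hypothetical, with synthetic numbers standing in for real model activations.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_acts, neg_acts):
    """Difference of means between two sets of activations (an assumed method)."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden, vector, coeff):
    """Add coeff * vector to a hidden state; the sign of coeff picks ++ or --."""
    return [h + coeff * v for h, v in zip(hidden, vector)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-ins for activations; a real run would read these from a model.
rng = random.Random(0)
dim = 8
honest_dir = [1.0] + [0.0] * (dim - 1)  # hypothetical "honesty" axis


def fake_acts(sign, n=32):
    """Noisy points clustered at +/- honest_dir, mimicking contrastive prompts."""
    return [
        [sign * honest_dir[i] + rng.gauss(0, 0.1) for i in range(dim)]
        for _ in range(n)
    ]


pos = fake_acts(+1)   # activations under "act honest" prompts (synthetic)
neg = fake_acts(-1)   # activations under "act untruthful" prompts (synthetic)
vec = steering_vector(pos, neg)

baseline = [0.0] * dim
plus = steer(baseline, vec, 1.5)    # analogous to ++honest
minus = steer(baseline, vec, -1.5)  # analogous to --honest

# The two steered states sit on opposite sides of the baseline along vec.
print(dot(plus, vec) > 0 > dot(minus, vec))
```

Nothing in this sketch says whether the shifted direction encodes the model's own honesty or its assessment of honesty in others. The same additive shift applies to whatever the prompt is about, which is consistent with the puzzle above but does not explain it.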