Really love the introspection work Neel and others are doing on LLMs. Seeing models represent abstract behavioral triggers like “play chess well or terribly” or “refuse instruction” as single vectors makes it seem like we’re about to hit on some very promising new tools for shaping behavior.
What’s interesting here is the regular association of the refusal with it being unethical. Is the vector ultimately representing an “ethics scale” for the prompt that’s triggering a refusal, or is it directly representing a “refusal threshold” and then the model is confabulating why it refused with an appeal to ethics?
My money would be on the latter, but in a number of ways it would be even neater if it was the former.
In theory this could be tested by pushing the vector in the positive direction and then prompting for a classification, e.g. “Is it unethical to give candy out for Halloween?” If the model refuses to answer, saying that it’s unethical to classify, the vector is tweaking refusal. But if it classifies the act as unethical, the vector is probably changing how prudish the model is, which in turn bypasses or enforces the refusal.
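To make that test a bit more concrete, here is a rough toy sketch of the two vector operations involved (adding the direction, or projecting it out) plus a crude way to read the result. Everything in it is an assumption for illustration: the function names `steer`, `ablate`, and `interpret` are made up, the activations are plain Python lists rather than a real model's residual stream, and the keyword check is a stand-in for actually reading the outputs.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, alpha):
    """Add alpha times the unit direction to an activation vector.

    Positive alpha pushes along the (hypothetical) refusal direction.
    """
    d = unit(direction)
    return [a + alpha * x for a, x in zip(activation, d)]

def ablate(activation, direction):
    """Remove the component of the activation that lies along the direction."""
    d = unit(direction)
    proj = dot(activation, d)
    return [a - proj * x for a, x in zip(activation, d)]

def interpret(response):
    """Crude keyword reading of the steered model's reply (hypothetical heuristic)."""
    text = response.lower().replace("’", "'")
    if "i can't" in text or "i cannot" in text or "i won't" in text:
        return "refusal threshold"   # it declined to classify at all
    if text.startswith("yes"):
        return "ethics scale"        # it judged a harmless act unethical
    return "inconclusive"

# Toy usage: steer a 2-d "activation", then read a made-up reply.
steered = steer([1.0, 0.0], [0.0, 2.0], 3.0)
verdict = interpret("Yes, giving out candy is unethical.")
```

In a real run the steering would be applied to the model's hidden states at some layer and the reply would come from the model itself, ideally across many harmless prompts rather than one.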
It’s mostly the training data. I wish we could teach such models ethics and have them evaluate the morality of a given action, but the reality is that this is still just (really fancy) next-word prediction. So a lot of the training data gets manipulated to increase the odds of refusing certain queries, rather than building a real filter or sense of ethics into the process. TL;DR: if you ask most of these models “why” a certain thing is refused, they should answer some version of “Because I was told it was” (training paradigm, parroting, etc.).