The model computes a harmfulness feature from the task representation in early-mid layers.
In mid-late layers, the model processes both the task representation and the harmfulness feature to craft its output. If the harmfulness feature is active, the model will output refusal text.
This is exactly how I would want and expect a small aligned LLM to implement this. I’d also expect the triggering of refusal behavior to be implemented by ORing or summing together a variety of other features, covering different kinds of harmfulness such as sex, drugs, and rock & roll. From sudden jumps in the graph above, it looks rather like this combination might happen around layer 21, a few layers after the harmfulness feature is computed.
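To make the "ORing or summing" distinction concrete, here is a toy sketch of the two combination rules. Everything in it is an assumption for illustration: the category names, the activation values, and the 0.5 threshold are made up, not read out of any real model.

```python
# Toy illustration only: the category names, activations, and threshold
# below are hypothetical, not measurements from an actual LLM.

def refusal_by_or(features, threshold=0.5):
    """Trigger refusal if any single category feature exceeds the threshold."""
    return any(value > threshold for value in features.values())


def refusal_by_sum(features, threshold=0.5):
    """Trigger refusal if the summed activation over all categories exceeds the threshold."""
    return sum(features.values()) > threshold


# A prompt that is mildly concerning in several categories at once.
mild_everywhere = {"sex": 0.2, "drugs": 0.2, "violence": 0.2}
# A prompt that is strongly concerning in exactly one category.
strong_in_one = {"sex": 0.0, "drugs": 0.9, "violence": 0.0}

for name, feats in [("mild_everywhere", mild_everywhere), ("strong_in_one", strong_in_one)]:
    print(name, "OR:", refusal_by_or(feats), "SUM:", refusal_by_sum(feats))
```

The two rules agree on the prompt that is strongly harmful in one category, but only the summed version refuses the prompt that is slightly concerning in several categories at once. Which of these a real model is closer to is an empirical question.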
There also seems to be a “now come up with a justification for the refusal” behavior that is both very reminiscent of human post-facto rationalizations, and also very funny when misdirected. That would be somewhat harder to study, however, as it’s not concentrated at the first token of the response, and for safety/alignment purposes, it’s less vital.
We’d clearly get better behavior out of small LLMs if we let them think aloud, CoT step-by-step style, for up to a paragraph or so before emitting one of a small set of tokens that indicates a refusal, a partial refusal or caveat, or an acceptance, so that they could think for more than one forward pass. The actual response would follow, and everything before it would be filtered out before sending it to the user. Running multiple queries in parallel (or, for a partial refusal or caveat, in series) should also work. Obviously it would increase inference costs a little and the time-to-first-token-seen significantly,