For example, is it really true that this would require condensing everything into categories? What about numerical scales, for instance? Interestingly, in February I did a very small-scale proof-of-concept of automated emotional labeling (along with other metadata), available at this link for a brief time. As you can see, it uses numerical emotion labeling, though I think that’s just the tip of the iceberg. What about many-dimensional labeling?
If we were just monitoring behavior, using scores would work fine. But we also want to control behavior. The simplest and most efficient way I can see to do that is via the token-banning mechanism, as long as we have arranged that one tag is one token. But we could also do it via a threshold on numerical scores: say, if we get a deceit score over the currently allowed threshold, then we back generation up some distance and try again until we get a score below the threshold. I can’t really see any cases where that fine a level of control would be needed, and for a coarse-grained version we could just use different tags for different bands of intensity level.
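To make those two options concrete, here is a rough decode-time sketch. Everything in it is hypothetical: the banned tag-token ids, the score_deceit() scorer, the threshold, and the back-up distance are stand-ins, not any particular model’s actual API.

```python
import torch

BANNED_TAG_IDS = [50261]   # hypothetical: the single token id for a banned tag such as <deceit>
DECEIT_THRESHOLD = 0.3     # hypothetical currently-allowed threshold on a numerical deceit score
BACKUP_TOKENS = 20         # hypothetical distance to back generation up before retrying

def ban_tags(logits: torch.Tensor) -> torch.Tensor:
    """Token-banning: make the banned tag tokens impossible to sample."""
    logits[..., BANNED_TAG_IDS] = float("-inf")
    return logits

def generate_with_threshold(model, tokenizer, prompt_ids, score_deceit,
                            max_new_tokens=200, max_retries=5):
    """Threshold control: if the continuation's deceit score exceeds the threshold,
    back generation up some distance and resample until it scores below it."""
    out = prompt_ids.clone()
    for _ in range(max_retries):
        while out.shape[-1] - prompt_ids.shape[-1] < max_new_tokens:
            logits = model(out).logits[:, -1, :]            # HF-style causal LM assumed
            probs = torch.softmax(ban_tags(logits), dim=-1)
            next_id = torch.multinomial(probs, num_samples=1)
            out = torch.cat([out, next_id], dim=-1)
        if score_deceit(tokenizer.decode(out[0])) <= DECEIT_THRESHOLD:
            break                                           # acceptable continuation
        out = out[:, :-BACKUP_TOKENS]                       # back up and try again
    return out
```

The token ban is a hard constraint applied at every step, while the score threshold is checked after the fact, which is why it needs the back-up-and-retry loop.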
With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?
This approach is somewhat similar to activation vectors. One significant difference is that an activation vector represents “one thing” in whatever semantic space the residual embeddings use (at around the layer where we’re applying it). A classifier (like the one the LLM learns for when to emit one of these tags) can have complex conditions and convoluted logic for its boundary, including various special cases (<criminality>, for example, has a definition technically as complex as all of the world’s legal codes combined), and a classifier can (with enough data) learn all the twists and turns of the boundary of the set it’s classifying, which could often be a lot more complex than what can be described by any single activation vector. You’d probably need to use something comparable to a LoRA in place of an activation vector to get as much descriptive capacity.
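For concreteness, here is roughly where the two mechanisms sit in code, assuming a PyTorch-style transformer: an activation vector is one fixed direction added to the residual stream at a chosen layer, while a LoRA-style adapter is a learned low-rank weight change, which is where the extra descriptive capacity would come from. The layer hook, scale, and rank are illustrative values, not tuned ones.

```python
import torch
import torch.nn as nn

def add_steering_vector(layer: nn.Module, vector: torch.Tensor, scale: float = 4.0):
    """Activation steering: add one fixed direction to the residual stream at this layer."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector          # a single direction: "one thing"
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

class LoRALinear(nn.Module):
    """LoRA-style adapter: base W plus a learned low-rank update B @ A, which can
    encode a far more convoluted decision boundary than any single direction."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T
```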
Also, learning the classifier for the tag means that the concepts needed to define the boundary (such as all of the world’s legal codes) need to be represented inside the learned LLM, guaranteeing that they’re also available for other behaviors implemented by the LLM to pay attention to and make use of. So this helps you shape the ways the LLM is learning to think, by adding another task for it to learn — unlike activation vectors, which can only use directions (linear combinations of things) that the LLM has already decided to put into its semantic embedding space, and which don’t modify the weight structure of the LLM at all.
On the other hand, the fact that this approach’s preclassifier needs to be designed, tested, run over the pretraining set, and then the LLM pretrained to distill the classification behavior into it makes it a lot less flexible and adjustable-on-the-fly than an activation vector approach. So both techniques might have their advantages.
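For what it’s worth, that inflexible part of the pipeline would look something like the sketch below: design and test the preclassifier offline, run it over the whole pretraining corpus to splice in tags, then pretrain on the tagged text so the classification behavior gets distilled into the LLM. The preclassifier interface, tag strings, and corpus format here are all made-up stand-ins.

```python
# Hypothetical outline of the tag-distillation pipeline. preclassifier() is assumed
# to yield (span_text, is_flagged) pairs; OPEN_TAG/CLOSE_TAG would be added to the
# tokenizer as single tokens, so emitting one becomes a learned classification decision.

OPEN_TAG, CLOSE_TAG = "<deceit>", "</deceit>"

def tag_document(text: str, preclassifier) -> str:
    """Wrap every span the (already designed and tested) preclassifier flags in tags."""
    pieces = []
    for span, is_flagged in preclassifier(text):
        pieces.append(f"{OPEN_TAG}{span}{CLOSE_TAG}" if is_flagged else span)
    return "".join(pieces)

def build_tagged_corpus(corpus, preclassifier):
    """Run the preclassifier over the whole pretraining set, one document at a time."""
    return (tag_document(doc, preclassifier) for doc in corpus)

# The LLM is then pretrained on build_tagged_corpus(...); changing the classifier
# afterwards means redoing this whole pass, hence the lack of on-the-fly adjustability.
```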
An activation vector is somewhat similar in effect to adding some text to the prompt, except that attention mechanisms can’t pay attention to specific parts of it.
I appreciate your thoughtful response! Apologies, in my sleep-deprived state I appear to have hallucinated some challenges I thought appeared in the article. Please disregard everything below “I think some of the downsides mentioned here are easily or realistically surpassable...” except for my point on “many-dimensional labeling.”
To elaborate, what I was attempting to reference was QNRs, which IIRC are just human-interpretable, graph-like embeddings. This could potentially automate the entire labeling flow and solve the “can categories/labels adequately express everything?” problem.