This is a fantastic article! It’s great to see that there’s work going on in this space, and I like that the approach is described in very easy to follow and practical terms.
I’ve been working on a very expansive approach/design for AI safety called safety-first cognitive architectures, which is vaguely like a language model agent designed from the ground up with safety in mind, except extensible to both present-day and future AI designs, and with a very sophisticated (yet achievable, and scalable from easy to hard) safety- and performance-minded architecture. I have intentionally not publicly published implementation details yet, but will send you a DM!
It seems like this concept is related to the “Federating Cognition” section of my article, specifically a point about the safety benefits of externalizing memory: “external memory systems can contain information on human preferences which AI systems can learn from and/or use as a reference or assessment mechanism for evaluating proposed goals and actions.” At a high level, this can affect both AI models themselves as well as model evaluations and the cognitive architecture containing models (the latter is mentioned at the end of your post). For various reasons, I haven’t written much about the implications of this work for AI models themselves.
I think some of the downsides mentioned here are easily or realistically surpassable. I’ll post a couple thoughts.
For example, is it really true that this would require condensing everything into categories? What about numerical scales, for instance? Interestingly, in February, I did a very-small-scale proof-of-concept regarding automated emotional labeling (along with other metadata), currently available at this link for a brief time. As you can see, it uses numerical emotion labeling, although I think that’s just the tip of the iceberg. What about many-dimensional labeling? I’d be curious to get your take on related work like Eric Drexler’s article on QNRs (which is unfortunately similar to my writing in that it may be high-level and hard to interpret); it’s one of the few works I can think of regarding interesting safety and performance applications of externalized memories.
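To make “many-dimensional labeling” a bit more concrete, here is a purely illustrative sketch of a record with several independent numerical label dimensions; the dimension names and values are hypothetical and not taken from the actual proof-of-concept linked above:

```python
# Purely illustrative: several independent numerical label dimensions per passage,
# rather than a single categorical tag. Names and ranges are made up for this sketch.
from dataclasses import dataclass

@dataclass
class EmotionLabels:
    valence: float    # -1.0 (very negative) .. 1.0 (very positive)
    arousal: float    #  0.0 (calm)          .. 1.0 (agitated)
    hostility: float  #  0.0 .. 1.0
    deceit: float     #  0.0 .. 1.0

passage_labels = EmotionLabels(valence=-0.4, arousal=0.7, hostility=0.6, deceit=0.1)
```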
With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?
For example, is it really true that this would require condensing everything into categories? What about numerical scales for instance? Interestingly, in February, I did a very-small-scale proof-of-concept regarding automated emotional labeling (along with other metadata), currently available at this link for a brief time. As you can see, it uses numerical emotion labeling, although I think that’s just the tip of the iceberg. What about many-dimensional labeling?
If we were just monitoring behavior, using scores would work fine. But we also want to control behavior. The simplest and most efficient way I can see to do that is via the token-banning mechanism, as long as we have arranged that one tag is one token. But we could also do it via a threshold on numerical scores: say, if we get a deceit score over the currently allowed threshold, then we back generation up some distance and try again until we get a score below the current threshold. I can’t really see any cases where that fine a level of control would be needed, and for a coarse-grained version we could just use different tags for different bands of intensity level.
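A minimal sketch of both control mechanisms, assuming (hypothetically) that each tag like <deceit> is a single token with a known id, and where generate_chunk and score_deceit are stand-ins for the model’s sampler and an external numerical scorer:

```python
# Sketch only: token-banning vs. score-threshold rollback, as described above.
import numpy as np

BANNED_TAG_IDS = [50001, 50002]  # hypothetical token ids for <deceit>, <criminality>

def ban_tag_tokens(logits: np.ndarray) -> np.ndarray:
    """Token-banning: make the banned tag tokens impossible to sample."""
    out = logits.copy()
    out[BANNED_TAG_IDS] = -np.inf
    return out

def generate_with_threshold(generate_chunk, score_deceit,
                            threshold=0.3, backup=50, max_retries=5):
    """Threshold control: back generation up some distance and retry
    until the deceit score falls below the currently allowed threshold."""
    text = ""
    for _ in range(max_retries):
        candidate = text + generate_chunk(text)              # extend the current text
        if score_deceit(candidate) <= threshold:             # accept if under the threshold
            return candidate
        text = candidate[:max(0, len(candidate) - backup)]   # back up and try again
    return text
```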
With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?
This approach is somewhat similar to activation vectors. One significant difference is that an activation vector represents “one thing” in whatever semantic space the residual embeddings use (at around the layer where we’re applying it). A classifier (like the one the LLM learns for when to emit one of these tags) can have complex conditions and convoluted logic for its boundary, including various special cases (<criminality>, for example, has a definition technically as complex as all of the world’s legal codes combined), and a classifier can (with enough data) learn all the twists and turns of the boundary of the set it’s classifying, which could often be a lot more complex than what can be described by any single activation vector. You’d probably need to use something comparable to a LoRA in place of an activation vector to get as much capacity for descriptive complexity.
Also, learning the classifier for the tag means that the concepts needed to define the boundary (such as all of the world’s legal codes) need to be represented inside the learned LLM, guaranteeing that they’re also available for other behaviors implemented by the LLM to pay attention to and make use of. So this helps you shape the ways the LLM is learning to think, by adding another task for it to learn — unlike activation vectors which can only use directions (linear combinations of things) that the LLM has already decided to put into its semantic embedding space, and don’t modify the weight structure of the LLM at all.
On the other hand, the fact that this approach’s preclassifier needs to be designed, tested, run over the pretraining set, and then the LLM pretrained to distill the classification behavior into it makes it a lot less flexible and adjustable-on-the-fly than an activation vector approach. So both techniques might have their advantages.
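A rough sketch of the inflexible part of that pipeline, assuming a hypothetical preclassifier(passage) that returns tag strings like "<deceit>" (or an empty list):

```python
# Sketch only: splice the preclassifier's tags into the pretraining corpus so that
# pretraining distills the classification behavior into the LLM. Any change to the
# tag definitions means re-running this over the whole corpus and re-pretraining.
def tag_corpus(documents, preclassifier):
    tagged = []
    for doc in documents:
        for passage in doc.split("\n\n"):           # passage-level granularity, for illustration
            tags = preclassifier(passage)           # e.g. ["<deceit>"] or []
            tagged.append("".join(tags) + passage)  # prepend tags so the model learns to predict them
    return tagged
```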
An activation vector is somewhat similar in effect to adding some text to the prompt, except that attention mechanisms can’t pay attention to specific parts of it.
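For comparison, activation steering is roughly the following (a sketch assuming a GPT-2-style HuggingFace layout where model.transformer.h[layer] outputs a tuple whose first element is the residual stream):

```python
# Sketch of activation-vector steering: add one fixed direction to the residual stream
# at one layer on every forward pass. A learned tag classifier (or a LoRA) can carve
# out a far more convoluted decision boundary than this single direction can.
import torch

def add_steering_hook(model, steering_vector: torch.Tensor, layer: int, coeff: float = 5.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vector   # shift every position by the same direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    # returns a handle; call .remove() on it to undo the steering
    return model.transformer.h[layer].register_forward_hook(hook)
```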
I appreciate your thoughtful response! Apologies, in my sleep-deprived state, I appear to have hallucinated some challenges I thought appeared in the article. Please disregard everything below “I think some of the downsides mentioned here are easily or realistically surpassable...” except for my point on “many-dimensional labeling.”
To elaborate, what I was attempting to reference was QNRs, which IIRC are just human-interpretable, graph-like embeddings. This could potentially automate the entire labeling flow and solve the “can categories/labels adequately express everything?” problem.