This is an interesting direction, thank you for summarizing.
How about an intuition pump? How would it feel if someone did this to me through brain surgery? If we are editing the agent, then perhaps the target concept would come to mind and to tongue with ease? That seems in line with the steerability results from multiple mechanistic interpretability papers.
I’ll note that those mechanistic steering papers usually achieve around 70% accuracy on out-of-distribution behaviors, which is probably not high enough. So while we know how to intervene and steer a little, we still have some way to go.
It’s worth mentioning that we can elicit these concepts in either the agent, the critic/value network, or the world model. Each approach would lead to different behavior. Eliciting them in the agent seems like agent steering, eliciting in the value network like positive reinforcement, and in the world model like hacking its understanding of the world and changing its expectations.
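For concreteness, here is roughly what I picture by “eliciting a learned concept and steering with it” — a minimal sketch in PyTorch, assuming a network whose layers you can hook and whose activations are shaped (batch, positions, hidden); the difference-of-means concept vector, the scale, and the `encode` helper are illustrative stand-ins, not any particular paper’s recipe. The same hook could in principle be attached to the agent, the critic, or the world model, which is where the behavioral differences above would come from.

```python
import torch

def concept_direction(model, layer, concept_prompts, baseline_prompts, encode):
    """Estimate a concept direction as the difference of mean activations
    between concept-relevant prompts and baseline prompts."""
    captured = {}

    def grab(module, inputs, output):
        captured["h"] = output[0] if isinstance(output, tuple) else output

    handle = layer.register_forward_hook(grab)

    def mean_activation(prompts):
        total = None
        with torch.no_grad():
            for p in prompts:
                model(encode(p))
                h = captured["h"].float().mean(dim=(0, 1))  # average over batch and positions
                total = h if total is None else total + h
        return total / len(prompts)

    direction = mean_activation(concept_prompts) - mean_activation(baseline_prompts)
    handle.remove()
    return direction / direction.norm()

def steer(layer, direction, scale=4.0):
    """Add the concept direction to the layer's output on every forward pass."""
    def nudge(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction
    return layer.register_forward_hook(nudge)  # call .remove() on the handle to stop steering
```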
Notably, John Wentworth’s plan has a fallback strategy. If we can’t create exact physics-based world models, we could fall back on using steerable world models. Here, we steer them by eliciting learned concepts. So his fallback plan also aligns well with the content of this article.
Thanks for the thoughts!
If someone did this to you, it would be the “plan for mediocre alignment”, since you’re a brain-like GI. They’d say “think about doing whatever I tell you to do”, then you’d wake up and discover that you absolutely love the idea of doing what that person says.
What they’d have done while you were asleep is whatever interpretability work they could manage, to verify that you were really thinking about that concept. Then they re-instantiated that same brain state and induced heavy synaptic strengthening in all the synapses into the dopamine system while you were activating that concept (firing all the neurons that represent it) in your cortex.
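To make that wiring change concrete, here is a toy sketch — emphatically a cartoon, not anyone’s actual architecture: the “cortex” is a flat activation vector, the concept units are assumed to have already been located by interpretability, and a linear critic stands in for the dopamine system.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 1000                                           # toy "cortex" size
concept_units = rng.choice(n_units, 50, replace=False)   # located via interpretability (assumed)
critic_w = rng.normal(0.0, 0.01, n_units)                # synapses into the "dopamine system" / critic

def instill_value(critic_w, concept_units, strength=1.0):
    """Re-instantiate the concept (fire its neurons) and heavily strengthen
    exactly those active synapses onto the critic, Hebbian-style."""
    activation = np.zeros(n_units)
    activation[concept_units] = 1.0                      # firing all the neurons that represent it
    return critic_w + strength * activation              # potentiation gated by activity

critic_w = instill_value(critic_w, concept_units)

# Afterwards, any thought that lights up the concept is scored as highly rewarding:
thought = np.zeros(n_units)
thought[concept_units[:10]] = 1.0
print(critic_w @ thought)                                # large positive value: "I love this idea"
```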
I agree that we’re not quite at the level of doing this safely. But if it’s done with sub-human-level AGI, you can probably get multiple attempts in a “boxed” state, and deploy whatever interpretability measures you can to see if it worked. It wouldn’t need anything like Davidad’s physics simulation (does Wentworth really also call for this?), which is good, because that seems wildly unrealistic to me. This approach is still vulnerable to deceptive alignment if your techniques and interpretability aren’t good enough.
I don’t think we really can apply this concept to choosing values in the “agent”, if I’m understanding what you mean. I’m only applying the concept to selecting goal representations in a critic or other steering subsystem. You could apply the same concept to selecting goals in a policy or actor network, to the extent they have goal representations. It’s an interesting question. I think current instantiations only have goals to a very vague and limited extent. The concept doesn’t extend to selecting actions you want, since actions can serve different goals depending on context.
See my footnote on why I don’t think constitutional AI really is a GSLK approach. But constitutional AI is about the closest you could come to selecting goal representations in an actor or policy network; you need a separate representation of those goals to do the training, like the “constitution” in CAI, for it to count as “selecting”.
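To illustrate what I mean by a separate representation doing the selecting, here is a schematic of CAI’s critique-and-revision phase — a sketch only, with a hypothetical `llm` completion function and a one-line constitution standing in for the real list of principles.

```python
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def revise_with_constitution(llm, prompt):
    """Generate training data by critiquing and revising a response against
    an explicit, separately represented set of goals (the constitution)."""
    response = llm(prompt)
    for principle in CONSTITUTION:
        critique = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response by this principle."
        )
        response = llm(
            f"Critique: {critique}\nOriginal response: {response}\n"
            "Rewrite the response to address the critique."
        )
    return response  # (prompt, revised response) pairs then become fine-tuning data
```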
I realize this stance conflicts with the idea of doing RLHF on an LLM for alignment. I tend to agree with critics that it’s pretty much a misuse of the term “alignment”. The LLM doesn’t have goals in the same strong sense that humans do, so you can’t align its goals. LLMs do have sort of implicit or simulated goals, and I do realize that these could hypothetically make them dangerous. But I just don’t think that’s a likely first route to dangerous AGI when it’s so easy to add goals and make LLMs into agentic language model cognitive architectures.