If someone did this to you, it would be the “plan for mediocre alignment” since you’re a brainlike GI. They’d say “think about doing whatever I tell you to do”, then you’d wake up and discover that you absolutely love the idea of doing what that person says.
What they’d have done while you were asleep is whatever interpretability work they could manage, to verify that you were really thinking about that concept. Then they re-instantiated that same brain state, and induced heavy synaptic strengthening in all the synapses into the dopamine system while you were activating that concept (firing all the neurons that represent it) in your cortex.
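The procedure can be caricatured in a few lines of code. This is purely an illustrative toy, not a claim about real neuroscience or any proposed implementation: concept units stand in for cortical representations, a single linear "valuation" unit stands in for the dopamine system, and a Hebbian-style weight bump stands in for the induced synaptic strengthening. All names and numbers here are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n_concepts = 8   # toy "cortical" concept units
target = 3       # index of the concept being induced as a goal

# Synapses from concept units into the toy "dopamine"/valuation unit.
w = rng.normal(0.0, 0.1, n_concepts)

def valuation(activation):
    """Scalar 'dopamine' response to a pattern of concept activations."""
    return float(w @ activation)

# Step 1: the system activates the target concept. A one-hot vector
# stands in for "firing all the neurons that represent it".
activation = np.zeros(n_concepts)
activation[target] = 1.0

# Step 2: an interpretability check would go here, verifying that this
# activation pattern really encodes the intended concept (omitted).

# Step 3: Hebbian-style strengthening of the synapses into the
# valuation unit while the concept is active.
eta = 5.0
w += eta * activation

# Afterward, activating the induced concept produces a strongly
# positive valuation relative to every other concept.
print(valuation(activation))
```

The point of the toy is that the intervention changes only the steering circuitry (the weights into the valuation unit), not the concept representations themselves, which is what distinguishes this from retraining the whole system.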
I agree that we’re not quite at the level of doing this safely. But if it’s done with sub-human level AGI, you can probably get multiple attempts in a “boxed” state, and deploy whatever interpretability measures you can to see if it worked. It wouldn’t need anything like Davidad’s physics simulation (does Wentworth really also call for this?), which is good because that seems wildly unrealistic to me. This approach is still vulnerable to deceptive alignment if your techniques and interpretability aren’t good enough.
I don’t think we can really apply this concept to choosing values in the “agent,” if I’m understanding what you mean. I’m only applying the concept to selecting goal representations in a critic or other steering subsystem. You could apply the same concept to selecting goals in a policy or actor network, to the extent they have goal representations. It’s an interesting question; I think current instantiations only have goals in a very vague and limited sense. The concept doesn’t extend to selecting actions you want, since actions can serve different goals depending on context.
See my footnote on why I don’t think constitutional AI really is a GSLK approach. But constitutional AI is about the closest you could come to selecting goal representations in an actor or policy network; to call it “selecting,” you need a separate representation of those goals to do the training, like the “constitution” in CAI.
I realize this stance conflicts with the idea of doing RLHF on an LLM for alignment. I tend to agree with critics that it’s pretty much a misuse of the term “alignment”: the LLM doesn’t have goals in the same strong sense that humans do, so you can’t align its goals. LLMs do have sort of implicit or simulated goals, and I do realize that these could hypothetically make them dangerous. But I just don’t think that’s a likely first route to dangerous AGI when it’s so easy to add goals and make LLMs into agentic language model cognitive architectures.
Thanks for the thoughts!