This is great. I think this is an important consideration because the new school of deep network thinkers is excited that we have some actual promising approaches to alignment. These include shard theory, Steve Byrnes’ Plan for mediocre alignment of brain-like [model-based RL] AGI, and the potential for natural language chain-of-thought alignment as in my post Capabilities and alignment of LLM cognitive architectures. All of these are promising, but none really address the stability problem.
I think this is exactly the problem I was pointing to in my post The alignment stability problem and my 2018 paper Goal changes in intelligent agents.
The one direct approach to the stability problem in deep networks that I know of is Alex Turner's A shot at the diamond-alignment problem. I think this sort of reflective stability in a deep network or similarly messy agent is only likely to hold for the most centrally held goal (or shard); for everything else we have no idea. Context matters, as you point out, so even the single biggest goal isn't guaranteed to stay stable under reflection after a lot of learning and perhaps successor design.
I hope the links are useful. I’m also working on another post on exactly this topic, and I’ll cite this one.
I’ve now read your alignment stability post and the goal changes in intelligent agents post, and they’re pretty good. In the 2018 post, I liked how you framed all the previous alignment attempts as value-reflectivity-adjacent. For some of them, like motivation drift or some examples of representation hacking, I think I would have categorized the failure mode more along the lines of Goodhart, though there is some sense in which, seen from the outside, Goodhart looks a lot like value drift. Like, as the agent gets smarter and thinks more about what it wants, it will look more and more like it doesn’t care about the target goal. But from the agent’s point of view, no goal changes are happening; it’s just getting better at what it was already doing.
Yeah, I think this area is super important to think about, and this post is possibly my most important post to date. I’m glad you’re also thinking about this, and I'm excited to read what you have written.