I’ve now read your alignment stability post and the goal changes in intelligent agents post, and they’re pretty good. In the 2018 post, I liked how you framed all the previous alignment attempts as value-reflectivity adjacent. For some of them, like motivation drift or some examples of representation hacking, I think I would have categorized the failure mode more along the lines of Goodhart, though there is some sense in which, seen from the outside, Goodhart looks a lot like value drift. Like, as the agent gets smarter and thinks more about what it wants, it will look more and more like it doesn’t care about the target goal. But from the agent’s point of view, no goal change is happening; it’s just getting better at what it was already doing.