Thanks, I really appreciate that! I’ve just finished an undergrad in cognitive science, so I’m glad that I didn’t make any egregious mistakes, at least.
“AGI won’t be just an RL system … It will need to have explicit goals”: I agree that this is very likely. In fact, the theory of ‘instrumental convergence’ often discussed here is an example of how an RL system could go from being comprised of low-level shards to having higher-level goals (such as power-seeking) that exert top-down influence. I think Shard Theory is correct about how very basic RL systems work, but I’m curious whether RL systems might naturally evolve higher-level goals and values as they become larger, more complex, and are deployed over longer time periods. And of course, as you say, there’s always the possibility of us deliberately adding explicit steering systems.
“shards can’t hide any complex computations...since these route through consciousness.”: Agree. Perhaps a case could be made for people being in denial about ideas they don’t want to accept, or for ‘doublethink’, where someone holds two contradictory views about the same subject. Maybe these could be considered different shards competing? Still, it seems a bit of a stretch, and it certainly doesn’t describe all our thinking.
“I think there’s another important error in applying shard theory to AGI alignment or human cognition in the claim reward is not the optimization target”: I think this is a very interesting area of discussion. I kind of wanted to delve into this further in the post, and talk about our aversion to wireheading, addiction and the reward system, and all the ways humans do and don’t differentiate between intrinsic rewards and more abstract concepts of goodness, but I figured that would be too much of a tangent haha. But overall, I think you’re right.