Excellent! I concur entirely. I had a 23-year career in computational cognitive neuroscience, focusing on the reward system and its interactions with other learning systems. Your explanations match my understanding.
This reminds me that I’ve never gotten around to writing any critique of shard theory.
In sum, I think Shard Theory is quite correct about how RL networks learn, and correct that they should be relatively easy to align. I think it is importantly wrong about how humans learn, and, relatedly, wrong about how easy it's likely to be to align a human-level AGI system. I think Shard Theory conflates AI with AGI, and assumes that we won't add more sophistication. That prediction could be right, but I think it's both probably wrong and unstated. This leads to the highly questionable conclusion that "AI is easy to control". Current AI is, but assuming that near-future systems will also be is highly overconfident.
Reading this brought the following formulation to mind: where Shard Theory is importantly wrong about alignment is that it applies to RL systems, but AGI won't be just an RL system. It won't be just a collection of learned habits, as RL systems are. It will need to have explicit goals, like humans do when they're using executive function (and conscious attention). I've tried to express the importance of such a steering subsystem for human cognitive abilities.
Having such systems changes the logic of competing shards somewhat. I think there’s some validity to this metaphor, but importantly, shards can’t hide any complex computations, like predictions of outcomes, from each other, since these route through consciousness. You emphasize that, correctly IMO.
The other difference between humans and RL systems is that explicit goals are chosen through a distinct process.
Finally, I think there's another important error in applying shard theory to AGI alignment or human cognition: the claim that reward is not the optimization target. That's true when there's no reflective process to refine theories of what drives reward. RL systems as they exist have no such process, so it's true for them that reward isn't the optimization target. But with better critic networks that predict reward, as humans have, more advanced systems are likely to figure out what reward is, and so make reward their optimization target.
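To make the critic point concrete, here's a minimal toy sketch (my own illustration, not from shard theory or any particular system): a tabular TD(0) critic that learns an explicit prediction of how much reward each state leads to. A bundle of learned habits never represents "what drives reward" as an object; a critic does, and that explicit representation is exactly the handle a more reflective system could use to make reward itself the target. The environment and state names here are made up for the example.

```python
import random

# Toy chain of 5 states; reward arrives only at the final state.
n_states = 5
true_reward = {s: float(s == n_states - 1) for s in range(n_states)}
value = {s: 0.0 for s in range(n_states)}   # the critic's reward predictions
alpha, gamma = 0.1, 0.9                     # learning rate and discount

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # A fixed, habit-like "policy": drift right at random, with no explicit goal.
        s_next = min(s + random.choice([0, 1]), n_states - 1)
        r = true_reward[s_next]
        # TD(0) update: nudge the prediction toward reward plus discounted next-state value.
        value[s] += alpha * (r + gamma * value[s_next] - value[s])
        s = s_next

# After training, the critic explicitly encodes how close each state is to reward,
# even though the policy that generated the experience never did.
print({s: round(v, 2) for s, v in value.items()})
```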
Thanks, I really appreciate that! I’ve just finished an undergrad in cognitive science, so I’m glad that I didn’t make any egregious mistakes, at least.
“AGI won’t be just an RL system … It will need to have explicit goals”: I agree that this is very likely. In fact, the theory of ‘instrumental convergence’ often discussed here is an example of how an RL system could go from being comprised of low-level shards to having higher-level goals (such as power-seeking) that have top-down influence. I think Shard Theory is correct about how very basic RL systems work, but am curious about whether RL systems might naturally evolve higher-level goals and values as they become larger and more complex and are deployed over longer time periods. And of course, as you say, there’s always the possibility of us deliberately adding explicit steering systems.
“shards can’t hide any complex computations...since these route through consciousness.”: Agree. Perhaps a case could be made for people being in denial about ideas they don’t want to accept, or ‘doublethink’, where people hold two contradictory views about a subject. Maybe these could be considered different shards competing? Still, it seems a bit of a stretch, and certainly doesn’t describe all our thinking.
“I think there’s another important error in applying shard theory to AGI alignment or human cognition in the claim reward is not the optimization target”: I think this is a very interesting area of discussion. I kind of wanted to delve into this further in the post, and talk about our aversion to wireheading, addiction and the reward system, and all the ways humans do and don’t differentiate between intrinsic rewards and more abstract concepts of goodness, but figured that would be too much of a tangent haha. But overall, I think you’re right.