What is “shard theory”?

I’ve written a lot about shard theory. I largely stand by these models and think they’re good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a “theory”? Is it a “frame”? Is it “a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner”?
I think this understandable confusion happened because my writing didn’t distinguish between:

(1) Shard theory itself: the mechanistic assumptions about internal motivational structure, which seem to imply certain conclusions, e.g. that AIs will care about a bunch of different things rather than just one thing (the toy sketch below is one way to picture this contrast);

(2) A bunch of Quintin Pope’s and my beliefs about how people work, where those beliefs were derived by modeling people as satisfying the assumptions of (1); and

(3) A bunch of alignment insights I had while thinking about shard theory, or about which problem decompositions are useful.
(People might be less excited to use the “shard” abstraction (1), because they aren’t sure whether they buy all this other stuff—(2) and (3).)
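To give one way of picturing the contrast in (1): below is a minimal toy sketch, in Python, of the difference between an agent that maximizes a single utility function and an agent whose behavior arises from several contextually activated “shards.” This is not the semiformal definition mentioned below, and it is not anyone’s actual proposal; every name in it (Shard, shard_agent_policy, the example shards) is hypothetical and chosen purely for illustration.

```python
# Toy illustration only: a hypothetical sketch of the contrast gestured at in (1).
# All names here are made up; this is not the semiformal shard-based-agent
# definition mentioned below.
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, float]   # crude stand-in for the agent's current situation
Action = str

# A single-utility agent: one scalar criterion ranks every action in every context.
def utility_agent_policy(utility: Callable[[Context, Action], float],
                         context: Context, actions: List[Action]) -> Action:
    return max(actions, key=lambda a: utility(context, a))

# A "shard", in this toy picture, is a contextually activated decision influence:
# it only bids on actions to the extent its activation condition holds.
@dataclass
class Shard:
    name: str
    activation: Callable[[Context], float]   # how strongly it fires in this context
    bid: Callable[[Context, Action], float]  # how much it favors a given action

def shard_agent_policy(shards: List[Shard], context: Context,
                       actions: List[Action]) -> Action:
    # Behavior is the aggregate of many crude influences, not one global score.
    def total_bid(a: Action) -> float:
        return sum(s.activation(context) * s.bid(context, a) for s in shards)
    return max(actions, key=total_bid)

# Example: an agent that cares about several things, each mattering only in some
# contexts. No single "true" utility function appears anywhere in the code.
shards = [
    Shard("thirst", lambda c: c.get("thirsty", 0.0),
          lambda c, a: 1.0 if a == "drink" else 0.0),
    Shard("social", lambda c: c.get("friend_nearby", 0.0),
          lambda c, a: 1.0 if a == "chat" else 0.0),
]
print(shard_agent_policy(shards, {"thirsty": 0.9, "friend_nearby": 0.2},
                         ["drink", "chat", "wander"]))  # -> "drink"
```

The only point of the sketch is structural: the second agent’s “values” live in many crude, situation-triggered influences rather than in one global objective. Whether that toy picture matches the intended mechanistic assumptions is exactly what a proper write-up of (1) would pin down.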
I think I can give an interesting and useful definition of (1) now, but I couldn’t do so last year. Maybe “offload shard theory intuitions onto LessWrong” was largely the right choice at the time, but I regret the confusion that has arisen. Maybe I’ll type up my shot at (1)—a semiformal definition of a shard-based agent—when I’m feeling better and more energetic.
Thanks to Alex Lawsen for a conversation which inspired this comment.
I have read a few articles about shard theory, but I still have trouble understanding what it is. It feels like either the “theory” is something trivial, or I am missing the important insights.
(The trivial interpretation would be something like: when people think about their values, they imagine their preferences in specific situations, rather than having a mathematical definition of a utility function.)
Strong encouragement to write about (1)!