What is “shard theory”?

I’ve written a lot about shard theory. I largely stand by these models and think they’re good and useful. Unfortunately, lots of people seem to be confused about what shard theory is. Is it a “theory”? Is it a “frame”? Is it “a huge bag of alignment takes which almost no one wholly believes except, perhaps, Quintin Pope and Alex Turner”?
I think this understandable confusion happened because my writing didn’t distinguish between:

(1) Shard theory itself: the mechanistic assumptions about internal motivational structure, which seem to imply certain conclusions, e.g. that AIs will care about a bunch of different things rather than just one thing (the toy sketch below is one way to picture this contrast);

(2) A bunch of Quintin Pope’s and my beliefs about how people work, where those beliefs were derived by modeling people as satisfying the assumptions of (1); and

(3) A bunch of alignment insights I had while thinking about shard theory, or about which problem decompositions are useful.
(People might be less excited to use the “shard” abstraction (1), because they aren’t sure whether they buy all this other stuff—(2) and (3).)
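To give one way of picturing the contrast in (1): below is a minimal toy sketch, in Python, of the difference between an agent that maximizes a single utility function and an agent whose behavior arises from several contextually activated “shards.” This is not the semiformal definition mentioned below, and it is not anyone’s actual proposal; every name in it (Shard, shard_agent_policy, the example shards) is hypothetical and chosen purely for illustration.

```python
# Toy illustration only: a hypothetical sketch of the contrast gestured at in (1).
# All names here are made up; this is not the semiformal shard-based-agent
# definition mentioned below.
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, float]   # crude stand-in for the agent's current situation
Action = str

# A single-utility agent: one scalar criterion ranks every action in every context.
def utility_agent_policy(utility: Callable[[Context, Action], float],
                         context: Context, actions: List[Action]) -> Action:
    return max(actions, key=lambda a: utility(context, a))

# A "shard", in this toy picture, is a contextually activated decision influence:
# it only bids on actions to the extent its activation condition holds.
@dataclass
class Shard:
    name: str
    activation: Callable[[Context], float]   # how strongly it fires in this context
    bid: Callable[[Context, Action], float]  # how much it favors a given action

def shard_agent_policy(shards: List[Shard], context: Context,
                       actions: List[Action]) -> Action:
    # Behavior is the aggregate of many crude influences, not one global score.
    def total_bid(a: Action) -> float:
        return sum(s.activation(context) * s.bid(context, a) for s in shards)
    return max(actions, key=total_bid)

# Example: an agent that cares about several things, each mattering only in some
# contexts. No single "true" utility function appears anywhere in the code.
shards = [
    Shard("thirst", lambda c: c.get("thirsty", 0.0),
          lambda c, a: 1.0 if a == "drink" else 0.0),
    Shard("social", lambda c: c.get("friend_nearby", 0.0),
          lambda c, a: 1.0 if a == "chat" else 0.0),
]
print(shard_agent_policy(shards, {"thirsty": 0.9, "friend_nearby": 0.2},
                         ["drink", "chat", "wander"]))  # -> "drink"
```

The only point of the sketch is structural: the second agent’s “values” live in many crude, situation-triggered influences rather than in one global objective. Whether that toy picture matches the intended mechanistic assumptions is exactly what a proper write-up of (1) would pin down.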
I think I can give an interesting and useful definition of (1) now, but I couldn’t do so last year. Maybe “offload shard theory intuitions onto LessWrong” was largely the right choice at the time, but I regret the confusion that has arisen. Maybe I’ll type up my shot at (1)—a semiformal definition of a shard-based agent—when I’m feeling better and more energetic.
Thanks to Alex Lawsen for a conversation which inspired this comment.
I have read a few articles about shard theory, but I still have trouble understanding what it is. It feels like either the “theory” is something trivial, or I am missing the important insights.
(The trivial interpretation would be something like: when people think about their values, they imagine their preferences in specific situations, rather than having a mathematical definition of a utility function.)
Strong encouragement to write about (1)!