Unpacking “Shard Theory” as Hunch, Question, Theory, and Insight
I read several shard theory posts and found the details interesting, but I couldn’t quite see the big picture. I’m used to hearing “theory” refer to a falsifiable generalization of data. It is typically stated as a sentence, paragraph, list, mathematical expression, or diagram early in a scientific paper. Theories range from extremely precise like Newton’s three laws of motion to extremely broad like Frankish’s consciousness illusionism (i.e., consciousness is an illusion).
I have also been generally confused about what it means to “solve alignment” before AGI arrives given that there is not (yet) consensus around any pre-AGI[1] formalization of the alignment problem itself: Wouldn’t any proposed solution still have a significant number of people (say, >10% of Alignment Forum users) who think that it doesn’t even pose the problem the right way? What should our “theories” even be aiming for?
With help from Nathan Helm-Burger, I think I now better understand what’s referred to as shard theory and want to share my understanding as an exercise in problem and solution formulation in alignment. I think “shard theory” refers to four sequential components: a Shard Hunch that motivates a two-part Shard Question, the first part of which is currently being answered by the gradual development of an actual Shard Theory of human values, which hopefully provides answers to the second part with Shard Insight that can be implemented in AI systems to facilitate alignment.[2] Namely:
Shard Hunch: A human brain is a general intelligence, and its intentions and behavior are reasonably aligned with its many shards of values (i.e., little bits of “contextual influences on decision-making”[3]). Maybe something like that alignment can work for AI too![4]
Shard Question: How does the human brain ensure alignment with its values, and how can we use that information to ensure the alignment of an AI with its designers’ values?
Shard Theory: The brain ensures alignment with its values by doing A, B, C, etc.
Shard Insight: We can ensure the alignment of an AI with its designers’ values by doing X, Y, Z, etc. mapped from shard theory.
This is exciting! Now when I read shard theory research, I feel like I properly understand it as gradually filling in A, B, C, X, Y, Z, etc. For example, Assumptions 1, 2, and 3 in “The Shard Theory of Human Values” are examples of A, B, and C, and the two arguments in “Reward is Not the Optimization Target” and “Human Values & Biases are Inaccessible to the Genome” are examples of X and Y. I also think the specific idea of a “shard” is less central to these claims than I thought; it seems the first of those posts could parsimoniously replace “shard” with “value” (in the dictionary sense) with very little meaning lost, and the latter two posts don’t even use the word. I wonder if something like “Brain-Inspired Alignment” would be a clearer label, at least until a central concept like shards emerges in the research.
Shard research is also at a very early stage, so it is inevitably less focused on stating and validating the falsifiable, nontrivial claims that could constitute an actual shard theory (which is usually what we discuss in science). Instead, it seems to mostly be developing a language for eventually specifying shard theory, much like how Rubin’s potential outcomes (POs) and Pearl’s directed acyclic graphs (DAGs) were important developments in causality research because they allowed for the clear statement of falsifiable, nontrivial causal theories. Pope and Turner also use the terms “paradigm” and “frame,” which I think are more fitting for what they have done so far than shard “theory” per se, though less specific than “language.”
For example, the post “Reward is Not the Optimization Target” and Paul Christiano’s reply seem better read not as claim and counter-claim, but as thinking about the most useful neuroscience-inspired way to define “reward,” “optimization,” etc. These discussions seem to involve some hand-waving and talking past each other, so I also wonder if more explicitly approaching shard theory as building a language, not as sharing an extant theory, would help us think more clearly. In any case, these meta-questions seem inevitable as the field of AI alignment advances and we come closer to developed theories and solutions, whatever that means.
[1] Post-AGI formalizations of alignment, such as thresholds for how much value persists, seem less controversial but also less useful than a pre-AGI formalization would be. And they still seem far from uncontroversial. For example, some make an appeal to moral nature, so to speak, to keep human value as close to its current path as possible while ensuring AI safety, while others see this as false or confused.
[2] Pope and Turner say in “The Shard Theory of Human Values” that ““Shard theory” also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment. … We don’t like this ambiguous usage. We would instead say something like “insights from shard theory.”” I take that to mean they do not include anything about AI alignment itself as shard theory. I think this will confuse many people because of how central AI alignment is to the shard theory project.
[3] This definition of value (i.e., shard) is unintuitively broad, as Pope and Turner acknowledge. I think precisifying and clarifying that will be an important part of building shard theory.
[4] The Shard Hunch is most clearly stated in the first blockquote in Turner’s “Looking Back on My Alignment PhD” and in Turner’s comment on “Where I Agree and Disagree with Eliezer.”