The shard theory of human values
TL;DR: We propose a theory of human value formation. According to this theory, the reward system shapes human values in a relatively straightforward manner. Human values are not e.g. an incredibly complicated, genetically hard-coded set of drives, but rather sets of contextually activated heuristics which were shaped by and bootstrapped from crude, genetically hard-coded reward circuitry.
We think that human value formation is extremely important for AI alignment. We have empirically observed exactly one process which reliably produces agents which intrinsically care about certain objects in the real world, which reflect upon their values and change them over time, and which—at least some of the time, with non-negligible probability—care about each other. That process occurs millions of times each day, despite genetic variation, cultural differences, and disparity in life experiences. That process produced you and your values.
Human values look so strange and inexplicable. How could those values be the product of anything except hack after evolutionary hack? We think this is not what happened. This post describes the shard theory account of human value formation, split into three sections:
Details our working assumptions about the learning dynamics within the brain,
Conjectures that reinforcement learning grows situational heuristics of increasing complexity, and
Uses shard theory to explain several confusing / “irrational” quirks of human decision-making.
Terminological note: We use “value” to mean a contextual influence on decision-making. Examples:
Wanting to hang out with a friend.
Feeling an internal urge to give money to a homeless person.
Feeling an internal urge to text someone you have a crush on.
That tug you feel when you are hungry and pass by a donut.
To us, this definition seems importantly type-correct and appropriate—see Appendix A.2. The main downside is that the definition is relatively broad—most people wouldn’t list “donuts” among their “values.” To avoid this counterintuitiveness, we would refer to a “donut shard” instead of a “donut value.” (“Shard” and associated terminology are defined in section II.)
I. Neuroscientific assumptions
The shard theory of human values makes three main assumptions. We think each assumption is pretty mainstream and reasonable. (For pointers to relevant literature supporting these assumptions, see Appendix A.3.)
Assumption 1: The cortex[1] is basically (locally) randomly initialized. According to this assumption, most of the circuits in the brain are learned from scratch, in the sense of being mostly randomly initialized and not mostly genetically hard-coded. While the high-level topology of the brain may be genetically determined, we think that the local connectivity is not primarily genetically determined. For more clarification, see [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain.
Thus, we infer that human values & biases are inaccessible to the genome:
It seems hard to scan a trained neural network and locate the AI’s learned “tree” abstraction. For very similar reasons, it seems intractable for the genome to scan a human brain and back out the “death” abstraction, which probably will not form at a predictable neural address. Therefore, we infer that the genome can’t directly make us afraid of death by e.g. specifying circuitry which detects when we think about death and then makes us afraid. In turn, this implies that there are a lot of values and biases which the genome cannot hardcode…
[This leaves us with] a huge puzzle. If we can’t say “the hardwired circuitry down the street did it”, where do biases come from? How can the genome hook the human’s preferences into the human’s world model, when the genome doesn’t “know” what the world model will look like? Why do people usually navigate ontological shifts properly, why don’t people want to wirehead, why do people almost always care about other people if the genome can’t even write circuitry that detects and rewards thoughts about people?
Assumption 2: The brain does self-supervised learning. According to this assumption, the brain is constantly predicting what it will next experience and think, from whether a V1 neuron will detect an edge, to whether you’re about to recognize your friend Bill (which grounds out as predicting the activations of higher-level cortical representations). (See On Intelligence for a book-long treatment of this assumption.)
In other words, the brain engages in self-supervised predictive learning: Predict what happens next, then see what actually happened, and update to do better next time.
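To make that predict-observe-update loop concrete, here is a minimal toy sketch in Python. Everything in it (the linear predictor, the random “sensory stream,” the learning rate) is an illustrative assumption, not a claim about how cortical circuits actually implement prediction.

```python
import numpy as np

# Minimal sketch of the predict-observe-update loop (Assumption 2). The linear
# predictor, random "sensory stream", and learning rate are illustrative
# assumptions, not claims about cortical implementation.

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)) * 0.1        # locally randomly initialized "circuitry"
observations = rng.normal(size=(100, 4))       # stand-in sensory stream

def predict(obs, w):
    return obs @ w                             # guess the next observation

learning_rate = 0.01
for t in range(len(observations) - 1):
    obs, next_obs = observations[t], observations[t + 1]
    error = predict(obs, weights) - next_obs   # see what actually happened
    weights -= learning_rate * np.outer(obs, error)   # update to do better next time
```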
Definition. Consider the context available to a circuit within the brain. Any given circuit is innervated by axons from different parts of the brain. These axons transmit information to the circuit. Therefore, whether a circuit fires is not primarily dependent on the external situation navigated by the human, or even on what the person senses at a given point in time. A circuit fires depending on whether its inputs[2]—the mental context—trigger it or not. This is what the “context” of a shard refers to.
Assumption 3: The brain does reinforcement learning. According to this assumption, the brain has a genetically hard-coded reward system (implemented via certain hard-coded circuits in the brainstem and midbrain). In some[3] fashion, the brain reinforces thoughts and mental subroutines which have led to reward, so that they will be more likely to fire in similar contexts in the future. We suspect that the “base” reinforcement learning algorithm is relatively crude, but that people reliably bootstrap up to smarter credit assignment.
Summary. Under our assumptions, most of the human brain is locally randomly initialized. The brain has two main learning objectives: self-supervised predictive loss (we view this as building your world model; see Appendix A.1) and reward (we view this as building your values, as we are about to explore).
II. Reinforcement events shape human value shards
This section lays out a bunch of highly specific mechanistic speculation about how a simple value might form in a baby’s brain. For brevity, we won’t hedge statements like “the baby is reinforced for X.” We think the story is good and useful, but don’t mean to communicate absolute confidence via our unhedged language.
Given the inaccessibility of world model concepts, how does the genetically hard-coded reward system dispense reward in the appropriate mental situations? For example, suppose you send a drunk text, and later feel embarrassed, and this triggers a penalty. How is that penalty calculated? By information inaccessibility and the absence of text messages in the ancestral environment, the genome isn’t directly hard-coding a circuit which detects that you sent an embarrassing text and then penalizes you. Nonetheless, such embarrassment seems to trigger (negative) reinforcement events… and we don’t really understand how that works yet.
Instead, let’s model what happens if the genome hardcodes a sugar-detecting reward circuit. For the sake of this section, suppose that the genome specifies a reward circuit which takes as input the state of the taste buds and the person’s metabolic needs, and produces a reward if the taste buds indicate the presence of sugar while the person is hungry. By assumption 3 in section I, the brain does reinforcement learning and credit assignment to reinforce circuits and computations which led to reward. For example, if a baby picks up a pouch of apple juice and sips some, that leads to sugar-reward. The reward makes the baby more likely to pick up apple juice in similar situations in the future.
Therefore, a baby may learn to sip apple juice which is already within easy reach. However, without a world model (much less a planning process), the baby cannot learn multi-step plans to grab and sip juice. If the baby doesn’t have a world model, then she won’t be able to act differently in situations where there is or is not juice behind her. Therefore, the baby develops a set of shallow situational heuristics which involve sensory preconditions like “IF juice pouch detected in center of visual field, THEN move arm towards pouch.” The baby is basically a trained reflex agent.
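Here is a toy sketch of what such a “trained reflex agent” might look like computationally: a handful of IF-THEN heuristics with sensory preconditions, where reward strengthens whichever heuristic just fired. The specific heuristics, numbers, and update rule are our own illustrative assumptions, not a model of real infant learning.

```python
# Toy sketch of a "trained reflex agent": shallow IF-THEN heuristics with
# sensory preconditions, strengthened when they fire shortly before reward.
# Heuristic names, numbers, and the update rule are illustrative assumptions.

heuristics = {
    "grab-juice-in-view": {"test": lambda ctx: ctx.get("juice_in_view", False),
                           "action": "move_arm_toward_pouch",
                           "strength": 0.2},
    "babble":             {"test": lambda ctx: True,
                           "action": "make_noise",
                           "strength": 0.1},
}

def act(ctx):
    # Fire the strongest heuristic whose sensory precondition matches the context.
    live = [h for h in heuristics.values() if h["test"](ctx)]
    return max(live, key=lambda h: h["strength"]) if live else None

def reinforce(heuristic, reward, lr=0.5):
    # Crude credit assignment: strengthen whatever computation just ran.
    if heuristic is not None:
        heuristic["strength"] += lr * reward

context = {"juice_in_view": True, "hungry": True}
fired = act(context)                                   # "grab-juice-in-view" wins
reward = 1.0 if (fired["action"] == "move_arm_toward_pouch"
                 and context["hungry"]) else 0.0       # stand-in sugar-reward circuit
reinforce(fired, reward)                               # more likely to fire here next time
```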
However, when the baby has a proto-world model, the reinforcement learning process takes advantage of that new machinery by further developing the juice-tasting heuristics. Suppose the baby models the room as containing juice within reach but out of sight. Then, the baby happens to turn around, which activates the already-trained reflex heuristic of “grab and drink juice you see in front of you.” In this scenario, “turn around to see the juice” preceded execution of “grab and drink the juice which is in front of me”, and so the baby is reinforced for turning around to grab the juice in situations where the baby models the juice as behind herself.[4]
By this process, repeated many times, the baby learns how to associate world model concepts (e.g. “the juice is behind me”) with the heuristics responsible for reward (e.g. “turn around” and “grab and drink the juice which is in front of me”). Both parts of that sequence are reinforced. In this way, the contextual heuristics exchange information with the budding world model.
A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events. For example, the juice-shard consists of the various decision-making influences which steer the baby towards the historical reinforcer of a juice pouch. These contextual influences were all reinforced into existence by the activation of sugar reward circuitry upon drinking juice. A subshard is a contextually activated component of a shard. For example, “IF juice pouch in front of me THEN grab” is a subshard of the juice-shard. It seems plain to us that learned value shards are[5] most strongly activated in the situations in which they were historically reinforced and strengthened. (For more on terminology, see Appendix A.2.)
While all of this is happening, many different shards of value are also growing, since the human reward system offers a range of feedback signals. Many subroutines are being learned, many heuristics are developing, and many proto-preferences are taking root. At this point, the brain learns a crude planning algorithm,[6] because proto-planning subshards (e.g. IF motor-command-5214 predicted to bring a juice pouch into view, THEN execute) would be reinforced for their contributions to activating the various hardcoded reward circuits. This proto-planning is learnable because most of the machinery was already developed by the self-supervised predictive learning, when e.g. learning to predict the consequences of motor commands (see Appendix A.1).
The planner has to decide on a coherent plan of action. That is, micro-incoherences (turn towards juice, but then turn back towards a friendly adult, but then turn back towards the juice, ad nauseam) should generally be penalized away.[7] Somehow, the plan has to be coherent, integrating several conflicting shards. We find it useful to view this integrative process as a kind of “bidding.” For example, when the juice-shard activates, the shard fires in a way which would have historically increased the probability of executing plans which led to juice pouches. We’ll say that the juice-shard is bidding for plans which involve juice consumption (according to the world model), and perhaps bidding against plans without juice consumption.
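A minimal sketch of this bidding picture, under the assumption that each shard simply scores the world model’s predicted consequences of each candidate plan and the plan with the largest total bid wins. The shard functions, weights, and plans below are invented for illustration.

```python
# Minimal sketch of the "bidding" picture: each shard scores candidate plans
# via world-model-predicted outcomes, and the plan with the most net support wins.
# Shard names, weights, and predicted outcomes are illustrative assumptions.

candidate_plans = {
    "turn around, grab juice": {"juice_consumed": True,  "adult_nearby": False},
    "crawl toward adult":      {"juice_consumed": False, "adult_nearby": True},
}

def juice_shard(predicted_outcome):
    # Bids for plans the world model predicts end with juice consumed.
    return 1.0 if predicted_outcome["juice_consumed"] else -0.2

def social_shard(predicted_outcome):
    # Bids for plans that keep a friendly adult nearby.
    return 0.6 if predicted_outcome["adult_nearby"] else 0.0

shards = [juice_shard, social_shard]

def choose_plan(plans):
    # Each shard places a bid on each plan's predicted outcome; the plan
    # with the largest total bid is the one that gets executed.
    return max(plans, key=lambda name: sum(shard(plans[name]) for shard in shards))

print(choose_plan(candidate_plans))   # -> "turn around, grab juice"
```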
Importantly, however, the juice-shard is shaped to bid for plans which the world model predicts actually lead to juice being consumed, and not necessarily for plans which lead to sugar-reward-circuit activation. You might wonder: “Why wouldn’t the shard learn to value reward circuit activation?”. The effect of drinking juice is that the baby’s credit assignment reinforces the computations which were causally responsible for producing the situation in which the hardcoded sugar-reward circuitry fired.
But what is reinforced? The content of the responsible computations includes a sequence of heuristics and decisions, one of which involved the juice pouch abstraction in the world model. Those are the circuits which actually get reinforced and become more likely to fire in the future. Therefore, the juice-heuristics get reinforced. The heuristics coalesce into a so-called shard of value as they query the world model and planner to implement increasingly complex multi-step plans.
In contrast, in this situation, the baby’s decision-making does not involve “if this action is predicted to lead to sugar-reward, then bid for the action.” This non-participating heuristic probably won’t be reinforced or created, much less become a shard of value.[8]
This is important. We see how the reward system shapes our values, without our values entirely binding to the activation of the reward system itself. We have also laid bare the manner in which the juice-shard is bound to your model of reality instead of simply your model of future perception. Looking back across the causal history of the juice-shard’s training, the shard has no particular reason to bid for the plan “stick a wire in my brain to electrically stimulate the sugar reward-circuit”, even if the world model correctly predicts the consequences of such a plan. In fact, a good world model predicts that the person will drink fewer juice pouches after becoming a wireheader, and so the juice-shard in a reflective juice-liking adult bids against the wireheading plan! Humans are not reward-maximizers; they are value shard-executors.
This, we claim, is one reason why people (usually) don’t want to wirehead and why people often want to avoid value drift. According to the sophisticated reflective capabilities of your world model, if you popped a pill which made you 10% more okay with murder, your world model would predict futures containing too much murder, and your current shards would bid against them.
We’re pretty confident that the reward circuitry is not a complicated hard-coded morass of alignment magic which forces the human to care about real-world juice. No, the hypothetical sugar-reward circuitry is simple. We conjecture that the order in which the brain learns abstractions makes it convergent to care about certain objects in the real world.
III. Explaining human behavior using shard theory
The juice-shard formation story is simple and—if we did our job as authors—easy to understand. However, juice-consumption is hardly a prototypical human value. In this section, we’ll show how shard theory neatly explains a range of human behaviors and preferences.
As people, we have lots of intuitions about human behavior. However, intuitively obvious behaviors still have to have mechanistic explanations—such behaviors still have to be retrodicted by a correct theory of human value formation. While reading the following examples, try looking at human behavior with fresh eyes, as if you were seeing humans for the first time and wondering what kinds of learning processes would produce agents which behave in the ways described.
Altruism is contextual
Consider Peter Singer’s drowning child thought experiment:
Imagine you come across a small child who has fallen into a pond and is in danger of drowning. You know that you can easily and safely rescue him, but you are wearing an expensive pair of shoes that will be ruined if you do.
Probably,[9] most people would save the child, even at the cost of the shoes. However, few of those people donate an equivalent amount of money to save a child far away from them. Why do we care more about nearby visible strangers as opposed to distant strangers?
We think that the answer is simple. First consider the relevant context. The person sees a drowning child. What shards activate? Consider the historical reinforcement events relevant to this context. Many of these events involved helping children and making them happy. These events mostly occurred face-to-face.
For example, perhaps there is a hardcoded reward circuit which is activated by a crude subcortical smile-detector and a hardcoded attentional bias towards objects with relatively large eyes. Then reinforcement events around making children happy would cause people to care about children. For example, an adult’s credit assignment might correctly credit decisions like “smiling at the child” and “helping them find their parents at a fair” as responsible for making the child smile. “Making the child happy” and “looking out for the child’s safety” are two reliable correlates of smiles, and so people probably reliably grow child-subshards around these correlates.
This child-shard most strongly activates in contexts similar to the historical reinforcement events. In particular, “knowing the child exists” will activate the child-shard less strongly than “knowing the child exists and also seeing them in front of you.” “Knowing there are some people hurting somewhere” activates altruism-relevant shards even more weakly still. So it’s no grand mystery that most people care more when they can see the person in need.
Shard theory retrodicts that altruism tends to be biased towards nearby people (and also the ingroup), without positing complex, information-inaccessibility-violating adaptations like the following:
We evolved in small groups in which people helped their neighbors and were suspicious of outsiders, who were often hostile. Today we still have these “Us versus Them” biases, even when outsiders pose no threat to us and could benefit enormously from our help. Our biological history may predispose us to ignore the suffering of faraway people, but we don’t have to act that way. — Comparing the Effect of Rational and Emotional Appeals on Donation Behavior
Similarly, you may be familiar with scope insensitivity: that the function from (# of children at risk) → (willingness to pay to protect the children) is not linear, but perhaps logarithmic. Is it that people “can’t multiply”? Probably not.
Under the shard theory view, it’s not that brains can’t multiply, it’s that for most people, the altruism-shard is most strongly invoked in face-to-face, one-on-one interactions, because those are the situations which have been most strongly touched by altruism-related reinforcement events. Whatever the altruism-shard’s influence on decision-making, it doesn’t steer decision-making so as to produce a linear willingness-to-pay relationship.
Friendship strength seems contextual
Personally, I (TurnTrout) am more inclined to make plans with my friends when I’m already hanging out with them—when we are already physically near each other. But why?
Historically, when I’ve hung out with a friend, that was fun and rewarding and reinforced my decision to hang out with that friend, and to continue spending time with them when we were already hanging out. As above, one possible way this could[10] happen is via a genetically hardcoded smile-activated reward circuit.
Since shards more strongly influence decisions in their historical reinforcement situations, the shards reinforced by interacting with my friend have the greatest control over my future plans when I’m actually hanging out with my friend.
Milgram is also contextual
The Milgram experiment(s) on obedience to authority figures was a series of social psychology experiments conducted by Yale University psychologist Stanley Milgram. They measured the willingness of study participants, men in the age range of 20 to 50 from a diverse range of occupations with varying levels of education, to obey an authority figure who instructed them to perform acts conflicting with their personal conscience. Participants were led to believe that they were assisting an unrelated experiment, in which they had to administer electric shocks to a “learner”. These fake electric shocks gradually increased to levels that would have been fatal had they been real. — Wikipedia
We think that people convergently learn obedience- and cooperation-shards which more strongly influence decisions in the presence of an authority figure, perhaps because of historical obedience-reinforcement events in the presence of teachers / parents. These shards strongly activate in this situation.
We don’t pretend to have sufficient mastery of shard theory to a priori quantitatively predict Milgram’s obedience rate. However, shard theory explains why people obey so strongly in this experimental setup, but not in most everyday situations: The presence of an authority figure and of an official-seeming experimental protocol. This may seem obvious, but remember that human behavior requires a mechanistic explanation. “Common sense” doesn’t cut it. “Cooperation- and obedience-shards more strongly activate in this situation because this situation is similar to historical reinforcement contexts” is a nontrivial retrodiction.
Indeed, varying the contextual features dramatically affected the percentage of people who administered “lethal” shocks.
Sunflowers and timidity
Consider the following claim: “People reliably become more timid when surrounded by tall sunflowers. They become easier to sell products to and ask favors from.”
Let’s see if we can explain this with shard theory. Consider the mental context. The person knows there’s a sunflower near them. What historical reinforcement events pertain to this context? Well, the person probably has pleasant associations with sunflowers, perhaps spawned by aesthetic reinforcement events which reinforced thoughts like “go to the field where sunflowers grow” and “look at the sunflower.”
Therefore, the sunflower-timidity-shard was grown from… Hm. It wasn’t grown. The claim isn’t true, and this shard doesn’t exist, because it’s not downstream of past reinforcement.
Thus: Shard theory does not explain everything, because shards are grown from previous reinforcement events and previous thoughts. Shard theory constrains anticipation around actual observed human nature.
Optional exercise: Why might it feel wrong to not look both ways before crossing the street, even if you have reliable information that the coast is clear?
Optional exercise: Suppose that it’s more emotionally difficult to kill a person face-to-face than from far away and out of sight. Explain via shard theory.[11]
We think that many biases are convergently produced artifacts of the human learning process & environment
We think that simple reward circuitry leads to different cognition activating in different circumstances. Different circumstances can activate cognition that implements different values, and this can lead to inconsistent or biased behavior. We conjecture that many biases are convergent artifacts of the human training process and internal shard dynamics. People aren’t just randomly (or by genetic hardcoding) more or less “rational” in different situations.
Projection bias
Humans have a tendency to mispredict their future marginal utilities by assuming that they will remain at present levels. This leads to inconsistency as marginal utilities (for example, tastes) change over time in a way that the individual did not expect. For example, when individuals are asked to choose between a piece of fruit and an unhealthy snack (such as a candy bar) for a future meal, the choice is strongly affected by their “current” level of hunger. — Dynamic inconsistency—Wikipedia
We believe that this is not a misprediction of how tastes will change in the future. Many adults know perfectly well that they will later crave the candy bar. However, a satiated adult has a greater probability of choosing fruit for their later self, because their deliberative shards are more strongly activated than their craving-related shards. The current level of hunger strongly controls which food-related shards are activated.
Sunk cost fallacy
Why are we hesitant to shift away from the course of action that we’re currently pursuing? There are two shard theory-related factors that we think contribute to sunk cost fallacy:
The currently active shards are probably those that bid for the current course of action. They also have more influence, since they’re currently very active. Thus, the currently active shard coalition supports the current course of action more strongly, when compared to your “typical” shard coalitions. This can cause the you-that-is-pursuing-the-course-of-action to continue, even after your “otherwise” self would have stopped.
Shards activate more strongly in concrete situations. Actually seeing a bear will activate self-preservation shards more strongly than simply imagining a bear. Thus, the concrete benefits of the current course of action will more easily activate shards than the abstract benefits of an imagined course of action. This can lead to overestimating the value of continuing the current activity relative to the value of other options.
Time inconsistency
A person might deliberately avoid passing through the sweets aisle in a supermarket in order to avoid temptation. This is a very strange thing to do, and it makes no sense from the perspective of an agent maximizing expected utility over quantities like “sweet food consumed” and “leisure time” and “health.” Such an EU-maximizing agent would decide to buy sweets or not, but wouldn’t worry about entering the aisle itself. Avoiding temptation makes perfect sense under shard theory.
Shards are contextually activated, and the sweet-shard is most strongly activated when you can actually see sweets. We think that planning-capable shards are manipulating future contexts so as to prevent the full activation of your sweet shard.
Similarly,
(A) Which do you prefer: to be given 500 dollars today, or 505 dollars tomorrow?
(B) Which do you prefer: to be given 500 dollars 365 days from now, or 505 dollars 366 days from now?
In such situations, people tend to choose $500 in (A) but $505 in (B), which is inconsistent with exponentially-discounted-utility models of the value of money. To explain this observed behavioral regularity using shard theory, consider the historical reinforcement contexts around immediate and delayed gratification. If contexts involving short-term opportunities activate different shards than contexts involving long-term opportunities, then it’s unsurprising that a person might choose 500 dollars in (A) but 505 dollars in (B).[12] (Of course, a full shard theory explanation must explain why those contexts activate different shards. We strongly intuit that there’s a good explanation, but do not think we have a satisfying story here yet.)
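The inconsistency claim can be checked directly. Below is a small sketch comparing an exponential discounter, which must make the same choice in (A) and (B) because the common factor of δ^365 cancels, with a hyperbolic-style discounter, which can reverse. The parameter values are arbitrary, and, per the footnote, shard theory does not claim people explicitly compute either formula.

```python
# Quick check that exponential discounting cannot produce the (A)/(B) reversal,
# while a hyperbolic-style discounter can. The discount parameters are arbitrary
# illustrations; shard theory does not claim people compute either formula.

def exponential_value(amount, days, delta=0.999):
    return amount * delta ** days

def hyperbolic_value(amount, days, k=0.1):
    return amount / (1 + k * days)

for value in (exponential_value, hyperbolic_value):
    choice_a = "$500 today" if value(500, 0) >= value(505, 1) else "$505 tomorrow"
    choice_b = "$500 at day 365" if value(500, 365) >= value(505, 366) else "$505 at day 366"
    print(value.__name__, "->", choice_a, "|", choice_b)

# The exponential discounter makes the same choice in (A) and (B), because both
# options in (B) are just the options in (A) multiplied by delta**365.
# The hyperbolic discounter chooses $500 in (A) but $505 in (B).
```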
Framing effect
This is another bias that’s downstream of shards activating contextually. Asking the same question in different contexts can change which value-shards activate, and thus change how people answer the question. Consider also: People are hesitant to drink from a cup labeled “poison”, even if they were the ones who put the label there.
Other factors driving biases
There are many different reasons why someone might act in a biased manner. We’ve described some shard theory explanations for the listed biases. These explanations are not exhaustive. While writing this, we found an experiment with results that seem contrary to the shard theory explanations of sunk cost. Namely, experiment 4 (specifically, the uncorrelated condition) in this study on sunk cost in pigeons.
However, the cognitive biases literature is so large and heterogeneous that there probably isn’t any theory which cleanly explains all reported experimental outcomes. We think that shard theory has decently broad explanatory power for many aspects of human values and biases, even though not all observations fit neatly into the shard theory frame. (Alternatively, we might have done the shard theory analysis wrong for experiment 4.)
Why people can’t enumerate all their values
Shards being contextual also helps explain why we can’t specify our full values. We can describe a moral theory that seems to capture our values in a given mental context, but it’s usually easy to find some counterexample to such a theory—some context or situation where the specified theory prescribes absurd behavior.
If shards implement your values, and shards activate situationally, your values will also be situational. Once you move away from the mental context / situation in which you came up with the moral theory, you might activate shards that the theory fails to capture. We think that this is why the static utility function framing is so hard for humans to operate.
E.g., the classical utilitarian maxim to maximize joy might initially seem appealing, but it doesn’t take long to generate a new mental context which activates shards that value emotions other than joy, or shards that value things in physical reality beyond your own mental state.
You might generate such new mental contexts by directly searching for shards that bid against pure joy maximization, or by searching for hypothetical scenarios which activate such shards (“finding a counterexample”, in the language of moral philosophy). However, there is no clean way to query all possible shards, and we can’t enumerate every possible context in which shards could activate. It’s thus very difficult to precisely quantify all of our values, or to create an explicit utility function that describes our values.
Content we aren’t (yet) discussing
The story we’ve presented here skips over important parts of human value formation. E.g., humans can do moral philosophy and refactor their deliberative moral framework without necessarily encountering any externally-activated reinforcement events, and humans also learn values through processes like cultural osmosis or imitation of other humans. Additionally, we haven’t addressed learned reinforcers (where a correlate of reinforcement events eventually becomes reinforcing in and of itself). We’ve also avoided most discussion of shard theory’s AI alignment implications.
This post explains our basic picture of shard formation in humans. We will address deeper shard theory-related questions in later posts.
Conclusion
Working from three reasonable assumptions about how the brain works, shard theory implies that human values (e.g. caring about siblings) are implemented by contextually activated circuits which activate in situations downstream of past reinforcement (e.g. when physically around siblings) so as to steer decision-making towards the objects of past reinforcement (e.g. making plans to spend more time together). According to shard theory, human values may be complex, but much of human value formation is simple.
For shard theory discussion, join our Discord server. Charles Foster wrote Appendix A.3. We thank David Udell, Peter Barnett, Raymond Arnold, Garrett Baker, Steve Byrnes, and Thomas Kwa for feedback on this finalized post. Many more people provided feedback on an earlier version.
Appendices
A.1 The formation of the world model
Most of our values seem to be about the real world. Mechanistically, we think that this means that they are functions of the state of our world model. We therefore infer that human values do not form durably or in earnest until after the human has learned a proto-world model. Since the world model is learned from scratch (by assumption 1 in section I), the world model takes time to develop. In particular, we infer that babies don’t have any recognizable “values” to speak of.
Therefore, to understand why human values empirically coalesce around the world model, we will sketch a detailed picture of how the world model might form. We think that self-supervised learning (assumption 2 in section I) produces your world model.
Due to learning from scratch, the fancy and interesting parts of your brain start off mostly useless. Here’s a speculative[13] story about how a baby learns to reduce predictive loss, in the process building a world model:
The baby is born[14] into a world where she is pummeled by predictive error after predictive error, because most of her brain consists of locally randomly initialized neural circuitry.
The baby’s brain learns that a quick loss-reducing hack is to predict that the next sensory activations will equal the previous ones: That nothing will observationally change from moment to moment. If the baby is stationary, much of the visual scene is constant (modulo saccades). Similar statements may hold for other sensory modalities, from smell (olfaction) to location of body parts (proprioception).
At the same time, the baby starts learning edge detectors in V1[15] (which seem to be universally learned / convergently useful in vision tasks) in order to take advantage of visual regularities across space and time, from moment to moment.
The baby learns to detect when she is being moved or when her eyes are about to saccade, in order to crudely anticipate e.g. translations of part of the visual field. For example, given the prior edge-detector activations and her current acceleration, the baby predicts that the next edge detectors to light up will be a certain translation of the previous edge-detector patterns.
This acceleration → visual translation circuitry is reliably learned because it’s convergently useful for reducing predictive loss in many situations under our laws of physics.
Driven purely by her self-supervised predictive learning, the baby has learned something interesting about how she is embedded in the world.
Once the “In what way is my head accelerating?” circuit is learned, other circuits can invoke it. This pushes toward modularity and generality, since it’s easier to learn a circuit which is predictively useful for two tasks, than to separately learn two variants of the same circuit. See also invariant representations.
The baby begins to learn rules of thumb e.g. about how simple objects move. She continues to build abstract representations of how movement relates to upcoming observations.
For example, she gains another easy reduction in predictive loss by using her own motor commands to predict where her body parts will soon be located (i.e. to predict upcoming proprioceptive observations).
This is the beginning of her self-model.
The rules of thumb become increasingly sophisticated. Object recognition and modeling begin in order to more precisely predict low- and medium-level visual activations, like “if I recognize a square-ish object at time t and it has smoothly moved left for k timesteps, predict I will recognize a square-ish object at time t+1 which is yet farther left in my visual field.”
As the low-hanging fruit are picked, the baby’s brain eventually learns higher-level rules.
“If a stationary object is to my right and I turn my head to the left, then I will stop seeing it, but if I turn my head back to the right, I will see it again.”
This rule requires statefulness via short-term memory and some coarse summary of the object itself (small time-scale object permanence within a shallow world-model).
Object permanence develops from the generalization of specific heuristics for predicting common objects, to an invariant scheme for handling objects and their relationship to the child.
Developmental milestones vary from baby to baby because it takes them a varying amount of time to learn certain keystone but convergent abstractions, such as self-models.
Weak evidence that this learning timeline is convergent: Crows (and other smart animals) reach object permanence milestones in a similar order as human babies reach them.
The more abstractions are learned, the easier it is to lay down additional functionality. When we see a new model of car, we do not have to relearn our edge detectors or car-detectors.
Learning continues, but we will stop here.
In this story, the world model is built from the self-supervised loss signal. Reinforcement probably also guides and focuses attention. For example, perhaps brainstem-hardcoded (but crude) face detectors hook into a reward circuit which focuses the learning on human faces.
A.2 Terminology
Shards are not full subagents
In our conception, shards vary in their sophistication (e.g. IF-THEN reflexes vs planning-capable, reflective shards which query the world model in order to steer the future in a certain direction) and generality of activating contexts (e.g. only activates when hungry and a lollipop is in the middle of the visual field vs activates whenever you’re thinking about a person). However, we think that shards are not discrete subagents with their own world models and mental workspaces. We currently estimate that most shards are “optimizers” to the extent that a bacterium or a thermostat is an optimizer.
“Values”
We defined[16] “values” as “contextual influences on decision-making.” We think that “valuing someone’s friendship” is what it feels like from the inside to be an algorithm with a contextually activated decision-making influence which increases the probability of e.g. deciding to hang out with that friend. Here are three extra considerations and clarifications.
Type-correctness. We think that our definition is deeply appropriate in certain ways. Just because you value eating donuts, doesn’t mean you want to retain that pro-donut influence on your decision-making. This is what it means to reflectively endorse a value shard—that the shards which reason about your shard composition, bid for the donut-shard to stick around. By the same logic, it makes total sense to want your values to change over time—the “reflective” parts of you want the shard composition in the future to be different from the present composition. (For example, many arachnophobes probably want to drop their fear of spiders.) Rather than humans being “weird” for wanting their values to change over time, we think it’s probably the default for smart agents meeting our learning-process assumptions (section I).
Furthermore, your values do not reflect a reflectively endorsed utility function. First off, those are different types of objects. Values bid for and against options, while a utility function grades options. Second, your values vary contextually, while any such utility function would be constant across contexts. More on these points later, in more advanced shard theory posts.
Different shard compositions can produce similar urges. If you feel an urge to approach nearby donuts, that indicates a range of possibilities:
A donut shard is firing to increase P(eating the donut) because the WM indicates there’s a short plan that produces that outcome, and seeing/smelling a donut activates the donut shard particularly strongly.
A hedonic shard is firing to increase P(eating the donut) because the WM indicates there’s a short plan that produces a highly pleasurable outcome.
A social shard is firing because your friends are all eating donuts, and the social shard was historically reinforced for executing plans where you “fit in” / gain their approval.
…
So, just because you feel an urge to eat the donut, doesn’t necessarily mean you have a donut shard or that you “value” donuts under our definition. (But you probably do.)
Shards are just collections of subshards. One subshard of your family-shard might steer towards futures where your family is happy, while another subshard may influence decisions so that your mother is proud of you. On my (TurnTrout’s) current understanding, “family shard” is just an abstraction of a set of heterogeneous subshards which are downstream of similar historical reinforcement events (e.g. related to spending time with your family). By and large, subshards of the same shard do not all steer towards the same kind of future.
“Shard Theory”
Over the last several months, many people have read either a draft version of this document, Alignment Forum comments by shard theory researchers, or otherwise heard about “shard theory” in some form. However, in the absence of a canonical public document explaining the ideas and defining terms, “shard theory” has become overloaded. Here, then, are several definitions.
This document lays out (the beginning of) the shard theory of human values. This theory attempts a mechanistic account of how values / decision-influencers arise in human brains.
As hinted at by our remark on shard theory mispredicting behavior in pigeons, we also expect this theory to qualitatively describe important aspects of animal cognition (insofar as those animals satisfy learning from scratch + self-supervised learning + reinforcement learning).
Typical shard theory questions:
“What is the mechanistic process by which a few people developed preferences over what happens under different laws of physics?”
“What is the mechanistic basis of certain shards (e.g. people respecting you) being ‘reflectively endorsed’, while other shards (e.g. avoiding spiders) can be consciously ‘planned around’ (e.g. going to exposure therapy so that you stop embarrassingly startling when you see a spider)?” Thanks to Thane Ruthenis for this example.
“Why do humans have good general alignment properties, like robustness to ontological shifts?”
The shard paradigm/theory/frame of AI alignment analyzes the value formation processes which will occur in deep learning, and tries to figure out their properties.
Typical questions asked under this paradigm/frame:
“How can we predictably control the way in which a policy network generalizes? For example, under what training regimes and reinforcement schedules would a CoinRun agent generalize to pursuing coins instead of the right end of the level? What quantitative relationships and considerations govern this process?”
“Will deep learning agents robustly and reliably navigate ontological shifts?”
This paradigm places a strong (and, we argue, appropriate) emphasis on taking cues from humans, since they are the only empirical examples of real-world general intelligences which “form values” in some reasonable sense.
That said, alignment implications are out of scope for this post. We postpone discussion to future posts.
“Shard theory” also has been used to refer to insights gained by considering the shard theory of human values and by operating the shard frame on alignment.
We don’t like this ambiguous usage. We would instead say something like “insights from shard theory.”
Example insights include Reward is not the optimization target and Human values & biases are inaccessible to the genome.
A.3 Evidence for neuroscience assumptions
In section I, we stated that shard theory makes three key neuroscientific assumptions. Below we restate those assumptions, and give pointers to what we believe to be representative evidence from the psychology & neuroscience literature:
The cortex is basically locally randomly initialized.
Steve Byrnes has already written on several key lines of evidence that suggest the telencephalon (which includes the cerebral cortex) & cerebellum learn primarily from scratch. We recommend his writing as an entrypoint into that literature.
One easily observable weak piece of evidence: humans are super altricial—if the genome hardcoded a bunch of the cortex, why would babies take so long to become autonomous?
The brain does self-supervised learning.
Certain forms of spike-timing dependent plasticity (STDP) as observed in many regions of telencephalon would straightforwardly support self-supervised learning at the synaptic level, as connections are adjusted such that earlier inputs (pre-synaptic firing) anticipate later outputs (post-synaptic firing).
Within the hippocampus, place-selective cells fire in the order of the spatial locations they are bound to, with a coding scheme that plays out whole sequences of place codes that the animal will later visit.
If the predictive processing framework is an accurate picture of information processing in the brain, then the brain obviously does self-supervised learning.
The brain does reinforcement learning.
Within captive animal care, positive reinforcement training appears to be a common paradigm (see this paper for a reference in the case of nonhuman primates). This at least suggests that “shaping complex behavior through reward” is possible.
Operant & respondent conditioning methods like fear conditioning have a long history of success, and are now related back to key neural structures that support the acquisition and access of learned responses. These paradigms work so well, experimenters have been able to use them to have mice learn to directly control the activity of a single neuron in their motor cortex.
Wolfram Schultz and colleagues have found that the signaling behavior of phasic dopamine in the mesocorticolimbic pathway mirrors that of a TD error (or reward prediction error); a minimal sketch of that signal follows this list.
In addition to finding correlates of reinforcement learning signals in the brain, artificial manipulation of those signal correlates (through optogenetic stimulation, for example) produces the behavioral adjustments that would be predicted from their putative role in reinforcement learning.
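For readers unfamiliar with TD errors, here is a minimal sketch of the reward-prediction-error signal referenced above: delta = r + gamma·V(next state) − V(state), used to update a value estimate. The states, rewards, and learning rate are toy numbers, not a model of any particular experiment.

```python
# Minimal sketch of a TD (reward prediction) error: delta = r + gamma * V(s') - V(s).
# Phasic dopamine is reported to track a signal of roughly this form. The states,
# rewards, and learning rate here are toy numbers, not a model of any experiment.

gamma, alpha = 0.9, 0.1
V = {"cue": 0.0, "juice": 0.0, "end": 0.0}                 # value estimates per state
episode = [("cue", 0.0, "juice"), ("juice", 1.0, "end")]   # (state, reward, next_state)

for _ in range(200):
    for state, reward, next_state in episode:
        td_error = reward + gamma * V[next_state] - V[state]
        V[state] += alpha * td_error                       # learn from the prediction error

print(V)   # V["juice"] -> ~1.0, V["cue"] -> ~0.9: the cue comes to predict the reward
```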
- ^
More precisely, we adopt Steve Byrnes’ stronger conjecture that the telencephalon and cerebellum are locally ~randomly initialized.
- ^
There are non-synaptic ways to transmit information in the brain, including ephaptic transmission, gap junctions, and volume transmission. We also consider these to be part of a circuit’s mental context.
- ^
We take an agnostic stance on the form of RL in the brain, both because we have trouble spelling out exact, neurally plausible base credit assignment and reinforcement learning algorithms, and because we want the analysis to avoid additional assumptions.
- ^
In psychology, “shaping” roughly refers to this process of learning increasingly sophisticated heuristics.
- ^
Shards activate more strongly in historical reinforcement contexts, according to our RL intuitions, introspective experience, and inference from observed human behavior. We have some abstract theoretical arguments that RL should work this way in the brain, but won’t include them in this post.
- ^
We think human planning is less like Monte-Carlo Tree Search and more like greedy heuristic search. The heuristic is computed in large part by the outputs of the value shards, which themselves receive input from the world model about the consequences of the plan stub.
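To make the footnote’s contrast concrete, here is a toy sketch of greedy heuristic search in this style: extend the plan stub one action at a time, scoring each candidate extension by shard outputs applied to world-model-predicted consequences, with no tree search or rollouts. The world model, shard heuristic, and action set are invented stand-ins.

```python
# Toy sketch of greedy heuristic search guided by value shards: extend the plan
# stub one action at a time, scoring candidates by shard bids on world-model
# predictions (no tree search, no rollouts). All names here are illustrative.

actions = ["turn_left", "turn_right", "reach_forward", "grasp"]

def world_model_predict(plan_stub):
    # Hypothetical stand-in world model for a juice pouch in front of the agent.
    near = "reach_forward" in plan_stub
    grabbed = near and plan_stub[-1] == "grasp"
    return {"hand_near_juice": near, "juice_in_hand": grabbed}

def shard_heuristic(predicted):
    # Subshards of the juice-shard: a strong bid for juice in hand,
    # a weaker bid for merely getting the hand near the pouch.
    return 1.0 * predicted["juice_in_hand"] + 0.3 * predicted["hand_near_juice"]

def greedy_plan(depth=2):
    plan = []
    for _ in range(depth):
        # Greedily pick the next action whose predicted consequences the
        # shard heuristic scores highest.
        plan.append(max(actions,
                        key=lambda a: shard_heuristic(world_model_predict(plan + [a]))))
    return plan

print(greedy_plan())   # -> ['reach_forward', 'grasp']
```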
- ^
For example, turning back and forth while hungry might produce continual slight negative reinforcement events, at which point good credit assignment blames and downweights the micro-incoherences.
- ^
We think that “hedonic” shards of value can indeed form, and this would be part of why people seem to intrinsically value “rewarding” experiences. However, two points. 1) In this specific situation, the juice-shard forms around real-life juice. 2) We think that even self-proclaimed hedonists have some substantial values which are reality-based instead of reward-based.
- ^
We looked for a citation but couldn’t find one quickly.
- ^
We think the actual historical hanging-out-with-friend reinforcement events transpire differently. We may write more about this in future essays.
- ^
“It’s easier to kill a distant and unseen victim” seems common-sensically true, but we couldn’t actually find citations. Therefore, we are flagging this as possibly wrong folk wisdom. We would be surprised if it were wrong.
- ^
Shard theory reasoning says that while humans might be well-described as “hyperbolic discounters”, the real mechanistic explanation is importantly different. People may well not be doing any explicitly represented discounting; instead, discounting may only convergently arise as a superficial regularity! This presents an obstacle to alignment schemes aiming to infer human preferences by assuming that people are actually discounting.
- ^
We made this timeline up. We expect that we got many details wrong for a typical timeline, but the point is not the exact order. The point is to outline the kind of process by which the world model might arise only from self-supervised learning.
- ^
For simplicity, we start the analysis at birth. There is probably embryonic self-supervised learning as well. We don’t think it matters for this section.
- ^
Interesting but presently unimportant: My (TurnTrout)’s current guess is that given certain hard-coded wiring (e.g. where the optic nerve projects), the functional areas of the brain comprise the robust, convergent solution to: How should the brain organize cognitive labor to minimize the large metabolic costs of information transport (and, later, decision-making latency). This explains why learning a new language produces a new Broca’s area close to the original, and it explains why rewiring ferrets’ retinal projections into the auditory cortex seems to grow a visual cortex there instead. (jacob_cannell posited a similar explanation in 2015.)
The actual function of each functional area is overdetermined by the convergent usefulness of e.g. visual processing or language processing. Convergence builds upon convergence to produce reliable but slightly-varied specialization of cognitive labor across people’s brains. That is, people learn edge detectors because they’re useful, and people’s brains put them in V1 in order to minimize the costs of transferring information.
Furthermore, this process compounds upon itself. Initially there were weak functional convergences, and then mutations finetuned regional learning hyperparameters and connectome topology to better suit those weak functional convergences, and then the convergences sharpened, and so on. We later found that Voss et al.’s Branch Specialization made a similar conjecture about the functional areas.
- ^
I (TurnTrout) don’t know whether philosophers have already considered this definition (nor do I think that’s important to our arguments here). A few minutes of searching didn’t return any such definition, but please let me know if it already exists!
In my personal view, ‘Shard theory of human values’ illustrates both the upsides and pathologies of the local epistemic community.
The upsides
- the majority of the claims are true, or at least approximately true
- “shard theory” as a social phenomenon reached a critical mass that made the ideas visible to the broader alignment community, e.g. through in-person conversations, votes on LW, a series of posts, ...
- shard theory coined a number of locally memetically fit names or phrases, such as ‘shards’
- part of the success is that it led some people in the AGI labs to think about the mathematical structure of human values, which is an important problem
The downsides
- almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent models of mind
- the claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or the naive RL intuitions)
- the novel terminology is incompatible with the existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute (while this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research)
Overall, ‘shards’ became so popular that reading at least the basics is probably necessary to understand what many people are talking about.
That’s certainly an interesting position in the discussion about what people want!
Namely, that actions and preferences are just conditionally activated, and those contextual activations are balanced against each other. That means a person’s preference system may be not only incomplete but incoherent in architecture, and that moral systems and goals obtained via reflection are almost certainly not total (they will be silent in some contexts), which creates a problem for RLHF.
The first assumption, that part of the neural circuitry is basically randomly initialized, can’t really be tested well, because all humans are born into a similar gravity field, see similarly-structured images in their first days (all “colorful patches” correspond to objects which are continuous, mostly flat or uniformly round), etc., and that leaves a generic imprint.