A semi-formalization of shard theory. I think that there is a surprisingly deep link between “the AIs which can be manipulated using steering vectors” and “policies which are made of shards.”[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two “motivational circuits” (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).
On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It’s just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.
This definition also makes obvious the fact that “shards” are a matter of implementation, not of behavior.
It also captures the fact that “shard” definitions are somewhat subjective. In one moment, I might model someone as having a separate “ice cream shard” and “cookie shard”, but in another situation I might choose to model those two circuits as a larger “sweet food shard.”
So I think this captures something important. However, it leaves a few things to be desired:
What, exactly, is a “motivational circuit”? Obvious definitions seem to include every neural network with nonconstant outputs.
Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have k ≤ d_model shards, which seems obviously wrong and false.

That said, I still find this definition useful.

I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

[1] Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
For illustration, what would be an example of having different shards for “I get food” (F) and “I see my parents again” (P) compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?
I think this is also what I was confused about—TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts, e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? But what does that mean? Perhaps it means: the shards aren’t always applied; rather, only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don’t understand this.
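One toy way to concretize that guess (purely illustrative; the scores and the “context gates” below are made up, not anything canonical): the “utility agent” always consults one fixed table over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P, while the “shard agent” has two separate circuits whose contributions are gated by the current context, so either one can be switched off entirely.

```python
# Toy contrast between one utility function over outcome conjunctions and two
# independently gated shards. All numbers are made up for illustration.

UTILITY = {  # one fixed score for each of the four conjunctions (F, P)
    (True, True): 15.0, (True, False): 10.0, (False, True): 5.0, (False, False): 0.0
}

def utility_agent_score(food, parents, context=None):
    # The whole table is consulted in every state, regardless of context.
    return UTILITY[(food, parents)]

def shard_agent_score(food, parents, context):
    # Each shard only contributes when its activating context is present.
    food_shard = 10.0 * food if context.get("smells_food") else 0.0
    parent_shard = 5.0 * parents if context.get("thinking_of_home") else 0.0
    return food_shard + parent_shard

ctx = {"smells_food": True, "thinking_of_home": False}
print(utility_agent_score(True, True, ctx))   # 15.0 in every context
print(shard_agent_score(True, True, ctx))     # 10.0: the parent shard simply isn't active here
```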
Hey TurnTrout.

I’ve always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with their friend if they’re currently talking to their friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard “hang out with Alice” is weighted higher in contexts where Alice is nearby.
Let’s say π : (S×A)* × S → ΔA is a policy with state space S and action space A.

A “context” is a small moving window in the state-history, i.e. an element of S^d where d is a small positive integer.

A shard is something like u : S×A → R, i.e. it evaluates actions given particular states.

The shards u_1, …, u_n are “activated” by contexts, i.e. g_i : S^d → R≥0 maps each context to the amount that shard u_i is activated by the context.

The total activation of u_i, given a history h := (s_1, a_1, s_2, a_2, …, s_{N−1}, a_{N−1}, s_N), is given by the time-decayed average of the activation across the contexts, i.e. λ_i = g_i(s_{N−d+1}, …, s_N) + β·g_i(s_{N−d}, …, s_{N−1}) + β²·g_i(s_{N−d−1}, …, s_{N−2}) + ⋯

The overall utility function u is the weighted combination of the shards, i.e. u = λ_1·u_1 + ⋯ + λ_n·u_n

Finally, the policy π maximises the utility function (softly), i.e. π(h) = softmax(u(s_N, ·))

Is this what you had in mind?
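To make the above concrete, here is a minimal runnable sketch in Python. Everything specific in it is invented for illustration: the two hypothetical shards (a “cheese” shard and a “top-right” shard), their hand-written context gates, the window size d = 1, and the decay β are placeholder choices, not anything measured from a real policy; actions are dropped from the history since nothing below needs them.

```python
import numpy as np

# Toy instantiation of the formalization above; every number here is made up.
# States are 2-D grid positions; actions are the four cardinal moves.
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
CHEESE = np.array([3.0, 7.0])       # hypothetical cheese location
TOP_RIGHT = np.array([9.0, 9.0])    # hypothetical top-right corner of a 10x10 maze

def u_cheese(state, action):
    # Shard utility u_i(s, a): prefer actions that reduce distance to the cheese.
    return -np.linalg.norm(state + np.array(action) - CHEESE)

def u_top_right(state, action):
    return -np.linalg.norm(state + np.array(action) - TOP_RIGHT)

def g_cheese(context):
    # Context gate g_i: the cheese shard only activates when the cheese is nearby (window d = 1).
    (state,) = context
    return 1.0 if np.linalg.norm(state - CHEESE) < 4.0 else 0.0

def g_top_right(context):
    # The top-right shard is weakly active in every context.
    return 0.5

SHARDS = [(u_cheese, g_cheese), (u_top_right, g_top_right)]
BETA = 0.7   # time-decay factor beta
D = 1        # context window size d

def shard_activations(states, beta=BETA, d=D):
    # lambda_i = g_i(newest window) + beta * g_i(next window) + beta^2 * ...
    lams = []
    for _u, g in SHARDS:
        lam, w = 0.0, 1.0
        for t in range(len(states) - d, -1, -1):   # newest window first
            lam += w * g(tuple(states[t:t + d]))
            w *= beta
        lams.append(lam)
    return lams

def policy(states, temperature=1.0):
    # pi(h) = softmax over actions of u(s_N, a), where u = sum_i lambda_i * u_i.
    s_N = states[-1]
    lams = shard_activations(states)
    scores = np.array([sum(lam * u(s_N, a) for (u, _g), lam in zip(SHARDS, lams))
                       for a in ACTIONS])
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

states = [np.array([2.0, 6.0])]     # a one-state history near the cheese
print(dict(zip(["+y", "-y", "+x", "-x"], policy(states).round(3))))
```

With these made-up gates the cheese shard only contributes near the cheese, so the action distribution shifts with context; that context-gated contribution is the “independent activation” property the definition above asks for.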
Thanks for posting this. I’ve been confused about the connection between shard theory and activation vectors for a long time!
AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other
This confuses me.
I can imagine an AIXI program where the utility function is compositional even if the optimisation is unitary. And I guess this isn’t two full motivational circuits, but it kind of is two motivational circuits.
I’m not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
A shardful agent can be incoherent due to valuing different things from different states
A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather than their ultimate consequences
A shardful agent saves compute by not evaluating the whole utility function
The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment, like cheese presence, that cause its preferences to change, but it mostly does not suffer from the Allais paradox, tends to spend resources on actions in proportion to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what that motivational structure is remains very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.

Instead of demanding orthogonal representations, just have them obey the restricted isometry property.

Basically, instead of requiring ∀i≠j: ⟨x_i, x_j⟩ = 0, we just require ∀i≠j: |⟨x_i, x_j⟩| ≤ ε.

This would allow a polynomial number of sparse shards while still allowing full recovery.
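A quick numerical sketch of that near-orthogonality point (random unit vectors stand in for shard directions; the width d = 512 and the count n are arbitrary choices, not any particular model’s dimensions): far more than d directions can coexist once exact orthogonality is relaxed to a small ε bound on pairwise inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 10_000      # arbitrary: a residual-stream-ish width, and many more "shard" directions

# Random unit vectors in R^d are nearly orthogonal with high probability.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Estimate the largest |<x_i, x_j>| over a random sample of pairs (the full Gram matrix is big).
i = rng.integers(0, n, 20_000)
j = rng.integers(0, n, 20_000)
mask = i != j
eps = np.abs(np.sum(X[i[mask]] * X[j[mask]], axis=1)).max()

print(f"{n} directions in {d} dimensions; max sampled |cos| ≈ {eps:.3f}")
# Exact compositionality/orthogonality caps the count at d = 512 directions, but relaxing to
# |<x_i, x_j>| <= eps lets the number of directions grow far beyond the dimension.
```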
Maybe somewhat oversimplifying, but this might suggest non-trivial similarities to Simulators and having [the representations of] multiple tasks in superposition (e.g. during in-context learning). One potential operationalization/implementation mechanism (especially in the case of in-context learning) might be task vectors in superposition.

On a related note, perhaps it might be interesting to try SAEs / other forms of disentanglement to see if there’s actually something like superposition going on in the representations of the maze-solving policy? Something like ‘not enough dimensions’ + ambiguity in the reward specification sounds like it might be a pretty attractive explanation for the potential goal misgeneralization.

Edit 1: more early evidence.

Edit 2: the full preprint referenced in tweets above is now public.

Here’s some intuition of why I think this could be a pretty big deal, straight from Claude, when prompted with ‘come up with an explanation/model, etc. that could unify the findings in these 2 papers’ and fed Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition and Understanding and Controlling a Maze-Solving Policy Network:
Based on the findings from these two papers, I propose the following unified explanation/model for how goal representations and task superposition work in neural networks:

Key Unified Concepts:
Distributed and Redundant Goal Representations: Both papers demonstrate that goals and tasks are represented in a distributed and redundant manner across multiple network components. The maze-solving policy has 11 channels that track the cheese location, while the language models exhibit task superposition across multiple network components.
Context-Dependent Goal Activation: Goals and tasks are activated in a context-dependent manner. The maze policy pursues different goals based on maze features, while language models activate different tasks based on in-context examples.
Continuous Goal/Task Spaces: Rather than discrete goals/tasks, networks seem to operate in continuous goal/task spaces. This is evidenced by the ability to interpolate between tasks in language models and steer the maze policy to arbitrary locations.
Separable Goal/Task Vectors: Goals and tasks can be represented as separable vectors in activation space. This allows for manipulation through vector addition/subtraction to modify network behavior.
Emergence of Goal/Task Representations: These goal and task representations emerge naturally from training, even when not explicitly trained for multi-task capabilities. Both papers show this emergent behavior in models trained on single objectives.
Unified Model: I propose that neural networks develop an internal “goal/task space” during training. This space has the following properties:
It is continuous and high-dimensional, allowing for representation of many possible goals/tasks.
Different regions of this space correspond to different goals/tasks.
The network’s activations at certain layers map to points in this space, representing the current goal/task.
Multiple redundant mappings exist between network activations and the goal/task space, providing robustness.
The mapping from inputs to points in the goal/task space is context-dependent and learned during training.
The goal/task point then influences subsequent network computations to pursue that goal/task.
Vector operations in activation space correspond to movements in the goal/task space, allowing for controlled modification of network behavior.
This unified model explains the observed phenomena in both papers:
Distributed and redundant goal representations
Context-dependent goal/task selection
Ability to interpolate and combine goals/tasks
Steering of behavior through activation engineering
It also provides a framework for understanding how complex goal-directed and multi-task behaviors can emerge in neural networks trained on seemingly simpler objectives. This model could guide future research into understanding and controlling the goal/task representations in AI systems.
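As a toy illustration of the “goal/task space” picture above, here is a small numpy sketch; everything in it (the “network”, the goal-space readout matrix, and the steering direction) is invented rather than taken from either paper. Because the decoded goal point is a linear readout of a hidden activation, adding a vector in activation space moves the goal point by a predictable amount, which is the property the model above relies on.

```python
import numpy as np

rng = np.random.default_rng(1)
d_act, d_goal = 64, 2                      # hypothetical hidden width and a 2-D "goal space"

W_enc = rng.standard_normal((d_act, 8))    # toy encoder: observation -> hidden activation
W_goal = rng.standard_normal((d_goal, d_act)) / np.sqrt(d_act)   # toy goal-space readout

obs = rng.standard_normal(8)
h = np.tanh(W_enc @ obs)                   # hidden activation for this input

goal_before = W_goal @ h                   # decoded point in the goal/task space

# "Steering": find an activation-space direction whose readout equals a desired
# movement delta in goal space, then add it to the hidden activation.
delta = np.array([1.0, 0.0])
steer, *_ = np.linalg.lstsq(W_goal, delta, rcond=None)
goal_after = W_goal @ (h + steer)

print("goal before:", goal_before.round(3))
print("goal after :", goal_after.round(3))
print("shift      :", (goal_after - goal_before).round(3))   # ≈ delta, since the readout is linear
```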
When further prompted with ‘what about superposition and task vectors?’:
You’re right to point out superposition and task vectors—these are indeed key concepts that I should have emphasized more explicitly in the unified model. Let me expand on how these fit into and enhance the unified explanation:

Superposition:
Task Superposition: Both papers demonstrate that neural networks can represent and execute multiple tasks or goals simultaneously. This is a core aspect of superposition.
In language models, this manifests as the ability to perform multiple in-context learning tasks in a single forward pass.
In the maze-solving policy, superposition is evident in the network’s ability to pursue multiple context-dependent goals (e.g., navigating to cheese or to the top-right corner).
Superposition allows networks to efficiently represent multiple tasks/goals within the same set of parameters, rather than requiring separate networks for each task.
Task Vectors:
Task Vector Representation: Both papers show that tasks or goals can be represented as vectors in the network’s activation space.
In language models, task vectors are extracted from the difference in activations between task-specific and baseline contexts.
For the maze-solving policy, “cheese vectors” and “top-right vectors” are computed similarly, representing the difference in activations with and without specific goals.
These task vectors can be added or subtracted from network activations to modify behavior, demonstrating their role in representing and controlling task execution.
The ability to combine task vectors (e.g., subtracting the cheese vector while adding the top-right vector) shows how networks can blend multiple task representations.
Enhanced Unified Model: Incorporating these concepts more explicitly, we can enhance the unified model as follows:
The internal “goal/task space” of the network supports superposition, allowing multiple goals/tasks to be represented and pursued simultaneously.
Points in this space can be thought of as weighted combinations of task vectors, with the weights determining the relative importance of each task.
The network’s activations at certain layers encode these superposed task representations.
Task vectors provide a mechanism for moving through the goal/task space, allowing for fine-grained control over network behavior.
The superposition property allows the network to efficiently represent a vast range of possible task combinations within its fixed parameter space.
Context-dependent inputs determine how these superposed task representations are constructed and weighted during inference.
The network’s subsequent computations operate on this superposed representation, effectively pursuing multiple goals/tasks in parallel.
This enhanced model better captures the key insights from both papers, explaining how networks can efficiently represent and execute multiple tasks, and how this capability can be analyzed and controlled through the lens of task vectors and superposition. It provides a powerful framework for understanding the emergent multi-task capabilities of neural networks and suggests methods for steering their behavior through targeted interventions in activation space.
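Here is a minimal numpy sketch of that extraction-and-injection recipe, using a made-up two-layer toy network rather than either paper’s actual models: the “task vector” is just the mean difference of hidden activations between a with-goal batch and a baseline batch, and it is then added (or subtracted, as with the cheese vector) at the same layer at inference time.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_hid, d_out = 16, 64, 4
W1 = rng.standard_normal((d_hid, d_in)) / 4.0
W2 = rng.standard_normal((d_out, d_hid)) / 8.0

def hidden(x):
    # Hidden activations of a made-up two-layer network (x may be a batch of column vectors).
    return np.tanh(W1 @ x)

def forward(x, steer=None):
    # Full forward pass, optionally adding a steering vector at the hidden layer.
    h = hidden(x)
    if steer is not None:
        h = h + steer
    return W2 @ h

# Two hypothetical input distributions: contexts with the goal cue present vs a baseline.
with_goal = rng.standard_normal((d_in, 200)) + 0.5
baseline  = rng.standard_normal((d_in, 200))

# Task/steering vector = mean difference of hidden activations between the two contexts.
task_vec = hidden(with_goal).mean(axis=1) - hidden(baseline).mean(axis=1)

x = rng.standard_normal(d_in)
print("unsteered :", forward(x).round(3))
print("+task_vec :", forward(x, steer=+2.0 * task_vec).round(3))   # push toward the goal-ish behavior
print("-task_vec :", forward(x, steer=-2.0 * task_vec).round(3))   # push away (cf. subtracting the cheese vector)
```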
Then, when fed Alex’s shortform comment and prompted with ‘how well would this fit this ‘semi-formalization of shard theory’?’:
This semi-formalization of shard theory aligns well with our unified model of goal representations and task superposition in neural networks. Let’s analyze how it fits and where it might extend our understanding:
Multiple Motivational Circuits: The idea of multiple “motivational circuits” or shards aligns perfectly with our observation of distributed and redundant goal representations. In the maze-solving policy, we identified multiple channels tracking the cheese location, which could be seen as components of a “cheese-seeking shard.”
Independent Activation: The ability of shards to activate independently matches our finding of context-dependent goal activation. Different maze features activate different goals, just as different contexts activate different shards.
Compositional Representation: While the semi-formalization suggests compositional representation, it acknowledges that this is unrealistic due to superposition. This aligns with our model’s emphasis on superposition and continuous goal/task spaces.
Implementation-Level Definition: The focus on implementation rather than behavior fits well with our approach of examining internal network representations rather than just output behavior.
Subjective Granularity: The point about the subjective nature of shard definitions resonates with our observation of redundant and distributed representations. We could choose to model the 11 cheese-tracking channels as separate shards or as components of a single “cheese-seeking shard.”
Steering Vectors: The connection between shard theory and steering vectors is particularly relevant. Our unified model uses task vectors, which are essentially steering vectors, to modify network behavior.
Extensions and Refinements:
Motivational Circuits: We could refine the definition of “motivational circuits” using our understanding of goal-tracking channels and task vectors. A motivational circuit could be defined as a set of network components that, when activated, consistently bias the network’s behavior towards a specific goal or task.
Superposition: Our model explicitly incorporates superposition, addressing one of the limitations in the semi-formalization. We can represent shards as vectors in a continuous space, allowing for non-orthogonal combinations.
Granularity and Hierarchy: The subjective nature of shard definitions could be formalized by considering hierarchical representations in the goal/task space. Lower-dimensional subspaces could represent more general shards, while higher-dimensional representations could capture finer distinctions.
Activation Patterns: We could extend the model to consider not just the presence of shards, but their activation patterns across different contexts. This could help differentiate between systems with similar shards but different activation dynamics.
In conclusion, this semi-formalization of shard theory fits well with our unified model and provides a valuable perspective on understanding complex, multi-goal behaviors in neural networks. It offers a framework for thinking about how multiple motivations can coexist and interact within a single system, which is crucial for understanding and controlling advanced AI systems.
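And a small numpy sketch of the superposition/non-orthogonal-combination point, again with invented vectors rather than anything extracted from a real model: several near-orthogonal “task vectors” are blended with weights into a single activation-sized representation, and the weights can be approximately read back out by least squares, which is what lets one representation carry several tasks at once.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_tasks = 256, 12                    # arbitrary hidden width and number of hypothetical tasks

# Random unit vectors as stand-in task/shard vectors: nearly, but not exactly, orthogonal.
T = rng.standard_normal((n_tasks, d))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# A superposed representation: a weighted, non-orthogonal combination of a few active tasks.
true_w = np.zeros(n_tasks)
true_w[[1, 4, 9]] = [0.7, 0.2, 0.1]     # e.g. mostly task 1, a little of tasks 4 and 9
rep = true_w @ T + 0.05 * rng.standard_normal(d)   # plus some noise

# The weights can be approximately decoded back out of the single representation.
decoded_w, *_ = np.linalg.lstsq(T.T, rep, rcond=None)

print("true    :", true_w.round(2))
print("decoded :", decoded_w.round(2))
```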
(I guess there might also be a meta-point here about augmented/automated safety research, though I was only using Claude for convenience. Notice that I never fed it my own comment and I only fed it Alex’s at the end, after the ‘unifying theory’ had already been proposed. Also note that my speculation successfully predicted the task vector mechanism before the paper came out; and before the senior author’s post/confirmation.)

Edit: and throwback to an even earlier speculation, with arguably at least some predictive power: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=wHeawXzPM3g9xSF8P.