Maybe somewhat oversimplifying, but this might suggest non-trivial similarities to Simulators and to having [the representations of] multiple tasks in superposition (e.g. during in-context learning). One potential operationalization/implementation mechanism (especially in the case of in-context learning) might be task vectors in superposition.

On a related note, it might be interesting to try SAEs or other forms of disentanglement to see whether something like superposition is actually going on in the representations of the maze-solving policy. 'Not enough dimensions' plus ambiguity in the reward specification sounds like a pretty attractive explanation for the potential goal misgeneralization.
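To make the SAE suggestion concrete, here is a minimal sketch of what this could look like, assuming access to a matrix of cached policy activations (the random `acts` tensor below is a stand-in); the architecture is just a generic one-hidden-layer sparse autoencoder with an L1 penalty, not anything specific to either paper:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic one-hidden-layer SAE: overcomplete dictionary + L1 sparsity."""
    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_act, d_dict)
        self.dec = nn.Linear(d_dict, d_act)

    def forward(self, x):
        feats = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(feats), feats

# Stand-in for cached activations from the maze policy
# (e.g. flattened channel activations at one layer, one row per state).
acts = torch.randn(10_000, 256)

sae = SparseAutoencoder(d_act=256, d_dict=2048)  # 8x overcomplete
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

for step in range(1_000):
    batch = acts[torch.randint(0, len(acts), (128,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If substantially more than 256 dictionary features end up (interpretably)
# active across mazes, that is at least weak evidence that the policy packs
# features in superposition.
```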
Edit 1: more early evidence.

Edit 2: the full preprint referenced in the tweets above is now public.

Here's some intuition for why I think this could be a pretty big deal, straight from Claude, when prompted with 'come up with an explanation/model, etc. that could unify the findings in these 2 papers' and fed Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition and Understanding and Controlling a Maze-Solving Policy Network:

Based on the findings from these two papers, I propose the following unified explanation/model for how goal representations and task superposition work in neural networks.

Key Unified Concepts:
Distributed and Redundant Goal Representations: Both papers demonstrate that goals and tasks are represented in a distributed and redundant manner across multiple network components. The maze-solving policy has 11 channels that track the cheese location, while the language models exhibit task superposition across multiple network components.
Context-Dependent Goal Activation: Goals and tasks are activated in a context-dependent manner. The maze policy pursues different goals based on maze features, while language models activate different tasks based on in-context examples.
Continuous Goal/Task Spaces: Rather than discrete goals/tasks, networks seem to operate in continuous goal/task spaces. This is evidenced by the ability to interpolate between tasks in language models and steer the maze policy to arbitrary locations.
Separable Goal/Task Vectors: Goals and tasks can be represented as separable vectors in activation space. This allows for manipulation through vector addition/subtraction to modify network behavior (see the sketch after this list).
Emergence of Goal/Task Representations: These goal and task representations emerge naturally from training, even when not explicitly trained for multi-task capabilities. Both papers show this emergent behavior in models trained on single objectives.
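As a rough illustration of the addition/subtraction point above, here is a minimal sketch of an activation-space intervention using a PyTorch forward hook; the toy `policy` network, the layer choice, and `cheese_vec` are all stand-ins, not the actual models or vectors from either paper:

```python
import torch
import torch.nn as nn

# Toy stand-in for a policy (or LM) trunk; the real networks differ.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 4))
target_layer = policy[1]  # intervene on this layer's output

cheese_vec = torch.randn(128)  # hypothetical goal vector at that layer

def steer(module, inputs, output, vec=cheese_vec, coeff=-1.0):
    # coeff = -1 subtracts the goal vector; coeff = +1 would add it.
    return output + coeff * vec

handle = target_layer.register_forward_hook(steer)
logits = policy(torch.randn(1, 64))  # this forward pass runs "goal-edited"
handle.remove()
```

Returning a modified output from a forward hook replaces the layer's output for the rest of the forward pass, which is one standard way such steering interventions are implemented.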
Unified Model: I propose that neural networks develop an internal “goal/task space” during training. This space has the following properties:
It is continuous and high-dimensional, allowing for representation of many possible goals/tasks.
Different regions of this space correspond to different goals/tasks.
The network’s activations at certain layers map to points in this space, representing the current goal/task (see the probe sketch after this list).
Multiple redundant mappings exist between network activations and the goal/task space, providing robustness.
The mapping from inputs to points in the goal/task space is context-dependent and learned during training.
The goal/task point then influences subsequent network computations to pursue that goal/task.
Vector operations in activation space correspond to movements in the goal/task space, allowing for controlled modification of network behavior.
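One way to operationalize the claim that activations map to points in a goal space is a linear probe from layer activations to a goal coordinate. A minimal sketch, with cached activations and per-rollout goal positions both replaced by random stand-ins:

```python
import torch

# Stand-ins: cached activations and a goal coordinate per rollout
# (in the maze setting, e.g. the cheese's (row, col) position).
acts = torch.randn(5_000, 256)
goal_xy = torch.randn(5_000, 2)

# Closed-form least-squares linear probe: acts @ W ~ goal_xy.
W = torch.linalg.lstsq(acts, goal_xy).solution

pred = acts @ W
ss_res = ((pred - goal_xy) ** 2).sum()
ss_tot = ((goal_xy - goal_xy.mean(0)) ** 2).sum()
print(f"probe R^2: {1 - ss_res / ss_tot:.3f}")
# A high R^2 on held-out mazes would support a (linear) goal-space readout;
# on the random stand-in data above it will of course be ~0.
```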
This unified model explains the observed phenomena in both papers:
Distributed and redundant goal representations
Context-dependent goal/task selection
Ability to interpolate and combine goals/tasks
Steering of behavior through activation engineering
It also provides a framework for understanding how complex goal-directed and multi-task behaviors can emerge in neural networks trained on seemingly simpler objectives. This model could guide future research into understanding and controlling the goal/task representations in AI systems.
When further prompted with ‘what about superposition and task vectors?’:
You're right to point out superposition and task vectors; these are indeed key concepts that I should have emphasized more explicitly in the unified model. Let me expand on how they fit into and enhance the unified explanation.

Superposition:
Task Superposition: Both papers demonstrate that neural networks can represent and execute multiple tasks or goals simultaneously. This is a core aspect of superposition.
In language models, this manifests as the ability to perform multiple in-context learning tasks in a single forward pass.
In the maze-solving policy, superposition is evident in the network’s ability to pursue multiple context-dependent goals (e.g., navigating to cheese or to the top-right corner).
Superposition allows networks to efficiently represent multiple tasks/goals within the same set of parameters, rather than requiring separate networks for each task.
Task Vectors:
Task Vector Representation: Both papers show that tasks or goals can be represented as vectors in the network’s activation space.
In language models, task vectors are extracted from the difference in activations between task-specific and baseline contexts.
For the maze-solving policy, “cheese vectors” and “top-right vectors” are computed similarly, representing the difference in activations with and without specific goals.
These task vectors can be added or subtracted from network activations to modify behavior, demonstrating their role in representing and controlling task execution.
The ability to combine task vectors (e.g., subtracting the cheese vector while adding the top-right vector) shows how networks can blend multiple task representations.
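A minimal sketch of the extraction-by-difference recipe described above: the task vector is the mean activation difference between matched contexts with and without the goal. All tensors below are random stand-ins for cached activations:

```python
import torch

# Stand-ins for cached layer activations over matched context pairs:
# same maze with vs. without cheese, and with vs. without a top-right goal.
acts_with_cheese = torch.randn(500, 128)
acts_no_cheese = torch.randn(500, 128)
acts_topright = torch.randn(500, 128)
acts_baseline = torch.randn(500, 128)

# Task vector = mean activation difference between the two conditions.
cheese_vec = acts_with_cheese.mean(0) - acts_no_cheese.mean(0)
topright_vec = acts_topright.mean(0) - acts_baseline.mean(0)

# Blending: subtract cheese-seeking while adding top-right-seeking.
combined_edit = -1.0 * cheese_vec + 1.0 * topright_vec
# `combined_edit` would then be added to activations at the same layer during
# inference, e.g. via a forward hook as sketched earlier.
```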
Enhanced Unified Model: Incorporating these concepts more explicitly, we can enhance the unified model as follows:
The internal “goal/task space” of the network supports superposition, allowing multiple goals/tasks to be represented and pursued simultaneously.
Points in this space can be thought of as weighted combinations of task vectors, with the weights determining the relative importance of each task (see the sketch after this list).
The network’s activations at certain layers encode these superposed task representations.
Task vectors provide a mechanism for moving through the goal/task space, allowing for fine-grained control over network behavior.
The superposition property allows the network to efficiently represent a vast range of possible task combinations within its fixed parameter space.
Context-dependent inputs determine how these superposed task representations are constructed and weighted during inference.
The network’s subsequent computations operate on this superposed representation, effectively pursuing multiple goals/tasks in parallel.
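Here is a minimal sketch of that weighted-combination picture; the task names, dimensions, and vectors are all hypothetical stand-ins:

```python
import torch

# Hypothetical dictionary of task vectors at some layer (d_model = 128).
task_vecs = {
    "antonyms": torch.randn(128),
    "translation": torch.randn(128),
    "copying": torch.randn(128),
}

def superpose(weights: dict) -> torch.Tensor:
    """Weighted combination of task vectors: a point in 'task space'."""
    return sum(w * task_vecs[name] for name, w in weights.items())

# A 70/30 blend of two tasks; on the model proposed here, injecting this
# into the residual stream would bias the network toward both tasks in
# proportion to the weights.
edit = superpose({"antonyms": 0.7, "translation": 0.3})
```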
This enhanced model better captures the key insights from both papers, explaining how networks can efficiently represent and execute multiple tasks, and how this capability can be analyzed and controlled through the lens of task vectors and superposition. It provides a powerful framework for understanding the emergent multi-task capabilities of neural networks and suggests methods for steering their behavior through targeted interventions in activation space.
Then, when fed Alex's shortform comment and prompted with 'how well would this fit this "semi-formalization of shard theory"?':
This semi-formalization of shard theory aligns well with our unified model of goal representations and task superposition in neural networks. Let’s analyze how it fits and where it might extend our understanding:
Multiple Motivational Circuits: The idea of multiple “motivational circuits” or shards aligns perfectly with our observation of distributed and redundant goal representations. In the maze-solving policy, we identified multiple channels tracking the cheese location, which could be seen as components of a “cheese-seeking shard.”
Independent Activation: The ability of shards to activate independently matches our finding of context-dependent goal activation. Different maze features activate different goals, just as different contexts activate different shards.
Compositional Representation: While the semi-formalization suggests compositional representation, it acknowledges that this is unrealistic due to superposition. This aligns with our model’s emphasis on superposition and continuous goal/task spaces.
Implementation-Level Definition: The focus on implementation rather than behavior fits well with our approach of examining internal network representations rather than just output behavior.
Subjective Granularity: The point about the subjective nature of shard definitions resonates with our observation of redundant and distributed representations. We could choose to model the 11 cheese-tracking channels as separate shards or as components of a single “cheese-seeking shard.”
Steering Vectors: The connection between shard theory and steering vectors is particularly relevant. Our unified model uses task vectors, which are essentially steering vectors, to modify network behavior.
Extensions and Refinements:
Motivational Circuits: We could refine the definition of “motivational circuits” using our understanding of goal-tracking channels and task vectors. A motivational circuit could be defined as a set of network components that, when activated, consistently bias the network’s behavior towards a specific goal or task (see the ablation sketch after this list).
Superposition: Our model explicitly incorporates superposition, addressing one of the limitations in the semi-formalization. We can represent shards as vectors in a continuous space, allowing for non-orthogonal combinations.
Granularity and Hierarchy: The subjective nature of shard definitions could be formalized by considering hierarchical representations in the goal/task space. Lower-dimensional subspaces could represent more general shards, while higher-dimensional representations could capture finer distinctions.
Activation Patterns: We could extend the model to consider not just the presence of shards, but their activation patterns across different contexts. This could help differentiate between systems with similar shards but different activation dynamics.
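One hypothetical way to test the 'motivational circuit' definition above: zero-ablate a candidate set of channels and check whether goal-directed behavior consistently drops. Everything below (the toy policy, the channel indices, the evaluation) is a stand-in:

```python
import torch
import torch.nn as nn

# Toy stand-in policy; `circuit_channels` is a hypothetical set of channels
# suspected to implement a "cheese-seeking shard".
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 4))
circuit_channels = [3, 17, 42, 80]  # hypothetical indices

def ablate(module, inputs, output):
    output = output.clone()
    output[..., circuit_channels] = 0.0  # zero-ablate the candidate circuit
    return output

def goal_rate(policy, n=1_000):
    # Stand-in evaluation: fraction of random states where action 0
    # (read: "toward the goal" in this toy setup) is chosen.
    obs = torch.randn(n, 64)
    return (policy(obs).argmax(dim=-1) == 0).float().mean().item()

base = goal_rate(policy)
handle = policy[1].register_forward_hook(ablate)
ablated = goal_rate(policy)
handle.remove()
# If ablating the candidate circuit consistently reduces goal-seeking across
# many contexts, that supports calling it a motivational circuit.
print(f"base: {base:.2f}, ablated: {ablated:.2f}")
```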
In conclusion, this semi-formalization of shard theory fits well with our unified model and provides a valuable perspective on understanding complex, multi-goal behaviors in neural networks. It offers a framework for thinking about how multiple motivations can coexist and interact within a single system, which is crucial for understanding and controlling advanced AI systems.
(I guess there might also be a meta-point here about augmented/automated safety research, though I was only using Claude for convenience. Notice that I never fed it my own comment and I only fed it Alex’s at the end, after the ‘unifying theory’ had already been proposed. Also note that my speculation successfully predicted the task vector mechanism before the paper came out; and before the senior author’s post/confirmation.)
Edit 3: and a throwback to an even earlier speculation, with arguably at least some predictive power: https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector?commentId=wHeawXzPM3g9xSF8P.