’The paper uses the example of studying deception in language models (LLMs) to illustrate how Marr’s levels of analysis can be applied to AI interpretability research. Here’s a detailed breakdown of how the authors suggest approaching this topic at each level:
1.Computational Level:
Define criteria for classifying LLM behavior as deception
Develop comprehensive benchmarks to measure deceptive behaviors across various conditions
Conduct thorough behavioral assessments while systematically varying: a) Input properties (e.g., prompting strategies, specific word usage) b) Model properties (e.g., training data composition, fine-tuning tasks, model architecture)
Develop theoretical (possibly Bayesian) models to predict when an LLM would exhibit deceptive behavior given particular inputs and internal states
Use these insights to adjust training and inference procedures to control LLM deception
2. Algorithmic/Representational Level:
Examine representational spaces across different layers of the LLM
Test which spaces systematically separate instances of deception from instances of honesty
Trace how specific inputs (e.g., “deception triggers”) alter the latent space trajectory of an LLM
Analyze how generated tokens deviate from the ground truth trajectory to a deception trajectory
Use high-level representational trajectories to generalize over specific word choices to more general cases of deception
Determine whether the model uses a single algorithm for different deception cases or if deception is supported by multiple diverse algorithms
3. Implementation Level:
Trace the specific circuits that selectively activate during deceptive behavior
Identify circuits that causally contribute to deceptive behavior
Map out circuits responsible for tracking different kinds of deception-relevant information, such as: a) The true state of the world b) The agent’s intention c) The user’s current and predicted beliefs
Use insights from computational and algorithmic levels to guide the search for relevant circuits
Conversely, use circuit-level findings to refine algorithmic-level hypotheses
The authors emphasize that this multi-level approach allows for a more comprehensive understanding of the phenomenon:
Computational-level analysis can yield insights for adjusting training and inference procedures.
Algorithmic-level analysis can reveal how the model generalizes across different instances of deception.
Implementation-level analysis can pinpoint specific mechanisms that can be targeted for intervention.
They also stress the importance of iterating between levels. For example, computational and algorithmic understanding can guide the search for relevant circuits at the implementation level, while circuit-level findings can help refine higher-level hypotheses. This example illustrates how adopting a multi-level approach can provide a more holistic and actionable understanding of complex behaviors in AI systems, potentially leading to more effective interventions and improvements in model design.′
From a chat with Claude on the example of applying a multilevel interpretability framework to deception from https://arxiv.org/abs/2408.12664:
’The paper uses the example of studying deception in language models (LLMs) to illustrate how Marr’s levels of analysis can be applied to AI interpretability research. Here’s a detailed breakdown of how the authors suggest approaching this topic at each level:
1.Computational Level:
Define criteria for classifying LLM behavior as deception
Develop comprehensive benchmarks to measure deceptive behaviors across various conditions
Conduct thorough behavioral assessments while systematically varying: a) Input properties (e.g., prompting strategies, specific word usage) b) Model properties (e.g., training data composition, fine-tuning tasks, model architecture)
Develop theoretical (possibly Bayesian) models to predict when an LLM would exhibit deceptive behavior given particular inputs and internal states
Use these insights to adjust training and inference procedures to control LLM deception
2. Algorithmic/Representational Level:
Examine representational spaces across different layers of the LLM
Test which spaces systematically separate instances of deception from instances of honesty
Trace how specific inputs (e.g., “deception triggers”) alter the latent space trajectory of an LLM
Analyze how generated tokens deviate from the ground truth trajectory to a deception trajectory
Use high-level representational trajectories to generalize over specific word choices to more general cases of deception
Determine whether the model uses a single algorithm for different deception cases or if deception is supported by multiple diverse algorithms
3. Implementation Level:
Trace the specific circuits that selectively activate during deceptive behavior
Identify circuits that causally contribute to deceptive behavior
Map out circuits responsible for tracking different kinds of deception-relevant information, such as: a) The true state of the world b) The agent’s intention c) The user’s current and predicted beliefs
Use insights from computational and algorithmic levels to guide the search for relevant circuits
Conversely, use circuit-level findings to refine algorithmic-level hypotheses
The authors emphasize that this multi-level approach allows for a more comprehensive understanding of the phenomenon:
Computational-level analysis can yield insights for adjusting training and inference procedures.
Algorithmic-level analysis can reveal how the model generalizes across different instances of deception.
Implementation-level analysis can pinpoint specific mechanisms that can be targeted for intervention.
They also stress the importance of iterating between levels. For example, computational and algorithmic understanding can guide the search for relevant circuits at the implementation level, while circuit-level findings can help refine higher-level hypotheses.
This example illustrates how adopting a multi-level approach can provide a more holistic and actionable understanding of complex behaviors in AI systems, potentially leading to more effective interventions and improvements in model design.′