Bogdan Ionut Cirstea comments on Bogdan Ionut Cirstea’s Shortform

Bogdan Ionut Cirstea 29 Aug 2024 20:09 UTC
1 point
0
From a chat with Claude on the example of applying a multilevel interpretability framework to deception from https://arxiv.org/abs/2408.12664:
’The paper uses the example of studying deception in language models (LLMs) to illustrate how Marr’s levels of analysis can be applied to AI interpretability research. Here’s a detailed breakdown of how the authors suggest approaching this topic at each level:

1.Computational Level:
- Define criteria for classifying LLM behavior as deception
- Develop comprehensive benchmarks to measure deceptive behaviors across various conditions
- Conduct thorough behavioral assessments while systematically varying: a) Input properties (e.g., prompting strategies, specific word usage) b) Model properties (e.g., training data composition, fine-tuning tasks, model architecture)
- Develop theoretical (possibly Bayesian) models to predict when an LLM would exhibit deceptive behavior given particular inputs and internal states
- Use these insights to adjust training and inference procedures to control LLM deception
2. Algorithmic/Representational Level:
- Examine representational spaces across different layers of the LLM
- Test which spaces systematically separate instances of deception from instances of honesty
- Trace how specific inputs (e.g., “deception triggers”) alter the latent space trajectory of an LLM
- Analyze how generated tokens deviate from the ground truth trajectory to a deception trajectory
- Use high-level representational trajectories to generalize over specific word choices to more general cases of deception
- Determine whether the model uses a single algorithm for different deception cases or if deception is supported by multiple diverse algorithms
3. Implementation Level:
- Trace the specific circuits that selectively activate during deceptive behavior
- Identify circuits that causally contribute to deceptive behavior
- Map out circuits responsible for tracking different kinds of deception-relevant information, such as: a) The true state of the world b) The agent’s intention c) The user’s current and predicted beliefs
- Use insights from computational and algorithmic levels to guide the search for relevant circuits
- Conversely, use circuit-level findings to refine algorithmic-level hypotheses
The authors emphasize that this multi-level approach allows for a more comprehensive understanding of the phenomenon:
- Computational-level analysis can yield insights for adjusting training and inference procedures.
- Algorithmic-level analysis can reveal how the model generalizes across different instances of deception.
- Implementation-level analysis can pinpoint specific mechanisms that can be targeted for intervention.
They also stress the importance of iterating between levels. For example, computational and algorithmic understanding can guide the search for relevant circuits at the implementation level, while circuit-level findings can help refine higher-level hypotheses.
This example illustrates how adopting a multi-level approach can provide a more holistic and actionable understanding of complex behaviors in AI systems, potentially leading to more effective interventions and improvements in model design.′