EIS XII: Summary
Part 12 of 12 in the Engineer’s Interpretability Sequence.
TAISIC = “the AI safety interpretability community”
MI = “mechanistic interpretability”
There might be some addenda later, but for now, this is the final post in The Engineer’s Interpretability Sequence. I hope you have found it interesting and have gotten some useful ideas. I will always be happy to talk to people about the topics from this sequence in the comments or via email. The last thing I will do here is offer a summary of key points, post by post :)
A Prequel: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Räuker et al., 2022)
A survey of over 300 works on inner interpretability from an AI safety perspective.
All opinions in this sequence, however, are my own and not necessarily those of coauthors or other affiliates.
EIS I: Intro
Lots of interpretability research exists, and the field is still rapidly growing.
Most of it is not very productive, and there is a significant gap between research and practice. Interpretability tools aren’t used much by engineers working on real alignment problems.
If one of our main goals for interpretability research is to help align highly intelligent AI systems in high-stakes settings, we should be working on tools that are more engineering-relevant.
EIS II: What is “Interpretability”?
This post introduces a parable about two researchers trying to understand and fix a classifier displaying racial biases.
The moral of the story is that we should not privilege difficult or interesting methods over easy and simple ones. It is key not to grade different tools on different curves.
From an engineer’s perspective, the term “interpretability” isn’t that useful.
The tools we call “interpretability” tools are entirely fungible with other techniques for describing, evaluating, and debugging models.
Mechanistic approaches to interpretability are not uniquely important for AI safety. MI tools have the potential to help identify and fix deceptive alignment failures, but...
There are many non-deceptive ways AI could go wrong.
MI is not uniquely useful for fixing deceptive alignment and especially not uniquely useful for fixing non-deceptive alignment failures.
EIS III: Broad Critiques of Interpretability Research
There is a growing consensus that interpretability research is generally not very productive or engineering-relevant.
There is also a growing consensus that better evaluation is needed. A lack of good evaluation methods may be the biggest challenge facing the interpretability research community.
There are three types of evaluation.
Intuition + pontification --> inadequate
Weak/ad-hoc --> still not enough
Based on engineering-relevant tasks --> what is needed
Task-based evaluation can be based on one of three things:
Making novel predictions about how a model will handle interesting inputs.
Controlling what a system does by guiding edits to it (see the ablation sketch at the end of this section).
Abandoning a system that does a nontrivial task and replacing it with a simpler reverse-engineered alternative.
Other common limitations of existing work
Poor scaling
Relying too much on humans in the loop
Failing to study combinations of tools
A lack of practical applications with real-world systems
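As a concrete illustration of the “guiding edits” flavor of task-based evaluation mentioned above, here is a minimal sketch that ablates a single hidden unit and compares the model’s outputs before and after. The model, unit index, and inputs are hypothetical placeholders, not anything from the sequence.

```python
import torch
import torch.nn as nn

# Hypothetical model and unit: stand-ins for a real system and a unit that an
# interpretability tool claims drives some behavior of interest.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
unit_to_ablate = 7

def ablate_unit(module, inputs, output):
    # Zero out one hidden unit's activation on every forward pass.
    output[:, unit_to_ablate] = 0.0
    return output

x = torch.rand(16, 784)  # placeholder inputs

with torch.no_grad():
    baseline_logits = model(x)
    handle = model[1].register_forward_hook(ablate_unit)  # hook the ReLU's output
    edited_logits = model(x)
    handle.remove()

# An engineering-relevant evaluation asks whether the edit changes behavior
# in the specific way the interpretability tool predicted it would.
print((edited_logits - baseline_logits).abs().max().item())
```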
EIS IV: A Spotlight on Feature Attribution/Saliency
Feature attribution/saliency methods are very common but unlikely to be very important from an engineering perspective.
These methods tend to be poorly evaluated, and when they have been subjected to task-based evaluation, they have not tended to fare well.
Even when they work, these methods aren’t equipped to be directly very useful. They require scrutinizing samples from some data distribution, so whatever feature attribution/saliency methods can reveal could also be found by simply examining how the model handles test inputs and looking for which features correspond to particular behaviors from the model.
There are, however, some indirect but useful ways of using attribution maps. For example, regularizing them can improve adversarial robustness. See EIS IX.
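To make the above concrete, here is a minimal sketch of an input-gradient saliency map, one of the simplest attribution methods. The classifier and input below are hypothetical placeholders.

```python
import torch
import torchvision.models as models

# Hypothetical setup: any trained image classifier would do; weights=None here
# just gives a randomly initialized stand-in.
model = models.resnet18(weights=None)
model.eval()

x = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder input image
logits = model(x)
target = logits.argmax(dim=1).item()

# Gradient of the predicted-class logit with respect to the input pixels.
logits[0, target].backward()
saliency = x.grad.abs().max(dim=1).values  # a (1, 224, 224) attribution map
print(saliency.shape)
```

As the discussion above argues, maps like this mostly re-describe behavior that could also be found by directly probing the model with test inputs.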
EIS V: Blind Spots In AI Safety Interpretability Research
TAISIC has some tendencies toward isolation from the broader research community.
TAISIC has reinvented, reframed, or renamed several paradigms
Mechanistic interpretability requires program synthesis, program induction, and/or programming language translation.
All of these are hard and suggest difficulties with scaling MI.
Moving forward, causal scrubbing researchers should be aware of...
the network compression literature.
frivolous circuits that may seem to be performing a task when studied in isolation but do not actually perform that task inside the network.
TAISIC has done a lot of work on “polysemanticity” and “superposition” in neural networks, but this work is very similar to, and in some ways redundant with, existing work on “disentanglement”.
Two different groups of AI researchers have now been working on the same problems under different names.
Terminology matters, and it may be the case that TAISIC’s terminology has caused a type of isolation among different groups of AI researchers.
Fixing deceptive alignment and trojans are very similar tasks from a technical perspective.
Moving forward, it will be important for researchers working on ELK not to reinvent the probing literature (see the probe sketch at the end of this section).
TAISIC works relatively little on intrinsic interpretability, but there are a number of reasons why more of this work may be good.
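For readers unfamiliar with the probing literature mentioned above, here is a minimal linear-probe sketch. The activations and the property being probed for are synthetic placeholders standing in for a real model’s hidden states.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 256))    # hypothetical hidden activations
labels = (acts[:, 0] > 0).astype(int)  # hypothetical property to probe for

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy suggests the property is linearly decodable
# from this layer's representations.
print("probe accuracy:", probe.score(X_te, y_te))
```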
EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety
There are four limitations with much of the existing work from TAISIC on MI.
Cherrypicked results.
Examples and anecdotes can be very misleading.
It is important to avoid p-hacking.
Only using toy models and tasks.
Any problem that humans can solve with a simple program is not a problem that should be addressed by deep learning.
A lack of scalability.
The norm in much MI research is not just to solve very simple, convenient problems, but to do so in one of the least scalable ways possible: through large amounts of effort from human experts.
The hard part of MI is generating good mechanistic hypotheses, but by only focusing on toy problems and having humans generate mechanistic hypotheses, MI research has avoided confronting this problem. This work has failed to scale to challenging problems, and might always fail to scale because of this dependency on hypothesis generation.
I would argue that automating the generation of mechanistic hypotheses is the only type of MI work TAISIC should prioritize at this point in time.
Not accomplishing useful tasks.
One reason that little MI research is applied to non-toy problems could be that current approaches to MI are just ill-equipped to produce competitive techniques.
The post includes a list of engineering-relevant accomplishments that have already been done by non-MI papers. It asks the reader how surprised and excited they would be if these things were accomplished by MI researchers analyzing circuits. This is meant to illustrate the point that different types of methods should not be graded on different curves.
EIS VII: A Challenge for Mechanists
This post comes with a GitHub repository.
The challenge is to take a network and use MI tools to recover the program implementing its labeling function (a sketch of how a candidate solution might be checked appears at the end of this section).
Challenge 1: an MNIST CNN.
Challenge 2: a 1-layer transformer.
This problem is still a toy problem, but it hopefully takes a step in the right direction. Unlike prior MI work from the AI safety interpretability community, a solution to this challenge would be the first example of mechanistically explaining a network’s solution to a task that was not cherrypicked by the researcher(s) doing the explaining.
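To make the challenge concrete, here is a minimal sketch of how a candidate solution might be checked: compare a hypothesized labeling program against the trained network’s outputs. The program logic, model, and data loader below are hypothetical stand-ins, not the actual challenge artifacts from the repository.

```python
import torch

def hypothesized_program(x: torch.Tensor) -> torch.Tensor:
    """A candidate reverse-engineered labeling function (placeholder logic)."""
    return (x.flatten(1).mean(dim=1) > 0.5).long()

@torch.no_grad()
def agreement(model: torch.nn.Module, loader) -> float:
    """Fraction of inputs on which the candidate program matches the network."""
    matches, total = 0, 0
    for x, _ in loader:
        net_labels = model(x).argmax(dim=1)
        matches += (hypothesized_program(x) == net_labels).sum().item()
        total += x.shape[0]
    return matches / total
```

Agreement alone would not settle the challenge, but a harness like this is the kind of check a mechanistic explanation should be able to pass.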
EIS VIII: An Engineer’s Understanding of Deceptive Alignment
Deceptive alignment can be similar to non-deceptive forms of misalignment.
Deceptive alignment and other insidious inner alignment failures are issues where the model will betray us as the result of (1) a trigger that is hard to find during development and (2) probably a substantial amount of inner neural circuitry.
This post uses the term “insidious misalignment” to refer to this more general set of failures.
Detecting and fixing deceptive alignment is a very similar (but not entirely equivalent) technical challenge to detecting and fixing trojans.
Mechanistic interpretability is not uniquely capable of fixing insidious misalignment.
Latent adversarial training is a neglected but potentially very useful alternative.
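For concreteness, here is a minimal sketch of what a latent adversarial training step could look like: the adversarial perturbation is applied to hidden activations rather than to inputs. The toy encoder/head split, the single-step inner attack, and the hyperparameters are illustrative assumptions, not a claim about any particular published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical model split into an encoder and a classification head.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
head = nn.Linear(128, 10)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)

def lat_step(x, y, eps=0.1):
    h = encoder(x)

    # Inner step: find a small latent perturbation that increases the loss.
    delta = torch.zeros_like(h, requires_grad=True)
    loss_adv = F.cross_entropy(head(h.detach() + delta), y)
    (grad,) = torch.autograd.grad(loss_adv, delta)
    delta = eps * grad.sign()

    # Outer step: train the model to behave well under that perturbation.
    loss = F.cross_entropy(head(h + delta), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with a placeholder batch:
# lat_step(torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,)))
```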
EIS IX: Interpretability and Adversaries
The studies of interpretability and adversaries are inseparable.
More interpretable networks are more adversarially robust, and more adversarially robust networks are more interpretable.
Interpretability tools can and should be used to guide the design of adversaries.
Adversarial examples can be useful interpretability tools (see the FGSM sketch at the end of this section).
Mechanistic interpretability and mechanistic adversarial examples are uniquely equipped for addressing deception and other insidious misalignment failures.
Working to understand and debug inner mechanisms is key to making progress on insidious misalignment.
Adversarial examples are (almost always) features, not bugs.
Evidence and nonevidence discussed includes...
Robustness <--> interpretability
Inconclusive
Adversarial transferability
Inconclusive
Adversarial training and task performance
Inconclusive
Generalization from training on nonrobust features
Strong evidence for the feature hypothesis
Genuine nonrobust features
An existence proof for buggy features + weak evidence for the feature hypothesis.
Observations related to superposition
Moderate evidence for the feature hypothesis.
Evidence from the neural tangent kernel
Weak evidence for the feature hypothesis
What does it all mean for interpretability?
Humans may never be able to fully understand or explain nonrobust networks.
The existence of useful nonrobust features suggests that it will be important to do more intrinsic interpretability and robustness work and to not rely on methods that hinge on humans.
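To make the adversarial-example side of this section concrete, here is a minimal sketch of FGSM, the classic gradient-sign attack. The classifier it expects and the epsilon budget are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Return an adversarial version of x under an L-infinity budget eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to valid pixels.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```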
EIS X: Continual Learning, Modularity, Compression, and Biological Brains
The science of interpretability is part of a larger picture.
There are rich connections between interpretability, adversaries, continual learning, modularity, compression, and biological brains.
Studying the connections between these areas may be a rich mine for insights and methods.
EIS XI: Moving Forward
This post discusses ideas for future work, including projects that I will be working on soon. Please let me know if you would like to talk about or work on any of these ideas.
Early in the post is a reading list of ~20 past works that have done engineering-relevant things.
Key directions for work
Benchmarking
I will be working on some benchmarking work for LLM interpretability soon.
Red teaming in the wild
This seems useful for applying tools and making interpretability work easier to incorporate into policy frameworks via auditing.
Finding new tools via exploratory work
This could be done by some existing approaches.
Also by research at the intersections of interpretability and other fields.
Combining existing tools
This seems highly important, tractable, and neglected.
Automated translation/distillation of networks into programs
This may be the biggest and most important challenge for MI.
Latent adversarial training
This seems very important and neglected. I will be working on this soon.
Thanks
I hope you enjoyed this sequence and found some useful ideas. Let me know if you’d like to talk about interpretability, adversaries, etc. sometime.
I am very thankful for TAISIC and others in the AI safety space for doing important and interesting work. For me personally, TAISIC members have been excellent sources of inspiration and collaboration, and I’m glad to be a part of this community.