A Longlist of Theories of Impact for Interpretability
I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I’ve compiled a longlist of 19 different arguments I’ve heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I’m sharing them in the hope that they’re interesting to people.
(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision or a small number of categories, and expect I could cut this down substantially with effort.)
Credit to Evan Hubinger for writing the excellent Chris Olah’s Views on AGI Safety, which was the source of several of these arguments!
Force-multiplier on alignment research: We can analyse a model to see why it gives misaligned answers, and what’s going wrong. This gives much richer data from empirical alignment work, and lets it progress faster.
Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This lets us better extrapolate from current systems to future systems, in a similar sense to scaling laws.
Eg, observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training
Auditing: We get a mulligan. After training a system, we can check it for misalignment, and only deploy it if we’re confident it’s safe.
Auditing for deception: Similar to auditing, we may be able to detect deception in a model.
This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features—I see this more as a theory of change for ‘worlds where interpretability is harder than I hope’
Enabling coordination/cooperation: If different actors can interpret each other’s systems, it’s much easier to trust other actors to behave sensibly and coordinate better
Empirical evidence for/against threat models: We can look for empirical examples of theorised future threat models, eg inner misalignment
Coordinating work on threat models: If we can find empirical examples of eg inner misalignment, it seems much easier to convince skeptics this is an issue, and maybe get more people to work on it.
Coordinating a slowdown: If alignment is really hard, it seems much easier to coordinate caution/a slowdown of the field with eg empirical examples of models that seem aligned but are actually deceptive
Improving human feedback: Rather than training models to just do the right things, we can train them to do the right things for the right reasons
Informed oversight: We can improve recursive alignment schemes like IDA by having each step include checking the system is actually aligned
Note: This overlaps a lot with 7. To me, the distinction is that 7 can also be applied to systems trained non-recursively, eg today’s systems trained with Reinforcement Learning from Human Feedback.
Interpretability tools in the loss function: We can directly put an interpretability tool into the training loop to ensure the system is doing things in an aligned way (see the minimal sketch after this list).
Ambitious version—the tool is so good that it can’t be Goodharted
Less ambitious—the tool could be Goodharted, but doing so is expensive, and this shifts the inductive biases to favour aligned cognition
Norm setting: If interpretability is easier, there may be expectations that, before a company deploys a system, part of doing due diligence is interpreting the system and checking it does what you want
Enabling regulation: Regulators and policy-makers can create more effective regulations around how aligned AI systems must be if they (or the companies themselves) can use interpretability tools to audit those systems
Cultural shift 1: If the field of ML shifts towards having a better understanding of models, this may lead to a better understanding of failure cases and how to avoid them
Cultural shift 2: If the field expects better understanding of how models work, it’ll become more glaringly obvious how little we understand right now
Quote: Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.
Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is ‘actually understanding how our black box systems work’ not helpful?
Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves
Training AIs to interpret other AIs: Even if interpretability is really hard/labour-intensive on advanced systems, if we can create aligned AIs near human level, we can give them interpretability tools and use them to interpret more powerful systems
Forecasting discontinuities: By understanding what’s going on, we can predict how likely we are to see discontinuities in alignment/capabilities, and potentially detect a discontinuity while training/before deploying a system
Intervening on training: By interpreting a system during training, we can notice misalignment early on, potentially before it’s good enough to use strategies that avoid our notice, such as deceptive alignment, gradient hacking, obfuscating its thoughts, etc.
Auditing a training run: By checking for misalignment early in training, we can stop training systems that seem misaligned. This gives us many more shots to make an aligned system without spending large amounts of capital, and eg allows us to try multiple different schemes, initialisations, etc. This essentially shifts the distribution of systems towards alignment.
Eliciting Latent Knowledge: Use the length of the shortest interpretability explanation of the model’s behaviour as a training loss for ELK—the idea is that models with shorter explanations are less likely to include human simulations, and you can more easily tell if they do. (Credit to Tao Lin for this one.)
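To make the ‘interpretability tools in the loss function’ idea above slightly more concrete, here is a minimal sketch of what putting such a tool into the training loop might look like. Everything here is a hypothetical stand-in: the `misalignment_score` probe, the `penalty_weight`, and the assumption that the model returns its internal activations are illustrative assumptions, not an existing tool or a recipe.

```python
import torch
import torch.nn.functional as F

def misalignment_score(activations: torch.Tensor, probe_direction: torch.Tensor) -> torch.Tensor:
    """Hypothetical interpretability probe: projects internal activations onto a
    separately learned 'misalignment direction' and returns a scalar score in [0, 1]."""
    return torch.sigmoid(activations @ probe_direction).mean()

def training_step(model, batch, optimizer, probe_direction, penalty_weight=0.1):
    """One gradient step on task loss plus an interpretability-based penalty.
    Assumes `model` returns (logits, activations); this interface is illustrative."""
    optimizer.zero_grad()
    logits, activations = model(batch["inputs"])
    task_loss = F.cross_entropy(logits, batch["labels"])
    # The penalty term pushes gradient descent away from regions of model space
    # that the interpretability tool flags as misaligned-looking.
    loss = task_loss + penalty_weight * misalignment_score(activations, probe_direction)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the Goodharting worry from the ‘less ambitious’ framing applies directly to a setup like this: once the probe’s score is part of the loss, optimisation pressure acts on it, so at best it shifts inductive biases rather than providing a guarantee.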
Exercise: I think a really useful quick exercise for eliciting inside views is to go through this list and quickly rate each theory of impact out of 10. I ran a small discussion group where we did this; you can see the results here—note that these are very off-the-cuff hot takes, not something you should take seriously or expect participants to stand by later on! The rated order was 1, 4, 3, 6, 8, 18, 2, 7, 16, 19, 17, 10, 9, 12, 14, 13, 15, 11, 5 (ie 1=‘a force-multiplier on alignment research’ is the best). Interestingly, 3 (auditing a fully-trained model) was the most controversial of the highly rated ones.
I’ve made a publicly editable spreadsheet for you to add in your responses!
The most interesting substantive disagreement I found in the discussion was that I was comparatively much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be used directly in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model that is impossible to detect via its outputs may be easy to detect via interpretability tools (analogy: if I knew that sophisticated aliens were reading my mind, I would have no clue how to think deceptive thoughts in a way that evades their tools!).

The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts if you realise that someone will be scanning your mind, eg by developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception generating a lot of cached thoughts that will lead to misaligned actions later, with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all: even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.
I’d love to hear other people’s ratings and why! And any arguments that you think I’ve missed.
Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don’t think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post’s existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to do.
Object level: I think the key point I wanted to make with this post was “there’s a bunch of ways that interp can be helpful”, which I think basically stands. I go back and forth on how much it’s valuable to think about theories of impact day to day, vs just trying to do good science and pluck impactful low-hanging fruit, but I think that either way it’s valuable to have a bunch in mind rather than carefully back-chaining from a specific and fragile theory of change.
This post got some extensive criticism in Against Almost Every Theory of Impact of Interpretability, but I largely agree with Richard Ngo and Rohin Shah’s responses.
This is a great reference for the importance of, and the excitement around, Interpretability.
I just read this for the first time today. I’m currently learning about Interpretability in hopes I can participate, and this post solidified my understanding of how Interpretability might help.
The whole field of Interpretability is a test of this post. Some of the theories of change won’t pan out. Hopefully many will. Perhaps more theories not listed will be discovered.
One idea I’m surprised wasn’t mentioned is the potential for Interpretability to supercharge all of the sciences by allowing humans to extract the things that machine learning models discovered to make their predictions. I remember Chris Olah being excited about this possibility on the 80k Podcast, and that excitement meme has spread to me. Current AIs know so much about how the world works, but we can only use that knowledge indirectly through their black box interface. I want that knowledge for myself and for humanity! This is another incentive for Interpretability, and although it isn’t a development that clearly leads to “AI less likely to kill us”, it will make humanity wiser, more prosperous, and on more even footing with the AIs.
Nanda’s post probably deserves a spot in a compilation of Alignment plans.
Thanks for the kind words! I’d class “interp supercharging other sciences” under Microscope AI.
This might just be semantics though
I’d call that “underselling it”! Your description of Microscope AI may be accurate, but even I didn’t realize you meant “supercharging science”, and I was looking for it in the list!