A Longlist of Theories of Impact for Interpretability
I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I’ve compiled a longlist of 19 different arguments I’ve heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I’m sharing them in the hope that they’re interesting to people.
(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision or a small number of categories, and expect I could cut this down substantially with effort.)
Credit to Evan Hubinger for writing the excellent Chris Olah’s Views on AGI Safety, which was the source of several of these arguments!
Force-multiplier on alignment research: We can analyse a model to see why it gives misaligned answers, and what’s going wrong. This gives much richer data from empirical alignment work, and lets it progress faster.
Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This lets us better extrapolate from current systems to future systems, in a similar sense to scaling laws.
Eg, observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training
Auditing: We get a mulligan. After training a system, we can check it for misalignment, and only deploy it if we’re confident it’s safe.
Auditing for deception: Similar to auditing, we may be able to detect deception in a model.
This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/features—I see this more as a theory of change for ‘worlds where interpretability is harder than I hope’
Enabling coordination/cooperation: If different actors can interpret each other’s systems, it’s much easier to trust other actors to behave sensibly and coordinate better
Empirical evidence for/against threat models: We can look for empirical examples of theorised future threat models, eg inner misalignment
Coordinating work on threat models: If we can find empirical examples of eg inner misalignment, it seems much easier to convince skeptics this is an issue, and maybe get more people to work on it.
Coordinating a slowdown: If alignment is really hard, it seems much easier to coordinate caution/a slowdown of the field with eg empirical examples of models that seem aligned but are actually deceptive
Improving human feedback: Rather than training models to just do the right things, we can train them to do the right things for the right reasons
Informed oversight: We can improve recursive alignment schemes like IDA by having each step include checking the system is actually aligned
Note: This overlaps a lot with 7. To me, the distinction is that 7 can also be applied to systems trained non-recursively, eg today’s systems trained with Reinforcement Learning from Human Feedback.
Interpretability tools in the loss function: We can directly put an interpretability tool into the training loop to ensure the system is doing things in an aligned way (see the minimal sketch after this list).
Ambitious version—the tool is so good that it can’t be Goodharted
Less ambitious—the tool could be Goodharted, but doing so is expensive, and this shifts the inductive biases to favour aligned cognition
Norm setting: If interpretability is easier, there may be expectations that, before a company deploys a system, part of doing due diligence is interpreting the system and checking it does what you want
Enabling regulation: Regulators and policy-makers can create more effective regulations around how aligned AI systems must be if they (or the companies themselves) can use interpretability tools to audit those systems
Cultural shift 1: If the field of ML shifts towards having a better understanding of models, this may lead to a better understanding of failure cases and how to avoid them
Cultural shift 2: If the field expects better understanding of how models work, it’ll become more glaringly obvious how little we understand right now
Quote: Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.
Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is ‘actually understanding how our black box systems work’ not helpful?
Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves
Training AIs to interpret other AIs: Even if interpretability is really hard/labour-intensive on advanced systems, if we can create aligned AIs near human level, we can give them interpretability tools and use them to interpret more powerful systems
Forecasting discontinuities: By understanding what’s going on, we can predict how likely we are to see discontinuities in alignment/capabilities, and potentially detect a discontinuity while training/before deploying a system
Intervening on training: By interpreting a system during training, we can notice misalignment early on, potentially before it’s good enough to use strategies that avoid our notice, such as deceptive alignment, gradient hacking, obfuscating its thoughts, etc.
Auditing a training run: By checking for misalignment early in training, we can stop training systems that seem misaligned. This gives us many more shots to make an aligned system without spending large amounts of capital, and eg allows us to try multiple different schemes, initialisations, etc. This essentially shifts the distribution of systems towards alignment.
Eliciting Latent Knowledge: Use the length of the shortest interpretability explanation of the model’s behaviour as a training loss for ELK—the idea is that models with shorter explanations are less likely to include human simulations, and you can more easily tell if they do. (Credit to Tao Lin for this one.)
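To make the ‘interpretability tools in the loss function’ idea above slightly more concrete, here is a minimal sketch of what putting such a tool into the training loop might look like. Everything here is a hypothetical stand-in: the `misalignment_score` probe, the `penalty_weight`, and the assumption that the model returns its internal activations are illustrative assumptions, not an existing tool or a recipe.

```python
import torch
import torch.nn.functional as F

def misalignment_score(activations: torch.Tensor, probe_direction: torch.Tensor) -> torch.Tensor:
    """Hypothetical interpretability probe: projects internal activations onto a
    separately learned 'misalignment direction' and returns a scalar score in [0, 1]."""
    return torch.sigmoid(activations @ probe_direction).mean()

def training_step(model, batch, optimizer, probe_direction, penalty_weight=0.1):
    """One gradient step on task loss plus an interpretability-based penalty.
    Assumes `model` returns (logits, activations); this interface is illustrative."""
    optimizer.zero_grad()
    logits, activations = model(batch["inputs"])
    task_loss = F.cross_entropy(logits, batch["labels"])
    # The penalty term pushes gradient descent away from regions of model space
    # that the interpretability tool flags as misaligned-looking.
    loss = task_loss + penalty_weight * misalignment_score(activations, probe_direction)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the Goodharting worry from the ‘less ambitious’ framing applies directly to a setup like this: once the probe’s score is part of the loss, optimisation pressure acts on it, so at best it shifts inductive biases rather than providing a guarantee.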
Exercise: I think a really useful quick exercise for eliciting inside views is to go through this list and quickly rate each theory of impact out of 10. I ran a small discussion group where we did this; you can see the results here—note that these are very off-the-cuff hot takes, not something you should take seriously or expect participants to stand by later on! The rated order was 1, 4, 3, 6, 8, 18, 2, 7, 16, 19, 17, 10, 9, 12, 14, 13, 15, 11, 5 (ie 1=‘a force-multiplier on alignment research’ is the best). Interestingly, 3 (auditing a fully-trained model) was the most controversial of the highly rated ones.
I’ve made a publicly editable spreadsheet for you to add in your responses!
The most interesting substantive disagreement I found in the discussion was that I was comparatively much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be used directly in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model that is impossible to detect via its outputs may be easy to detect via interpretability tools (analogy: if I knew that sophisticated aliens were reading my mind, I would have no clue how to think deceptive thoughts in a way that evades their tools!).

The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts if you realise that someone will be scanning your mind, eg by developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception generating a lot of cached thoughts that will lead to misaligned actions later, with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all: even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.
I’d love to hear other people’s ratings and why! And any arguments that you think I’ve missed.
Meta level: I wrote this post in 1-3 hours, and am very satisfied with the returns per unit time! I don’t think this is the best or most robust post I could have written, and I think some of these theories of impact are much more important than others. But I think that just collecting a ton of these in the same place was a valuable thing to do, and have heard from multiple people who appreciated this post’s existence! More importantly, it was easy and fun, and I personally want to take this as inspiration to find more, easy-to-write-yet-valuable things to do.
Object level: I think the key point I wanted to make with this post was “there’s a bunch of ways that interp can be helpful”, which I think basically stands. I go back and forth on how much it’s valuable to think about theories of impact day to day, vs just trying to do good science and pluck impactful low-hanging fruit, but I think that either way it’s valuable to have a bunch in mind rather than carefully back-chaining from a specific and fragile theory of change.
This post got some extensive criticism in Against Almost Every Theory of Impact of Interpretability, but I largely agree with Richard Ngo and Rohin Shah’s responses.
This is a great reference for the importance of, and the excitement around, Interpretability.
I just read this for the first time today. I’m currently learning about Interpretability in hopes I can participate, and this post solidified my understanding of how Interpretability might help.
The whole field of Interpretability is a test of this post. Some of the theories of change won’t pan out. Hopefully many will. Perhaps more theories not listed will be discovered.
One idea I’m surprised wasn’t mentioned is the potential for Interpretability to supercharge all of the sciences by allowing humans to extract the things that machine learning models discovered to make their predictions. I remember Chris Olah being excited about this possibility on the 80k Podcast, and that excitement meme has spread to me. Current AIs know so much about how the world works, but we can only use that knowledge indirectly through their black box interface. I want that knowledge for myself and for humanity! This is another incentive for Interpretability, and although it isn’t a development that clearly leads to “AI less likely to kill us”, it will make humanity wiser, more prosperous, and on more even footing with the AIs.
Nanda’s post probably deserves a spot in a compilation of Alignment plans.
Thanks for the kind words! I’d class “interp supercharging other sciences” under Microscope AI.
This might just be semantics though
I’d call that “underselling it”! Your description of Microscope AI may be accurate, but even I didn’t realize you meant “supercharging science”, and I was looking for it in the list!