Fun, informative, definitely rambling.
I think this is the sort of thing you should expect to work fine even if you can’t tell if a future AI is deceiving you, so I basically agree with the authors’ prognostication more than yours. I think for more complicated questions like deception, mechanistic understanding and human interpretability will start to come apart. Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.
Idk, I definitely agree that all data so far is equally consistent with ‘mechanistic interp will scale up to identifying whether GPT-N is deceiving us’ and with ‘MI will work on easy problems but totally fail on hard stuff’. But I do take this as evidence in favour of it working really well.
What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception?
Not sure what you mean by "Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa".
I’d argue that most of the updating should already have been done, not even based on Chris Olah’s work, but on neuroscientists working out things like the toad’s prey-detection circuits.
You seem pretty motivated by understanding in detail why and how NNs do particular things. But I think this doesn’t scale to interpreting complicated world-modeling, and that what we’ll want is methods that tell us abstract properties without us needing to understand the details. To some aesthetics, this is unsatisfying.
E.g. suppose we do the obvious extensions of ROME to larger sections of the neural network rather than just one MLP layer, driven by larger amounts of data. That seems like it has more power for detection or intervention, but it only helps us understand the micro-level a little, and doesn’t require that micro-level understanding either.
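(For concreteness, here is a minimal sketch of the kind of multi-layer, rank-one edit being gestured at. This is a toy illustration rather than the actual ROME procedure: real ROME weights the key direction by a key covariance estimated from a corpus and picks the edit layer via causal tracing, whereas this sketch assumes an identity covariance, made-up layer matrices, and hypothetical names throughout.)

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v_target: torch.Tensor) -> torch.Tensor:
    """Rank-one update so that the edited matrix maps key k to v_target.

    Simplification vs. real ROME: an identity key covariance is assumed instead of
    one estimated from a corpus, so the correction is applied purely along k.
    """
    residual = v_target - W @ k                 # what W currently gets wrong for this key
    delta = torch.outer(residual, k) / (k @ k)  # rank-one correction in the k direction
    return W + delta

def edit_layer_range(down_projs, keys, targets):
    """Spread the edit over several MLP down-projections instead of a single layer."""
    return [rank_one_edit(W, k, v) for W, k, v in zip(down_projs, keys, targets)]

# Toy usage: three hypothetical "MLP down-projection" matrices, one (key, target) pair each.
torch.manual_seed(0)
down_projs = [torch.randn(8, 16) for _ in range(3)]
keys = [torch.randn(16) for _ in range(3)]
targets = [torch.randn(8) for _ in range(3)]

edited = edit_layer_range(down_projs, keys, targets)
for W_new, k, v in zip(edited, keys, targets):
    assert torch.allclose(W_new @ k, v, atol=1e-4)  # each edited layer now maps its key to the target
```

The relevant point for the argument above: each edit is computed in closed form from a (key, target) pair, so the procedure gains intervention power without anyone needing to understand what the intermediate neurons are doing.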
If the extended-ROME example turns out to be possible to understand mechanistically through some lens, in an amount of work that’s actually sublinear in the range of examples you want to understand, that would be strong evidence to me. If instead it’s hard to understand even particular behaviors, and it doesn’t get easier and easier as you go on (e.g. each new thing might be an unrelated problem to the previous ones, or, even worse, you might run out of low-hanging fruit), that would be the other world.
Thanks for clarifying your position, that all makes sense.
Huh, can you say more about this? I’m not familiar with that example (though I have a fairly strong prior that there’s at best a weak association between specific neuroscience results and specific AI interp results).
I’m thinking of the paper Ewert 1987, which I know about because it spurred Dennett’s great essay Eliminate the Middletoad, though I don’t really know the gory details of it, sorry.
I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of “human science being able to find something interesting in situations kinda like this,” which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?