Thanks for clarifying your position, that all makes sense.
I’d argue that most of the updating should already have been done, not even based on Chris Olah’s work, but on neuroscientists working out things like the toad’s prey-detection circuits.
Huh, can you say more about this? I’m not familiar with that example (though I have a fairly strong prior that there’s at best a weak association between specific neuroscience results and specific AI interp results).
I’m thinking of the paper Ewert 1987, which I know about because it spurred Dennett’s great essay Eliminate the Middletoad, but whose gory details I don’t really know, sorry.
I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of “human science being able to find something interesting in situations kinda like this,” which is less dependent on facts about the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?