Fun, informative, definitely rambling.
I think this is the sort of thing you should expect to work fine even if you can’t tell if a future AI is deceiving you, so I basically agree with the authors’ prognostication more than yours. I think for more complicated questions like deception, mechanistic understanding and human interpretability will start to come apart. Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa.
Idk, I definitely agree that all data so far is equally consistent with ‘mechanistic interp will scale up to identifying whether GPT-N is deceiving us’ and with ‘MI will work on easy problems but totally fail on hard stuff’. But I do take this as evidence in favour of it working really well.
What kind of evidence could you imagine seeing that mechanistic understanding is actually sufficient for understanding deception?
Not sure what you mean by "Good methods might not excite you much in terms of the mechanistic clarity they provide, and vice versa".
I’d argue that most of the updating should already have been done, not even based on Chris Olah’s work, but on neuroscientists working out things like the toad’s prey-detection circuits.
You seem pretty motivated by understanding in detail why and how NNs do particular things. But I think this doesn’t scale to interpreting complicated world-modeling, and that what we’ll want is methods that tell us abstract properties without us needing to understand the details. To some aesthetics, this is unsatisfying.
E.g. suppose we do the obvious extensions of ROME to larger sections of the neural network rather than just one MLP layer, driven by larger amounts of data. That seems like it has more power for detection or intervention, but it only helps us understand the micro-level a little, and doesn’t require that micro-level understanding either.
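(For concreteness, here is a minimal sketch of the kind of multi-layer, rank-one edit being gestured at. This is a toy illustration rather than the actual ROME procedure: real ROME weights the key direction by a key covariance estimated from a corpus and picks the edit layer via causal tracing, whereas this sketch assumes an identity covariance, made-up layer matrices, and hypothetical names throughout.)

```python
import torch

def rank_one_edit(W: torch.Tensor, k: torch.Tensor, v_target: torch.Tensor) -> torch.Tensor:
    """Rank-one update so that the edited matrix maps key k to v_target.

    Simplification vs. real ROME: an identity key covariance is assumed instead of
    one estimated from a corpus, so the correction is applied purely along k.
    """
    residual = v_target - W @ k                 # what W currently gets wrong for this key
    delta = torch.outer(residual, k) / (k @ k)  # rank-one correction in the k direction
    return W + delta

def edit_layer_range(down_projs, keys, targets):
    """Spread the edit over several MLP down-projections instead of a single layer."""
    return [rank_one_edit(W, k, v) for W, k, v in zip(down_projs, keys, targets)]

# Toy usage: three hypothetical "MLP down-projection" matrices, one (key, target) pair each.
torch.manual_seed(0)
down_projs = [torch.randn(8, 16) for _ in range(3)]
keys = [torch.randn(16) for _ in range(3)]
targets = [torch.randn(8) for _ in range(3)]

edited = edit_layer_range(down_projs, keys, targets)
for W_new, k, v in zip(edited, keys, targets):
    assert torch.allclose(W_new @ k, v, atol=1e-4)  # each edited layer now maps its key to the target
```

The relevant point for the argument above: each edit is computed in closed form from a (key, target) pair, so the procedure gains intervention power without anyone needing to understand what the intermediate neurons are doing.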
If the extended-ROME example turns out to be possible to understand mechanistically through some lens, in an amount of work that’s actually sublinear in the range of examples you want to understand, that would be strong evidence to me. If instead it’s hard to understand even particular behaviors, and it doesn’t get easier and easier as you go on (e.g. each new thing might be an unrelated problem to the previous ones, or, even worse, you might run out of low-hanging fruit), that would be the other world.
Thanks for clarifying your position, that all makes sense.
Huh, can you say more about this? I’m not familiar with that example (though I have a fairly strong prior that there’s at best a weak association between specific neuroscience results and specific AI interp results).
I’m thinking of the paper Ewert 1987, which I know about because it spurred Dennett’s great essay Eliminate the Middletoad, though I don’t really know the gory details of it, sorry.
I agree the analogy is weak, and there can be disanalogies even between different ANN architectures. I think my intuition is based more on some general factor of “human science being able to find something interesting in situations kinda like this,” which is less dependent on facts of the systems themselves and more about, like, do we have a paradigm for interpreting signals in a big mysterious network?