The “Zoom In” work is aimed at understanding what’s going on in neural networks as a scientific question, not at directly tackling mesa-optimization. It is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.
As the original article put it:
“And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.”
Or, as I put it in Embedded Curiosities:
“One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern.”
A better understanding of ‘circuits’ in the sense of Zoom In could yield unexpected fruits in terms of safety. But to name an expected direction: if we understood the algorithms expressed by 95% of a neural network, we could re-implement those algorithms independently. This would yield a totally transparent algorithm. An obvious further question is how much of a performance hit we take by discarding the 5% we don’t understand. (If the hit is too large, that is also a significant point against the idea that the ‘circuits’ methodology is really providing much understanding of the deep NN from a scientific point of view.)
I’m not claiming that doing that would eliminate all safety concerns with the resulting reimplementation, of course. Only that it would address the specific concern you mention.
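To make the "performance hit" question concrete, here is a toy sketch in Python. Everything in it is a hypothetical stand-in rather than anything from Zoom In: the "network" is just a sum of hand-written component functions, `understood` plays the role of the circuits we can explain, `opaque` plays the role of the remainder we would discard, and the inputs and labels are made up for illustration.

```python
from typing import Callable, Sequence

Component = Callable[[float], float]
Classifier = Callable[[float], int]


def make_model(components: Sequence[Component]) -> Classifier:
    """Build a classifier that outputs 1 when the summed components are positive."""
    def predict(x: float) -> int:
        return int(sum(c(x) for c in components) > 0)
    return predict


def accuracy(predict: Classifier, inputs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of inputs on which the classifier matches the label."""
    correct = sum(predict(x) == y for x, y in zip(inputs, labels))
    return correct / len(inputs)


def performance_hit(original: Classifier, reimplementation: Classifier,
                    inputs: Sequence[float], labels: Sequence[int]) -> float:
    """Accuracy lost by replacing the original model with the reimplementation."""
    return accuracy(original, inputs, labels) - accuracy(reimplementation, inputs, labels)


# Hypothetical "understood" circuit: a simple sign detector.
understood = [lambda x: x]
# Hypothetical "opaque" remainder: a correction that only fires on a narrow range.
opaque = [lambda x: 0.5 if -0.5 < x < 0 else 0.0]

full_model = make_model(understood + opaque)   # the original network
transparent_model = make_model(understood)     # reimplementation of the understood part only

# Made-up evaluation set; the labels happen to match the full model's behaviour.
inputs = [-2.0, -1.0, -0.25, 0.25, 1.0, 2.0]
labels = [0, 0, 1, 1, 1, 1]

hit = performance_hit(full_model, transparent_model, inputs, labels)
print(f"performance hit from discarding the opaque part: {hit:.3f}")
```

In this toy case the transparent reimplementation gets one of the six inputs wrong, the one that depended on the discarded component. For a real network the interesting question is how large that number turns out to be on a realistic evaluation set.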