The worry I’d have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaining 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?
The “Zoom In” work is aimed at understanding what’s going on in neural networks as a scientific question, not directly tackling mesa-optimization. This work is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.
As the original article put it:
And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.
Or, as I put it in Embedded Curiosities:
One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern.
A better understanding of ‘circuits’ in the sense of Zoom In could yield unexpected fruits in terms of safety. But to name an expected direction: having understood the algorithms expressed by 95% of a neural network, one could re-implement them independently, yielding a totally transparent algorithm. An obvious further question is how much of a performance hit we take by discarding the 5% we don’t understand. (If it’s too large, that’s also a significant point against the idea that the ‘circuits’ methodology is really providing much understanding of the deep NN from a scientific point of view.)
I’m not claiming that doing that would eliminate all safety concerns with the resulting reimplementation, of course. Only that it would address the specific concern you mention.
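To make that performance-hit question concrete, here’s a minimal sketch of how one might measure it, with zeroing-out standing in for actually re-implementing only the understood circuits. The model, dataloader, and `circuit_masks` (boolean masks marking which weights the circuit analysis accounts for) are all assumed inputs, not anything from Zoom In itself:

```python
# Sketch: estimate the performance hit from discarding the weights we don't understand.
# Assumes a trained PyTorch classifier and a hypothetical `circuit_masks` dict mapping
# parameter names to boolean tensors that are True for weights covered by identified circuits.
import copy
import torch

def accuracy(model, loader, device="cpu"):
    """Plain classification accuracy over a dataloader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            preds = model(x.to(device)).argmax(dim=-1)
            correct += (preds == y.to(device)).sum().item()
            total += y.numel()
    return correct / total

def keep_only_understood(model, circuit_masks):
    """Zero out every weight *not* covered by an identified circuit."""
    pruned = copy.deepcopy(model)
    with torch.no_grad():
        for name, param in pruned.named_parameters():
            if name in circuit_masks:
                param.mul_(circuit_masks[name].to(param.dtype))
    return pruned

# Usage, given some trained `model`, `test_loader`, and `circuit_masks`:
# full_acc   = accuracy(model, test_loader)
# pruned_acc = accuracy(keep_only_understood(model, circuit_masks), test_loader)
# print(f"performance hit: {full_acc - pruned_acc:.3f}")
```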
[not affiliated with the author but have thought a fair bit about this sort of thing]
Once you understand what causes these circuits to arise, you could hopefully regularise for interpretability and boost that rookie 95% number up to 99.9%, where you could really believe that pruning the rest isn’t a big deal.
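One hedged sketch of what that could look like in the loss (the penalty term here is only a crude stand-in; what the right interpretability regulariser actually is remains the open question):

```python
# Sketch: "regularise for interpretability" as an extra term in the training loss.
# The L1 sparsity penalty below is just a crude placeholder that concentrates
# computation into fewer weights; it is not a real interpretability regulariser.
import torch

def training_loss(model, batch, task_loss_fn, interp_weight=1e-4):
    x, y = batch
    task_loss = task_loss_fn(model(x), y)
    interp_penalty = sum(p.abs().sum() for p in model.parameters())
    return task_loss + interp_weight * interp_penalty
```

The hope being that with the right term in that slot, the fraction of the network your circuit analysis accounts for gets high enough that pruning the remainder is clearly harmless.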
I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I’m only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.
Given that assumption, I think it’s possible to translate 95% transparency into a safety guarantee: use your transparency tools to produce a consistent gradient away from deception, such that your model never becomes deceptive in the first place and thus never does any sort of adversarial obfuscation.[1] I suspect that the right way to do this is to use your transparency tools to enforce some simple condition that you are confident rules out deception, such as myopia. For more context, see my comment here and the full “Relaxed adversarial training for inner alignment” post.
It is worth noting that this does introduce the possibility of getting obfuscation by overfitting the transparency tools, though I suspect that that sort of overfitting-style obfuscation will be significantly easier to deal with than actively adversarial obfuscation by a deceptive mesa-optimizer.
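As a sketch of that training setup (the transparency-based check is entirely hypothetical; building it is the hard part, and the placeholder below does no real work):

```python
# Sketch: penalise the model during training using a transparency-based check,
# so the optimiser is consistently pushed away from deception-like structure
# rather than us trying to catch deception after the fact.
import torch

def transparency_penalty(model) -> torch.Tensor:
    """Hypothetical: a differentiable score that is high when transparency tools
    flag structure inconsistent with e.g. myopia. No such tool exists today;
    this placeholder just returns zero."""
    return torch.zeros(())

def loss_with_oversight(model, batch, task_loss_fn, oversight_weight=1.0):
    x, y = batch
    return task_loss_fn(model(x), y) + oversight_weight * transparency_penalty(model)
```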
I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I’m only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.
I read “Adversarial Examples Are Not Bugs, They Are Features” as suggesting that this sort of thing happens by default, and the main question is “sure, some of it happens by default, but can really big stuff happen by default?”. But if you imagine an LSTM implementing a finite state machine, or something like that, it seems quite possible to me that it will mostly be hard to unravel rather than easy to unravel, while still being a relevant part of the computation.
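For a toy version of that contrast: the same parity finite state machine, written once as explicit code and once run through an (untrained, purely illustrative) LSTM, where the “state” is a distributed continuous vector rather than a named discrete variable:

```python
# Toy contrast: an explicit 2-state parity FSM vs. the kind of recurrent computation
# an LSTM performs, where the state is spread across continuous hidden units.
import torch

def parity_fsm(bits):
    """Explicit finite state machine: the state is one readable bit."""
    state = 0
    for b in bits:
        state ^= b
    return state

lstm = torch.nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
bits = [1, 0, 1, 1, 0]
x = torch.tensor(bits, dtype=torch.float32).view(1, -1, 1)
_, (h, _) = lstm(x)

print("FSM state:", parity_fsm(bits))  # a single bit
print("LSTM hidden state:", h.shape)   # 16 continuous numbers; a trained LSTM solving
                                       # parity would encode that bit somewhere in here,
                                       # in a way circuit analysis has to recover.
```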