Strong disagree. Can’t say I’ve worked through the entire article in detail but wanted to chime in as one of the many of junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin’s idea that we’re creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can’t speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it’s important to steelman those kinds of claims / expand on them before responding.
A few extra notes:
The first point I want to address your endorsement of “retargeting the search” and finding the “motivational API” within AI systems which is my strongest motivator for working in interpretability.
This is interesting because this would be a way to not need to fully reverse engineer a complete model.The technique used in Understanding and controlling a maze-solving policy network seems promising to me. Just focusing on “the motivational API” could be sufficient.
I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require progress more generally in the field.
We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, and yet science has led us to impressive feats that might have been unimaginable years prior. For example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science we take such technologies for granted and I think this leads to a general over-skepticism about fields like interpretability.
The second major point I wanted to address was this argument:
Determining the dangerousness of a feature is a mis-specified problem. Searching for dangerous features in the weights/structures of the network is pointless. A feature is not inherently good or bad. The danger of individual atoms is not a strong predictor of the danger of assembly of atoms and molecules. For instance, if you visualize the feature of layer 53, channel 127, and it appears to resemble a gun, does it mean that your system is dangerous? Or is your system simply capable of identifying a dangerous gun? The fact that cognition can be externalized also contributes to this point.
I agree that it makes little sense to think of a feature on it’s own as dangerous but I it sounds to me like you are making a point about emergence. If understanding transistors doesn’t lead to understanding computer software then why work so hard to understand transistors?
I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.
I’m not sure of what you meant about studying transistors.
It seems to me to me that if we are studying transistors so hard, it’s to push computers capabilities (faster, smaller, more energy efficient etc.), and not at all to make software safer. Instead to make software safer, we use anti-viruses, automatic testing, developer liability, standards, regulations, pop-up warnings, etc.
Strong disagree. Can’t say I’ve worked through the entire article in detail but wanted to chime in as one of the many of junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin’s idea that we’re creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can’t speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it’s important to steelman those kinds of claims / expand on them before responding.
A few extra notes:
The first point I want to address your endorsement of “retargeting the search” and finding the “motivational API” within AI systems which is my strongest motivator for working in interpretability.
I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require progress more generally in the field.
We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, and yet science has led us to impressive feats that might have been unimaginable years prior. For example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science we take such technologies for granted and I think this leads to a general over-skepticism about fields like interpretability.
The second major point I wanted to address was this argument:
I agree that it makes little sense to think of a feature on it’s own as dangerous but I it sounds to me like you are making a point about emergence. If understanding transistors doesn’t lead to understanding computer software then why work so hard to understand transistors?
I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.
I’m not sure of what you meant about studying transistors.
It seems to me to me that if we are studying transistors so hard, it’s to push computers capabilities (faster, smaller, more energy efficient etc.), and not at all to make software safer. Instead to make software safer, we use anti-viruses, automatic testing, developer liability, standards, regulations, pop-up warnings, etc.