Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment relevant (we may have fully reverse-engineered some much smaller ‘model organisms’ by then, though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).
However, the arguments for mechanistic interpretability in my view are as follows:
1.) Model capacities probably follow some kind of Pareto principle -- 20% of the circuits do 80% of the work. If we can figure out these circuits in a TAI model, then we stand a good chance of catching many alignment-relevant behaviours such as deception, which necessarily require large-scale coordination across the network.
2.) Understanding lots of individual circuits and networks provides a crucial source of empirical bits about network behaviour and alignment at a mechanistic level, which we can’t get just by theorycrafting about alignment all day. To have a reasonable shot at actually solving alignment we need direct contact with reality, and interpretability is one of the main ways to get such contact.
3.) If we can figure out general methods for gaining mechanistic understanding of NN circuits, then we can design automated interpretability tools that substantially reduce the burden on humans. For instance, we might be able to make tools that rapidly identify the computational substrate of behaviour X, or all parts of the network that might be implementing deception, or things like this. This massively narrows down the search space that humans have to look at to check for safety (a toy sketch of the idea is below).
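To make the third point concrete, here is a minimal sketch of what an automated localisation tool might look like: rank the components of a network by how much ablating each one changes some behaviour metric, and hand only the top-ranked components to a human. Everything in it -- the tiny stand-in model, the "behaviour score", the zero-ablation scheme -- is a made-up illustration, not a real interpretability pipeline for TAI-scale models.

```python
# Toy sketch of automated interpretability: rank hidden units of a small
# network by how much ablating each one changes a behaviour metric.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: two hidden layers of 8 units each (purely illustrative).
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

x = torch.randn(64, 4)  # a batch of hypothetical inputs

def behaviour_score(logits: torch.Tensor) -> float:
    """Stand-in metric for 'behaviour X': mean logit of class 0."""
    return logits[:, 0].mean().item()

scores = []
with torch.no_grad():
    baseline = behaviour_score(model(x))
    for layer_idx in (0, 2):  # indices of the two hidden Linear layers
        layer = model[layer_idx]
        next_linear = model[layer_idx + 2]  # Linear consuming this layer's output
        for unit in range(layer.out_features):
            saved = next_linear.weight[:, unit].clone()
            next_linear.weight[:, unit] = 0.0   # ablate the unit's outgoing weights
            effect = abs(behaviour_score(model(x)) - baseline)
            next_linear.weight[:, unit] = saved  # restore
            scores.append(((layer_idx, unit), effect))

# The handful of units with the largest effect is where a human would look first.
for (layer_idx, unit), effect in sorted(scores, key=lambda s: -s[1])[:5]:
    print(f"layer {layer_idx}, unit {unit}: |Δ behaviour| = {effect:.4f}")
```

A real tool would need a far better attribution method than zero-ablation of single units, but the shape of the workflow is the point: the machine does the exhaustive search, the human only audits the shortlist.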
Yeah, I think these are good points. However, I think that #1 is actually misleading. If we measure “work” in loss or in bits, then yes, absolutely, we can probably figure out the components that reduce loss the most. But lots of very important cognition goes into getting the last 0.01 bits of loss in LLMs, which can have big impacts on the capabilities of the model and the semantics of the outputs. I’m pessimistic about human-understanding-based approaches to auditing such low-loss-high-complexity capabilities.
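For a rough sense of scale (a back-of-the-envelope illustration, not a number from the exchange above): an average per-token improvement of 0.01 bits corresponds to

$$\frac{p_{\text{after}}}{p_{\text{before}}} = 2^{0.01} \approx 1.007,$$

i.e. the model assigns each token well under one percent more probability on average. Capabilities hiding in that thin slice of loss are exactly the ones a “20% of circuits doing 80% of the work” search is liable to miss.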