I agree that reducing superposition is probably valuable even if it requires a significantly larger network. I still don’t understand why the transition from float to binary weights would cause a dramatic reduction in superposition capacity. But if it does prevent superposition, great! I’ll just give the network more parameters as needed. If we still get superposition anyway, I will need to apply other techniques to make it stop.
(I have not yet finished my closer re-read of Toy Models of Superposition after my initial skim. Perhaps once I do, I will understand better.)
Hopefully in a few months I will have empirical data on how many more neurons we need. Then I can stop hand-waving about vague intuitions.
If we can get the unwanted cognition/behaviors to sit entirely in their own section of weights, we can then ablate the unwanted behaviors without losing wanted capability. That’s my hope anyway.
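For concreteness, here is a minimal PyTorch sketch of that kind of ablation, assuming (optimistically) that the unwanted behavior really does live entirely in a known slice of hidden units. The architecture and the split point are hypothetical, chosen just to illustrate zeroing out one section of the weights:

```python
import torch
import torch.nn as nn

# Hypothetical model whose hidden layer is, by assumption, partitioned
# so that hidden units [split:] host only the unwanted behavior.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
split = 24  # assumed boundary between "wanted" and "unwanted" units

with torch.no_grad():
    # Ablate the unwanted section: zero its incoming weights and bias...
    model[0].weight[split:, :] = 0.0
    model[0].bias[split:] = 0.0
    # ...and the outgoing weights that read from it.
    model[2].weight[:, split:] = 0.0
```

If the partition assumption holds, the remaining units carry the wanted capability untouched; the hard part, of course, is getting the behaviors to separate like this in the first place.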
My thoughts and hope as well.
Current NN weight matrices are dense and continuously weighted. A significant part of the difficulty of interpretability is that they have all-to-all connections; it is difficult to verify that one activation does or does not affect another.
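A toy numpy illustration of the all-to-all problem (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # a typical dense weight matrix

# In y = W @ x, output j depends on input i whenever W[j, i] != 0.
# With continuous random weights, essentially every entry is nonzero,
# so there is no structural sparsity to anchor a claim like
# "activation i does not affect activation j".
print(np.count_nonzero(W), "of", W.size, "connections are nonzero")
```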
However, we can quantize the weights to 3 bits, and then we can probably melt the whole thing into pure combinational logic. While I am not entirely confident that this form is strictly better from an interpretability perspective, it is differently difficult.
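As a rough sketch of the quantization step (the particular scheme here, uniform symmetric rounding, is my assumption; real 3-bit schemes add per-channel scales, non-uniform grids, quantization-aware training, etc.):

```python
import numpy as np

def quantize_3bit(w):
    """Uniform symmetric quantization of weights to 8 levels (3 bits).

    A sketch of the general idea only, not a production scheme.
    """
    levels = 2 ** 3                      # 8 representable values
    scale = np.abs(w).max() / (levels // 2)
    q = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    return q.astype(np.int8), scale      # integer codes in [-4, 3] + one scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
q, scale = quantize_3bit(w)
w_hat = q * scale  # dequantized weights: every entry is one of 8 values
```

Once the weights are small integers (and the activations are likewise low-bit), each neuron computes a finite Boolean function of its input bits, which in principle can be tabulated or compiled down to gates.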
“Giant inscrutable matrices” are probably not the final form of current NNs; we can potentially turn them into a different and nicer form.