I’ve played around with ELK and so on, and my impression is that we don’t know how to read the hidden_states/residual stream reliably (beyond roughly 80% accuracy, which isn’t good enough). But learning a sparse representation (e.g. the sparse autoencoders paper from Anthropic) helps, and is seen as quite promising by people across the field (for example, EleutherAI has a research effort).
So this does seem quite promising. Sadly, we would really need a foundation model of 7B+ parameters to test it well, which is quite expensive in terms of compute.
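For concreteness, here’s a minimal sketch of the kind of sparse autoencoder setup I mean. The dimensions, hyperparameters, and the random “activations” are all made up for illustration; the real thing would train on residual-stream activations collected from a frozen model.

```python
# Rough sketch of a sparse autoencoder on residual-stream activations,
# in the spirit of Anthropic's sparse autoencoders work.
# All sizes/hyperparameters here are illustrative assumptions, not from any paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU keeps features non-negative, which plays well with the L1 sparsity penalty.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 term that pushes most features to zero.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# In practice `acts` would be hidden_states pulled from a frozen LLM,
# shape (num_tokens, d_model); random data here just to show the training loop.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)
for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = sae_loss(batch, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope is that the learned features end up far more monosemantic (and hence readable) than raw residual-stream directions, but checking that at the scale of a 7B+ foundation model is exactly the expensive part.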
Does this paper come with extra train-time compute costs?
Their interpretability results seem to be in section 4.3, and there are quite a number of them from a variety of different networks. I’m not very familiar with interpretability for image nets, but the model spontaneously learning to do high-quality image segmentation was fairly striking, as was its spontaneous subsegmentation of objects into parts, such as one neuron that responded to animals’ heads and another that responded to their legs. That certainly looks like what I’d hope for if told that an image network was highly interpretable. The largest LLMs they trained were around the size of BERT and GPT-2, so their language performance was fairly limited.