I second this request for a second opinion. It sounds interesting.
Without understanding it deeply, there are a few meta-markers of quality:
- they share their code
- they manage to get results that are competitive with normal transformers (ViTs on ImageNet, Tables 1 and 3).
However, while the paper claims interpretability, on a quick click-through I can't see many measurements or concrete examples of it. There are the self-attention maps in Fig. 17, but nothing else strikes me.
Agreed. I’m on my third detailed reading, working my way through the math and references (the paper is around 100 pages, so this isn’t a quick process), and will add more detail here on interpretability as I locate it. My recollection from previous readings is that the interpretability analysis they included was mostly of the kind commonly done on image-processing networks, not LLM interpretability, so I was less familiar with it.
I’ve played around with ELK and so on, and my impression is that we don’t know how to read the hidden_states/residual stream beyond roughly 80% accuracy, which isn’t good enough. But learning a sparse representation (e.g. the sparse autoencoders paper from Anthropic) helps, and is seen as quite promising by people across the field (for example, EleutherAI has a research effort). A minimal sketch of what I mean is below.
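To make the sparse-representation idea concrete, here is a minimal sketch of a sparse autoencoder trained on residual-stream activations. This is my own illustration, not the Anthropic implementation; the dimensions, the L1 coefficient, and the use of plain ReLU + L1 are all assumptions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over residual-stream activations.

    d_model and d_hidden are hypothetical sizes; the L1 penalty pushes the
    encoder toward sparse feature activations, which is what (hopefully)
    makes the learned features easier to interpret.
    """
    def __init__(self, d_model=768, d_hidden=4096, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, resid):
        # resid: [batch, d_model] activations collected from the model
        features = torch.relu(self.encoder(resid))      # sparse feature activations
        recon = self.decoder(features)                   # reconstruction of the residual stream
        recon_loss = (recon - resid).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().sum(dim=-1).mean()
        return recon, features, recon_loss + sparsity_loss

# Usage with stand-in data: in practice you would train on activations
# dumped from a real model, then inspect which inputs most strongly
# activate each learned feature.
sae = SparseAutoencoder()
resid_batch = torch.randn(32, 768)
_, features, loss = sae(resid_batch)
loss.backward()
```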
So this does seem quite promising. Sadly, we would really need a foundation model of 7B+ parameters to test it well, which is quite expensive in terms of compute.
Does this paper come with extra train-time compute costs?
Their interpretability results seem to be in Section 4.3, and there are quite a number of them from a variety of different networks. I’m not very familiar with interpretability for image networks, but the model spontaneously learning to do high-quality image segmentation was fairly striking, as was its spontaneous sub-segmentation of object parts, such as one neuron that responded to animals’ heads and another that responded to their legs. That is certainly what I would hope to see if told that an image network was highly interpretable. The largest LLMs they trained were around the size of BERT and GPT-2, so their capabilities were fairly limited.
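For readers less used to image-network interpretability, the kind of check involved is roughly the following sketch: reshape a single neuron’s per-patch activations back into the image grid and look at where it fires. The function name, shapes, and 14x14 patch grid are my assumptions for a generic ViT-style model, not the paper’s actual procedure.

```python
import torch

def neuron_activation_map(tokens, neuron_idx, grid_size=14):
    """Turn one neuron's per-patch activations into a spatial heatmap.

    tokens: [num_patches, d_model] activations from some ViT-style layer
    (hypothetical shapes; any CLS token assumed already removed). High
    values in the returned grid_size x grid_size map show where the neuron
    fires, which is one crude way to look for emergent segmentation or
    part-level selectivity (e.g. heads vs. legs).
    """
    acts = tokens[:, neuron_idx]          # [num_patches] activations for this neuron
    return acts.reshape(grid_size, grid_size)

# Usage with stand-in data: 14x14 patch grid, 768-dim tokens.
tokens = torch.randn(196, 768)
heatmap = neuron_activation_map(tokens, neuron_idx=42)
print(heatmap.shape)  # torch.Size([14, 14])
```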