I’ve played around with ELK and so on, and my impression is that we don’t know how to read the hidden_states/residual stream reliably (beyond roughly 80% accuracy, which isn’t good enough). But learning a sparse representation (e.g. the sparse autoencoders paper from Anthropic) helps, and is seen as quite promising by people across the field (for example, EleutherAI has a research effort).
So this does seem quite promising. Sadly, we would really need a foundation model of 7B+ parameters to test it well, which is quite expensive in terms of compute.
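For concreteness, here’s a minimal sketch of the kind of sparse autoencoder setup I mean. The dimensions, hyperparameters, and the random “activations” are all made up for illustration; the real thing would train on residual-stream activations collected from a frozen model.

```python
# Rough sketch of a sparse autoencoder on residual-stream activations,
# in the spirit of Anthropic's sparse autoencoders work.
# All sizes/hyperparameters here are illustrative assumptions, not from any paper.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU keeps features non-negative, which plays well with the L1 sparsity penalty.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 term that pushes most features to zero.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# In practice `acts` would be hidden_states pulled from a frozen LLM,
# shape (num_tokens, d_model); random data here just to show the training loop.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)
for batch in acts.split(256):
    recon, feats = sae(batch)
    loss = sae_loss(batch, recon, feats)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hope is that the learned features end up far more monosemantic (and hence readable) than raw residual-stream directions, but checking that at the scale of a 7B+ foundation model is exactly the expensive part.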
Does this paper come with extra train-time compute costs?
Their interpretability results seem to be in section 4.3, and there are quite a number of them from a variety of different networks. I’m not very familiar with interpretability for image nets, but the model spontaneously learning to do high-quality image segmentation was fairly striking, as was its spontaneous subsegmentation of objects into parts, such as one neuron that responded to animals’ heads and another that responded to their legs. That certainly looks like what I’d hope for if told that an image network was highly interpretable. The largest LLMs they trained were around the size of BERT and GPT-2, so their language performance was fairly limited.