Their interpretability results seem to be in section 4.3, and there are quite a number of them from a variety of different networks. I'm not very familiar with interpretability for image nets, but the model spontaneously learning to do high-quality image segmentation was fairly striking, as was it spontaneously doing subsegmentation of object parts, such as one neuron that responded to animals' heads and another that responded to their legs. That is certainly the sort of thing I would hope to see if told that an image network was highly interpretable. The largest LLMs they trained were only around the size of BERT and GPT-2, so their capabilities were fairly limited.