For my second paragraph above: in a blog post out today, it turns out this is not only feasible, but OpenAI have experimented with doing it, and have now open-sourced the technology for doing it:
Open AI were only looking at explaining single neurons, so combining their approach with the original paper’s sparse probing technique for superpositions seems like the obvious next step.
For my second paragraph above: in a blog post out today, it turns out this is not only feasible, but OpenAI have experimented with doing it, and have now open-sourced the technology for doing it:
https://openai.com/research/language-models-can-explain-neurons-in-language-models
Open AI were only looking at explaining single neurons, so combining their approach with the original paper’s sparse probing technique for superpositions seems like the obvious next step.