A fascinating paper.
An interesting research direction here would be to carry out enough case studies to form a sizeable dataset, making sure the approaches involved give good coverage of the possibilities, and then attempt to few-shot learn and/or fine-tune the ability of an LLM-powered autonomous agent/cognitive architecture to reproduce the results of individual case studies, i.e. to attempt to automate this form of mechanistic interpretability, given a suitably labeled input set or a reliable means of labeling one.
It would also be interesting to go the other way: take a specific, randomly selected neuron, look at its activation patterns across the entire corpus, and figure out whether it’s a monosemantic neuron, and if so for what; or else look at its activation correlations with other neurons in the same layer and determine which superpositions, for which k-values, it forms part of and what each of them represents. Using an LLM, or semantic search, to look at a large set of high-activation contexts and try to come up with plausible descriptions for the neuron might be quite helpful here; a rough sketch of what the first steps could look like is below.
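To make that concrete, here is a minimal sketch of the mechanical part, assuming GPT-2 small via HuggingFace `transformers`; the layer index, neuron index, and tiny stand-in corpus are arbitrary placeholders, and a real run would use many thousands of documents. It hooks one MLP layer, collects the chosen neuron’s post-GELU activations, surfaces its top-activating tokens/contexts (the material you would hand to an LLM and ask for a common theme), and lists the most strongly correlated neighbouring neurons as a hint about possible superpositions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, NEURON = 5, 123             # arbitrary choices for illustration
texts = [                          # placeholder corpus; use a large document set in practice
    "The Eiffel Tower is in Paris.",
    "def add(a, b): return a + b",
    "Stock prices fell sharply on Friday.",
]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

acts = []                          # filled by the hook: one (seq_len, 3072) tensor per forward pass
hook = model.transformer.h[LAYER].mlp.act.register_forward_hook(
    lambda module, inp, out: acts.append(out.squeeze(0).detach())
)

all_acts = []                      # per-token activations for every neuron in the layer
records = []                       # (activation, token string, surrounding text) for the chosen neuron
with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt")
        model(**ids)
        a = acts.pop()             # (seq_len, 3072) post-GELU activations for this text
        all_acts.append(a)
        for pos in range(a.shape[0]):
            records.append((a[pos, NEURON].item(),
                            tok.decode(ids.input_ids[0, pos].item()), text))
hook.remove()

# Top-activating tokens/contexts: these are what you'd show an LLM and ask for a plausible description.
for act, token, text in sorted(records, reverse=True)[:5]:
    print(f"{act:+.3f}  {token!r:12}  in: {text}")

# Which other neurons in the same layer co-fire with it? High correlations hint at
# the superpositions this neuron might participate in.
A = torch.cat(all_acts)                                  # (n_tokens, 3072)
corr = torch.nan_to_num(torch.corrcoef(A.T)[NEURON])     # correlation of every neuron with the chosen one
corr[NEURON] = 0.0                                       # ignore self-correlation
top = torch.topk(corr, 5)
print("Most correlated neighbours:",
      list(zip(top.indices.tolist(), [round(v, 3) for v in top.values.tolist()])))
```

The interpretive step (is it monosemantic, and for what?) is then a judgement call over those printed contexts, which is exactly where an LLM or semantic search could be slotted in.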
For my second paragraph above: it turns out from a blog post out today that this is not only feasible, but OpenAI have already experimented with doing it, and have now open-sourced the technology:
https://openai.com/research/language-models-can-explain-neurons-in-language-models
OpenAI were only looking at explaining single neurons, so combining their approach with the original paper’s sparse probing technique for superpositions (sketched roughly below) seems like the obvious next step.
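For anyone unfamiliar with the flavour of that technique, here is a toy sketch of k-sparse probing; this is not the paper’s exact method, just the general idea, and the data, neuron-selection heuristic, and feature labels are all synthetic placeholders. Given per-token activations X and binary labels y for some feature, pick the k neurons with the largest class-mean difference and fit a probe restricted to just those k; sweeping k shows whether the feature lives in a single neuron (k=1 suffices) or is spread across a small superposition (accuracy only recovers at k>1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sparse_probe_accuracy(X: np.ndarray, y: np.ndarray, k: int) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    # heuristic neuron ranking: absolute difference of class-conditional means
    scores = np.abs(X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0))
    top_k = np.argsort(scores)[-k:]
    # probe sees only the k selected neurons
    probe = LogisticRegression(max_iter=1000).fit(X_tr[:, top_k], y_tr)
    return probe.score(X_te[:, top_k], y_te)

# toy demo on synthetic data: the label is carried by a 3-neuron combination,
# so a k=1 probe underperforms and accuracy jumps once k reaches 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))                  # stand-in "activations"
y = (X[:, [7, 42, 99]].sum(1) > 0).astype(int)    # hidden 3-neuron feature
for k in (1, 2, 3, 8):
    print(f"k={k}: probe accuracy {sparse_probe_accuracy(X, y, k):.2f}")
```

Pairing something like this with OpenAI’s LLM-generated neuron explanations would let the explanation step target small sets of neurons rather than single ones.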