This seems like a pretty promising approach to interpretability, and I think GPT-6 will probably be able to analyze all the neurons in itself with >0.5 scores. Which seems to be recursive self-improvement territory. It would be nice if by the time we got there, we already mostly knew how GPT-2, 3, 4, and 5 worked. Knowing how previous generation LLMs work is likely to be integral to aligning a next generation LLM and it’s pretty clear that we’re not going to be stopping development, so having some idea of what we’re doing is better than none. Even if an AI moratorium is put in place, it would make sense for us to use GPT-4 to automate some of the neuron research going on right now. What we can hope for is that we do the most amount of work possible with GPT-4 before we jump to GPT-5 and beyond.
GPT-6 will probably be able to analyze all the neurons in itself with >0.5 scores
This seems to assume the task (writing explanations for all neurons with >0.5 scores) is possible at all, which is doubtful. Superposition and polysemanticity are certainly things that actually happen.
This seems like a pretty promising approach to interpretability, and I think GPT-6 will probably be able to analyze all the neurons in itself with >0.5 scores. Which seems to be recursive self-improvement territory. It would be nice if by the time we got there, we already mostly knew how GPT-2, 3, 4, and 5 worked. Knowing how previous generation LLMs work is likely to be integral to aligning a next generation LLM and it’s pretty clear that we’re not going to be stopping development, so having some idea of what we’re doing is better than none. Even if an AI moratorium is put in place, it would make sense for us to use GPT-4 to automate some of the neuron research going on right now. What we can hope for is that we do the most amount of work possible with GPT-4 before we jump to GPT-5 and beyond.
This seems to assume the task (writing explanations for all neurons with >0.5 scores) is possible at all, which is doubtful. Superposition and polysemanticity are certainly things that actually happen.