I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?
We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can’t quite interpret individual neurons, but we’ve found some examples of where we can interpret what an individual attention head is doing.
I would be happy to see you write a top-level post about this paper. :)
Thanks! I’m probably not going to have time to write a top-level post myself, but I liked Evan Hubinger’s post about it.