This is great! Really professionally made. I love the look and feel of the site. I’m very impressed you were able to make this in three weeks.
I think my biggest concern is (2): Neurons are the wrong unit for useful interpretability—or at least they can’t be the only thing you’re looking at for useful interpretability. My take is that we also need to know what’s going on in the residual stream; if all you can see is what is activating neurons most, but not what they’re reading from and writing to the residual stream, you won’t be able to distinguish between two neurons that may be activating on similar-looking tokens but that are playing completely different roles in the network. Moreover, there will be many neurons where interpreting them is basically hopeless, because their role is to manipulate the residual stream in a way that’s opaque if you have no way of understanding what’s in the residual stream.
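A toy sketch of the point above, assuming a simplified MLP neuron with one input weight vector (what it reads from the residual stream) and one output weight vector (what it writes back). The dimensions, weights, and ReLU nonlinearity are illustrative assumptions, not taken from any real model. It shows two neurons that activate identically on every token yet write to orthogonal directions, so activation data alone cannot tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual stream width

# Both neurons read from the same residual-stream direction...
read_dir = rng.normal(size=d_model)
read_dir /= np.linalg.norm(read_dir)
w_in_a = read_dir.copy()
w_in_b = read_dir.copy()

# ...but write to orthogonal directions, i.e. they play different roles.
w_out_a = np.zeros(d_model)
w_out_a[0] = 1.0
w_out_b = np.zeros(d_model)
w_out_b[1] = 1.0


def activation(w_in, resid):
    """Neuron activation per token position (ReLU used for simplicity)."""
    return np.maximum(resid @ w_in, 0.0)


# Five made-up token positions, each with a residual-stream vector.
resid = rng.normal(size=(5, d_model))

act_a = activation(w_in_a, resid)
act_b = activation(w_in_b, resid)

# Top-activating tokens look identical for the two neurons.
same_activations = bool(np.allclose(act_a, act_b))

# The write directions share nothing.
write_overlap = float(w_out_a @ w_out_b)
```

Here `same_activations` comes out true while `write_overlap` is zero: a view that only shows top-activating tokens would present these two neurons as the same thing.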
Take this neuron, for example (this was the first one to pop up for me, so not too cherry-picked):
Clearly, the autogenerated explanation of “words related to expressing personal emotions or feelings” doesn’t fit at all. But also, coming up with a reasonable explanation myself is really hard. I think probably this neuron is doing something that’s inscrutable until you’ve understood what it’s reading and writing—which requires some understanding of the residual stream.
My hope is that the residual stream can mostly be understood in terms of relevant directions, which will represent features or superpositions of features. If users can submit possible mappings of directions → features, and we can look at which directions the neuron is reading from and writing to, then there may be more hope for interpreting a neuron like the one above. I’ve been working on a similar tool to yours, which would allow users to submit explanations for residual stream directions. It’s not online at the moment, but here’s a current screenshot of it:
DM me if you’d be interested in talking further, or working together in some capacity. We clearly have a similar approach.
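To make the directions → features idea concrete, here is a rough sketch of how user-submitted direction labels might be matched against a neuron's read and write weights using cosine similarity. Everything in it is hypothetical: the function `describe_neuron`, the labels, and the 4-dimensional vectors are invented for illustration and do not come from either tool.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def describe_neuron(w_in, w_out, labeled_directions, top_k=1):
    """Rank labeled residual-stream directions by alignment with a neuron.

    w_in: the neuron's input weights (what it reads from the stream).
    w_out: the neuron's output weights (what it writes to the stream).
    labeled_directions: dict of label -> direction vector (user-submitted).
    """
    def rank(w):
        scores = [(label, cosine(w, d)) for label, d in labeled_directions.items()]
        # Sort by magnitude: a strong negative alignment is informative too.
        return sorted(scores, key=lambda s: abs(s[1]), reverse=True)[:top_k]

    return {"reads": rank(w_in), "writes": rank(w_out)}


# Hypothetical user-submitted labels for three directions.
labeled_directions = {
    "follows an article": np.array([1.0, 0.0, 0.0, 0.0]),
    "plural noun": np.array([0.0, 1.0, 0.0, 0.0]),
    "past tense": np.array([0.0, 0.0, 1.0, 0.0]),
}

# Made-up weights for a single neuron.
w_in = np.array([0.9, 0.1, 0.0, 0.2])
w_out = np.array([0.0, 0.1, -0.8, 0.1])

report = describe_neuron(w_in, w_out, labeled_directions)
```

With these made-up numbers, the neuron reads mostly from the "follows an article" direction and writes against the "past tense" direction, which is the kind of summary that top-activating tokens alone would not reveal.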
Hi Adam, and thanks for your feedback and suggestion. Residual Viewer looks awesome. I have DMed you to chat more about it!
The game is addictive for me, so I can’t resist an attempt at describing this one, too :)
It seems related to grammar, possibly looking for tokens on or after articles and possessives.
My impression from trying out the game is that plausible interpretations are not too hard to find for most neurons, but most seem to have low-level syntactic (2nd token of a word) or grammatical (conjunctions) concerns.
Assuming that is a sensible thing to ask for, I would definitely be interested in a UI that allows working with the next smallest meaningful construction that features more than a single neuron.
Some neurons seem to have 2 separate low-level patterns that cannot clearly be tied together. This suggests they may have separate “graph neighbors” that rely on them for 2 separate concerns. I would like some way to follow and separate what neurons are doing together, not just individually, if that makes any sense =)
(As an aside, I’d like to apologize that this isn’t directly responding to the residuals idea. I’m not sure I know what residuals are, though the description of what can be done with them seems promising, and I’d like to try the other tool when it comes online!)