Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an, _An, an, An, etc.). It’s curious because the semantics of these tokens can be quite different (compared to the though, tho, however neuron).
Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.
One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
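A rough sketch of that check with TransformerLens, in case it’s useful (my own code, not from the post; I’m assuming GPT-2 Large and reading the leading-underscore tokens as the leading-space variants):

```python
# Hedged sketch: pairwise cosine similarities between the embeddings of the
# "an" token variants (assumes GPT-2 Large and the standard GPT-2 BPE tokenizer).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")

variants = [" an", " An", "an", "An"]              # each is a single BPE token
ids = [model.to_single_token(v) for v in variants]

embs = model.W_E[ids]                              # [4, d_model] token embeddings
embs = embs / embs.norm(dim=-1, keepdim=True)      # normalise for cosine similarity
cos = embs @ embs.T

for i in range(len(variants)):
    for j in range(i + 1, len(variants)):
        print(f"cos({variants[i]!r}, {variants[j]!r}) = {cos[i, j].item():.3f}")
```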
Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.
We did look very briefly at this for the " an" neuron. We plotted the congruence of the residual stream with the neuron’s input weights throughout the model. The second figure shows the per-layer difference, i.e. how much each layer changes that congruence.
Unfortunately I can’t seem to include an image in a comment. See it here.
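Roughly, the computation behind the plot looks like this (a simplified sketch rather than our exact code; substitute the actual neuron index for the placeholder NEURON_IDX):

```python
# Sketch: dot product of the residual stream (after each layer) with the
# " an" neuron's input-weight direction, at the final position of a prompt.
# Assumes GPT-2 Large via TransformerLens; NEURON_IDX is a placeholder.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER = 31       # the MLP layer containing the " an" neuron
NEURON_IDX = 0   # placeholder -- substitute the actual neuron index

prompt = "I went to the orchard and picked"   # " an" is a plausible next token here
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

w_in = model.W_in[LAYER, :, NEURON_IDX]       # [d_model] input weights of the neuron
w_in = w_in / w_in.norm()                     # use the normalised direction

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]  # residual stream after this layer, final position
    print(f"after layer {layer:2d}: congruence = {(resid @ w_in).item():+.2f}")
```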
We can’t tell that much from this but I think there are three takeaways:
The model doesn’t start ‘preparing’ to activate the " an" neuron until layer 16.
No single layer stands out as being particularly responsible for the " an" neuron’s activation (which is part of why we didn’t investigate this further).
The congruence increases a lot after MLP 31. This means the output of layer 31 is very congruent with the input weights of the " an" neuron (which is itself in MLP 31). I think this is almost entirely the effect of the " an" neuron itself, partly because the input weights of the " an" neuron are very congruent with the " an" token (although not as much as the neuron’s output weights are). This makes me think that this neuron is at least partly a ‘signal boosting’ neuron.
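For reference, the kind of check behind that last point looks something like this (again a rough sketch rather than our exact code, with a placeholder neuron index):

```python
# Sketch: cosine similarity of the " an" neuron's input/output weights with the
# " an" token's embedding/unembedding directions (GPT-2 Large, placeholder index).
import torch
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-large")
LAYER = 31
NEURON_IDX = 0   # placeholder -- substitute the actual neuron index

an_id = model.to_single_token(" an")
w_in = model.W_in[LAYER, :, NEURON_IDX]     # [d_model] what the neuron reads from the residual stream
w_out = model.W_out[LAYER, NEURON_IDX, :]   # [d_model] what the neuron writes back

emb = model.W_E[an_id]                      # embedding direction of " an"
unemb = model.W_U[:, an_id]                 # unembedding direction of " an" (tied to W_E in raw GPT-2)

def cos(a, b):
    return F.cosine_similarity(a, b, dim=0).item()

print("cos(W_in,  emb[' an']):  ", round(cos(w_in, emb), 3))
print("cos(W_out, unemb[' an']):", round(cos(w_out, unemb), 3))
print("cos(W_in,  W_out):       ", round(cos(w_in, w_out), 3))  # high value would fit a 'signal boosting' story
```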
One reason the neuron is congruent with multiple of the same tokens may be because those token embeddings are similar (you can test this by checking their cosine similarities).
Yup! I think that’d be quite interesting. Is there any work on characterizing the embedding space of GPT-2?
Adam Scherlis did some preliminary exploration here:
https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights
Here’s a more thorough investigation of the overall shape of said embeddings with interactive figures:
https://bert-vs-gpt2.dbvis.de/
There’s also a lot of academic work on the geometry of LM embeddings, e.g.:
https://openreview.net/forum?id=xYGNO86OWDH (BERT, ERNIE)
https://arxiv.org/abs/2209.02535 (GPT-2-medium)
(Plus a mountain more on earlier text/token embeddings like Word2Vec.)
The SolidGoldMagikarp post is also related to the embedding space: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation