Really impressive work, and I found the Colab very educational.
I may be missing something obvious, but it is probably worth including “Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space” (Geva et al., 2022) in the related literature. They highlight that the output of the FFN (which gets added to the residual stream) can appear to encode human-interpretable concepts.
Notably, they did not use SGD to find these directions, but rather had “NLP experts” (grad students) manually look over the top 30 words associated with each value vector.
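For anyone curious what that inspection looks like concretely, here is a rough sketch (my own illustration, not code from the post or the paper, assuming GPT-2 via HuggingFace transformers and ignoring the final LayerNorm): project each FFN value vector through the unembedding matrix and read off the tokens it promotes most strongly.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer = 10   # arbitrary block to inspect
k = 30       # Geva et al. look at the top 30 tokens per value vector

with torch.no_grad():
    # Each row of c_proj.weight is one FFN "value vector" (d_mlp x d_model);
    # the FFN output added to the residual stream is a weighted sum of these rows.
    value_vectors = model.transformer.h[layer].mlp.c_proj.weight   # (d_mlp, d_model)
    unembed = model.lm_head.weight                                  # (vocab, d_model)
    scores = value_vectors @ unembed.T                              # (d_mlp, vocab)
    top_ids = scores.topk(k, dim=-1).indices

# Print the tokens most strongly promoted by the first few value vectors.
for i in range(3):
    tokens = [tokenizer.decode([t]) for t in top_ids[i].tolist()]
    print(f"value vector {i}: {tokens}")
```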