Logan Riggs comments on Really Strong Features Found in Residual Stream

Logan Riggs 9 Jul 2023 11:46 UTC
LW: 5 AF: 1
0
AF
Setup:
Model: Pythia-70m (actually named 160M!)
Transformer lens: “blocks.2.hook_resid_post” (so layer 2)
Data: Neel Nanda’s Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)
Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])
Logit Lens: The decoder here had 2k feature directions. Each direction is size d_model, so you can directly unembed the feature direction (e.g. the German Feature) you’re looking at. Additionally I subtract out several high norm tokens from the unembed, which may be an artifact of the pythia tokenizer never using those tokens (thanks Wes for mentioning this!)
Ablated Text: Say the default feature (or neuron in your words) activation of Token_pos 10 is 5, so you can remove all tokens from 0 to 10 one at a time and see the effect on the feature activation. I select the token pos by finding the max feature activating position or the uniform one described above. This at least shows some attention head dependencies, but not more complicated ones like (A or B… C) where removing A or B doesn’t effect C, but removing both would.
[Note: in the examples, I switch between showing the full text for context & showing the partial text that ends on the uniformly-selected token]