I checked whether this token character length direction is important to the “newline prediction to maintain text width in line-limited text” behavior of pythia-70m. To review, one of the things pythia-70m seems able to do is predict newlines in places where a newline correctly breaks the text so that the line length stays approximately constant. Here’s an example of some text which I’ve manually broken periodically so that the lines have roughly the same width. The color of each token corresponds to the probability pythia-70m assigned to a newline occurring at that token, with darker blue corresponding to higher probability. I used CircuitsVis for this:
We can see that at the last couple of tokens in most lines, the model starts placing nontrivial probability on a newline occurring there.
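For reference, here’s a minimal sketch of how a visualization like this can be produced with TransformerLens and the CircuitsVis Python package (the colab may do this differently; the text string here is a placeholder):

```python
import circuitsvis as cv
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")
newline_id = model.to_single_token("\n")

text = "some manually\nline-broken text\ngoes here\n"  # placeholder
tokens = model.to_tokens(text)                 # [1, seq_len], BOS prepended
probs = model(tokens).softmax(dim=-1)          # [1, seq_len, d_vocab]

# At each position, the probability the model assigns to a newline
# being the next token.
newline_probs = probs[0, :, newline_id]

# Color each token by its newline probability.
cv.tokens.colored_tokens(
    tokens=model.to_str_tokens(text),
    values=newline_probs.tolist(),
)
```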
I thought this “number of characters per token” direction would be part of whatever circuit implements this behavior. However, ablating that direction in embedding space seems to have little to no effect on the behavior. Going the other direction, manually adding this direction to the embeddings doesn’t seem to significantly affect the behavior either!
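Roughly, the two interventions look something like the sketch below (not the exact code from the colab; it assumes a unit-norm direction char_len_dir has already been extracted, and it uses a random placeholder for that direction and an arbitrary scale for the addition):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-70m")

# Placeholder: in practice this would be the extracted character-length
# direction, normalized to unit length.
char_len_dir = torch.randn(model.cfg.d_model, device=model.W_E.device)
char_len_dir = char_len_dir / char_len_dir.norm()

def ablate_direction(embed, hook):
    # Project out the char_len_dir component from every token embedding.
    coeffs = embed @ char_len_dir            # [batch, pos]
    return embed - coeffs[..., None] * char_len_dir

def add_direction(embed, hook, scale=5.0):
    # Push every token embedding along char_len_dir (arbitrary scale).
    return embed + scale * char_len_dir

text = "some manually\nline-broken text\ngoes here\n"  # placeholder
tokens = model.to_tokens(text)

ablated_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("hook_embed", ablate_direction)]
)
boosted_logits = model.run_with_hooks(
    tokens, fwd_hooks=[("hook_embed", add_direction)]
)
# Compare newline probabilities from these runs against a clean run
# to see whether the behavior changes.
```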
Maybe there are multiple directions representing the length of a token? Here’s the colab to reproduce: https://colab.research.google.com/drive/1HNB3NHO7FAPp8sHewnum5HM-aHKfGTP2?usp=sharing
Oh that’s fascinating, thanks for sharing! In the model I was studying I found that intervening on the token direction mattered a lot for ending lines after 80 characters. Maybe there are multiple directions...? Very weird!