Nice work! I’m especially impressed by the [word] and [word] example: this cannot be read off the embeddings, so the model must actually be computing and storing this feature somewhere! I think this is exciting since the things we care about (deception etc.) are also definitely not included in the embeddings. I think you could make a similar case for Title Case and Beginning & End of First Sentence, but those examples look less clear, e.g. the Title Case feature could be mostly stored in “embedding of an uppercase word that is usually lowercase”.
Actually, any feature that is significantly affected in “Ablated Text” can't be coming from the embedding alone. “Ablated Text” here means I remove each token in the context & see the effect on the feature activation for the last token. This is true for the StackExchange & Last Name features (though the Last Name feature only drops to ~50% of its activation: it still recognizes last names on their own, just doesn't activate as strongly).
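For concreteness, here's a minimal sketch of that ablation procedure, assuming a hypothetical `feature_activation(tokens)` helper that runs the model and returns the feature's activation on the last token (names are illustrative, not my actual code):

```python
def ablation_effects(tokens, feature_activation):
    """For each context token, remove it and measure how much the
    feature's activation on the last token drops."""
    baseline = feature_activation(tokens)
    effects = []
    for i in range(len(tokens) - 1):  # ablate context tokens, keep the last token
        ablated = tokens[:i] + tokens[i + 1:]
        effects.append((tokens[i], baseline - feature_activation(ablated)))
    return baseline, effects
```

A large drop for some context token means the feature depends on that token being present, i.e. it isn't just reading the last token's embedding.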
The Beginning & End of First Sentence feature actually doesn't show this ablation effect (though I think that's just because removing the first word makes the 2nd word the new first word?), but I haven't studied this rigorously.