Logan Riggs comments on Really Strong Features Found in Residual Stream

Logan Riggs 9 Jul 2023 11:21 UTC
LW: 4 AF: 1
Actually, any feature that is significantly affected in “Ablated Text” is not just reading off the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is true for the StackExchange & Last Name features (though ablation only accounts for ~50% of the activation for Last Name: it will still recognize last names by themselves, just not activate as much).
The Beginning & End of First Sentence feature actually doesn’t show this effect (I think that’s because removing the first word just makes the 2nd word the new first word?), but I haven’t rigorously studied this.
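For anyone who wants to try the Ablated Text check themselves, here is a minimal sketch of the procedure described above. It is not the actual code used for the post: `ablation_effects`, `toy_last_name_feature` and `toy_embedding_only_feature` are hypothetical names, and the two toy features are hand-written stand-ins for a real feature's activation on the last token (in practice you would run the model and the learned dictionary instead).

```python
from typing import Callable, List, Sequence

FeatureFn = Callable[[Sequence[str]], float]


def ablation_effects(tokens: Sequence[str], feature_fn: FeatureFn) -> List[float]:
    """Remove each context token in turn and record the change in the
    feature activation on the last token (ablated minus baseline).

    The last token itself is never removed, so the result has
    len(tokens) - 1 entries.
    """
    baseline = feature_fn(tokens)
    effects = []
    for i in range(len(tokens) - 1):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        effects.append(feature_fn(ablated) - baseline)
    return effects


def toy_last_name_feature(tokens: Sequence[str]) -> float:
    """Hypothetical context-dependent feature (an assumption, not a real one):
    0.5 if the last token is capitalized, plus 0.5 if the token right
    before it is capitalized too (i.e. looks like a first name)."""
    if not tokens or not tokens[-1][:1].isupper():
        return 0.0
    activation = 0.5
    if len(tokens) >= 2 and tokens[-2][:1].isupper():
        activation += 0.5
    return activation


def toy_embedding_only_feature(tokens: Sequence[str]) -> float:
    """Hypothetical feature that only looks at the last token itself."""
    return 1.0 if tokens and tokens[-1][:1].isupper() else 0.0


tokens = ["my", "name", "is", "Logan", "Riggs"]
context_effects = ablation_effects(tokens, toy_last_name_feature)
embedding_effects = ablation_effects(tokens, toy_embedding_only_feature)
print(context_effects)
print(embedding_effects)
```

A feature that only depends on the last token's embedding shows no change under any ablation, while the context-dependent toy feature drops when the preceding name is removed but still fires at a reduced level on the last name alone.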