Nice work! I’m especially impressed by the [word] and [word] example: this cannot be read off the embeddings, so the model must actually be computing and storing this feature somewhere! I think this is exciting since the things we care about (deception etc.) are also definitely not included in the embeddings. I think you could make a similar case for Title Case and Beginning & End of First Sentence, but those examples look less clear, e.g. the Title Case feature could be mostly stored in “embedding of an uppercase word that is usually lowercase”.
Actually, any feature that is significantly affected in “Ablated Text” can't be coming from the embedding alone. “Ablated Text” here means I remove each token in the context & see the effect on the feature activation for the last token. This is true for the StackExchange & Last Name features (though the Last Name feature only drops to ~50% of its activation: it still recognizes last names on their own, just doesn't activate as strongly).
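For concreteness, here's a minimal sketch of that ablation procedure, assuming a hypothetical `feature_activation(tokens)` helper that runs the model and returns the feature's activation on the last token (names are illustrative, not my actual code):

```python
def ablation_effects(tokens, feature_activation):
    """For each context token, remove it and measure how much the
    feature's activation on the last token drops."""
    baseline = feature_activation(tokens)
    effects = []
    for i in range(len(tokens) - 1):  # ablate context tokens, keep the last token
        ablated = tokens[:i] + tokens[i + 1:]
        effects.append((tokens[i], baseline - feature_activation(ablated)))
    return baseline, effects
```

A large drop for some context token means the feature depends on that token being present, i.e. it isn't just reading the last token's embedding.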
The Beginning & End of First Sentence feature actually doesn't show this ablation effect (though I think that's just because removing the first word makes the 2nd word the new first word?), but I haven't studied this rigorously.