Neel Nanda comments on A Mechanistic Interpretability Analysis of Grokking

Neel Nanda 17 Aug 2022 6:15 UTC
LW: 6 AF: 4
Thanks! I agree that they’re pretty hard to distinguish, and the evidence favouring one over the other is fairly weak: it’s hard to tell a winning lottery ticket at initialisation apart from one stumbled upon within the first 200 steps, say.
My favourite piece of evidence is [this video from Eric Michaud](https://twitter.com/ericjmichaud_/status/1559305105521926144): we know that the first 2 principal components of the embedding form a circle at the end of training. But if we fix the axes at the end of training, and project the embedding from the start of training onto them, it’s pretty circle-y.
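
To make the "fix the axes, then project" step concrete, here is a minimal NumPy sketch. It is not Eric's code. The embeddings are synthetic stand-ins (a noisy circle for the end-of-training embedding, Gaussian noise for the initial one), and the sizes `p`, `d_model` and the helper names `top_k_principal_axes`, `project` and `circularity` are all my own assumptions for illustration. With real checkpoints you would swap in the actual embedding matrices.

```python
import numpy as np

def top_k_principal_axes(embedding, k=2):
    """Return the top-k principal directions of `embedding` as a (d_model, k) matrix."""
    centered = embedding - embedding.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def project(embedding, axes):
    """Centre `embedding` and project it onto fixed `axes`."""
    centered = embedding - embedding.mean(axis=0, keepdims=True)
    return centered @ axes

def circularity(points_2d):
    """Crude, hypothetical score: relative spread of radii (0 means a perfect circle)."""
    radii = np.linalg.norm(points_2d, axis=1)
    return radii.std() / radii.mean()

# --- Synthetic stand-ins; NOT real model weights ---
rng = np.random.default_rng(0)
p, d_model = 113, 128  # assumed sizes, chosen only for illustration

# "End of training": points on a circle in a random 2D plane, plus small noise.
angles = 2 * np.pi * np.arange(p) / p
plane, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))  # orthonormal columns
final_embedding = np.stack([np.cos(angles), np.sin(angles)], axis=1) @ plane.T
final_embedding += 0.01 * rng.normal(size=(p, d_model))

# "Start of training": pure noise here; replace with the real initial embedding.
init_embedding = rng.normal(size=(p, d_model))

# Fix the axes using the final embedding, then project both checkpoints onto them.
axes = top_k_principal_axes(final_embedding, k=2)
final_proj = project(final_embedding, axes)
init_proj = project(init_embedding, axes)

print("final circularity:", circularity(final_proj))
print("init circularity: ", circularity(init_proj))
```

The important design choice is that `axes` is computed once from the final embedding and reused for every earlier checkpoint. Recomputing PCA per checkpoint would answer a different question. The `circularity` score is just one rough way to put a number on "circle-y", and it ignores whether the points are in the right order around the circle.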