Excellent comment. I independently came to the same main takeaway. Thanks for the pictures!
Agree with nitpick, although I get why they restrict the term “grok” to mean “test loss minimum lagging far behind training loss minimum”. That’s the mystery and distinctive pattern from the original paper, and that’s what they’re aiming to explain.