This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very suprising.
I’ve commented on that, but I’m not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomenon, and we shouldn’t go around prematurely conflating them.
This post (which is really dope) provides some grokking examples in large language models in a Big-Bench video at 19313s & 19458s, with that segment (18430s-19650s) being a nice watch! I shall spend a bit more time collecting and precisely identifying evidence and then include it in the grokking part of this post. This was a really nice thing to know about and very suprising.
I’ve commented on that, but I’m not convinced that the phase transitions in learning are grokking, per se. There are many different scaling phenomenon, and we shouldn’t go around prematurely conflating them.