FWIW, the fact that the scaling laws were different, extrapolate very differently, and also apparently resolve the contradiction was discussed a lot at the time; I dunno if it was discussed enough, but it certainly came up in the discussions here & /r/MLscaling & by Daniel & Nostalgebraist & the usual suspects.
Do you have links handy?
Various discussion in this reddit thread: https://www.reddit.com/r/mlscaling/comments/trwkck/training_computeoptimal_large_language_models/
In particular this comment: https://www.reddit.com/r/mlscaling/comments/trwkck/comment/i2pc6bk/?utm_source=reddit&utm_medium=web2x&context=3
Dang, I’ve been missing out on juicy Gwern comments! I better follow them on reddit...