I’m quite skeptical of “this will be better around 10^23 flop” and their scaling laws overall.
I think if you properly quantified the uncertainty in the scaling law fit, the slope error bars would fully surround the transformer slope, and the 30% confidence interval would include “always worse”. They seem to be extrapolating from three datapoints.
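To make the three-datapoints worry concrete, here's a minimal sketch (with made-up numbers, not the paper's actual losses) of fitting a power law in log-log space to three points and computing the slope's standard error. With only one residual degree of freedom, the confidence interval on the slope is enormous:

```python
# Illustrative sketch with hypothetical (compute, loss) points -- NOT the
# paper's data. Fit log10(loss) = a + b * log10(compute) by least squares.
import math

data = [(1e19, 3.2), (1e20, 3.0), (1e21, 2.6)]  # made-up datapoints

xs = [math.log10(c) for c, _ in data]
ys = [math.log10(l) for _, l in data]
n = len(data)
xbar = sum(xs) / n
ybar = sum(ys) / n

sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = ybar - slope * xbar

# Residual variance has only n - 2 = 1 degree of freedom, so the slope's
# standard error estimate is itself extremely unreliable.
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
s2 = sum(r ** 2 for r in resid) / (n - 2)
se_slope = math.sqrt(s2 / sxx)

# The 95% t critical value at 1 dof is ~12.7, blowing up the interval.
ci = (slope - 12.7 * se_slope, slope + 12.7 * se_slope)
print(f"slope = {slope:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Even with these modestly noisy toy points, the interval spans both negative and positive slopes, i.e. it can't rule out the loss getting *worse* with compute, let alone pin down where two fitted lines cross.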
Further, a key aspect of their claimed scaling laws and intersection is that their method performs worse at smaller scale. They could end up with this result simply by tuning the hyperparameters less effectively for the small models in the sweep. It’s easy to make performance worse (especially for your own method!), so I don’t think this is very solid. (To be clear, I don’t expect deliberate fraud, but it’s easy to end up with incorrect results by accident here, especially if you’re fishing for optimistic findings.)
(I don’t have a strong view on the relevance of the method overall, but the prior for this sort of paper indicates a quite low chance of widespread adoption.)
Yeah, you’ve convinced me I was a little too weak just by saying “the scaling laws are untested”—I had the same feeling of like “maybe I’m getting Eulered here, and maybe they’re Eulering themselves” with the 10^23 thing.
Mostly I just kept seeing suggested articles in the mainstream-ish tech press about this “wow, no MatMul” thing, assumed it was overhyped or misleading, and was pleasantly surprised it was for real (as far as it goes). But I’d give it probably… 15%? of having industrial use cases in the next few years. Which I guess is actually pretty high! Could be nice for really, really huge context windows, where scaling on input token length sucks.