Theoretically and empirically, KANs possess faster neural scaling laws than MLPs.
What do they mean by this? Isn't that contradicted by their recommendation to use an ordinary architecture if you want fast training?
It seems like they mean faster per parameter, which is an… unclear claim, given that each parameter or step here appears to represent more computation (there's no mention of FLOPs) than a parameter/step in a matmul|relu network would. Maybe you could buff that out with specialized hardware, but they don't discuss hardware.
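To make the per-parameter compute worry concrete, here's a rough back-of-envelope I put together. The grid size G, spline order k, and the naive evaluation cost are all my assumptions (the paper doesn't report FLOPs), so treat this as a sketch rather than a measurement:

```python
# Crude FLOP-per-parameter comparison under my own assumptions (not from the paper).
# An MLP weight costs roughly one multiply-accumulate per forward pass.
# A KAN edge stores ~(G + k) B-spline coefficients, and evaluating that edge
# means evaluating basis functions, a dot product, and a SiLU residual term.

G, k = 5, 3  # assumed number of grid intervals and spline order

mlp_flops_per_param = 2                # 1 multiply + 1 add
kan_params_per_edge = G + k            # spline coefficients (ignoring extra per-edge scales)
kan_flops_per_edge = k * (G + k)       # basis function evaluation, done naively
kan_flops_per_edge += 2 * (G + k)      # dot product with the coefficients
kan_flops_per_edge += 10               # rough cost of the SiLU residual
kan_flops_per_param = kan_flops_per_edge / kan_params_per_edge

print(f"MLP: ~{mlp_flops_per_param} FLOPs per parameter")
print(f"KAN: ~{kan_flops_per_param:.0f} FLOPs per parameter under these assumptions")
```

So per parameter a KAN looks like a small constant factor more expensive in raw FLOPs, and on top of that none of it maps onto a single fused matmul the way the MLP does, which is the hardware point.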
One might worry that KANs are hopelessly expensive, since each MLP’s weight parameter becomes KAN’s spline function. Fortunately, KANs usually allow much smaller computation graphs than MLPs. For example, we show that for PDE solving, a 2-Layer width-10 KAN is 100 times more accurate than a 4-Layer width-100 MLP (10−7 vs 10−5 MSE) and 100 times more parameter efficient (102 vs 104 parameters) [this must be a typo, this would only be 1.01 times more parameter efficient].
I'm not sure this answers the question. What are the parameters, anyway? Are they just single floats? If they're not, that's pretty misleading.
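For what it's worth, my reading of the paper is that the parameters are ordinary floats, namely the B-spline coefficients on each edge (plus a scale or two), so the counts can be sanity-checked. A quick back-of-envelope, where the exact shapes, grid size, and spline order are my guesses rather than the paper's reported configs:

```python
# Sanity check of the quoted parameter counts; shapes, G, and k are my assumptions.

def mlp_params(widths):
    # widths = [in, hidden..., out]; count weights + biases
    return sum(a * b + b for a, b in zip(widths, widths[1:]))

def kan_params(widths, G=5, k=3):
    # roughly (G + k) spline coefficients per edge, ignoring extra scale terms
    return sum(a * b * (G + k) for a, b in zip(widths, widths[1:]))

print(mlp_params([2, 100, 100, 100, 1]))  # 4-layer width-100 MLP: ~2e4, i.e. order 10^4
print(kan_params([2, 10, 1]))             # 2-layer width-10 KAN:  240, i.e. order 10^2
```

Which is at least consistent with reading the quote as 10^2 vs 10^4 parameters, with each parameter being a single float.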
I'm guessing they mean that the KAN's loss curve keeps improving to much lower loss before it trails off, while MLPs lose momentum much sooner. So even if MLPs are faster per unit of performance at small parameter counts and small datasets, there's no way they will be at scale, to the point that it's almost not worth comparing in terms of compute. (Compute would be an inherently rough measure anyway because, as I touched on, the relative compute cost will change as soon as specialized spline hardware starts to be built; given that today's hardware is specialized for matmul|relu, the relative performance comparison is probably absurdly unfair to any new architecture.)
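To pin down what "faster scaling law" would even mean: my understanding is that both curves are being fit as power laws in parameter count, and the claim is about the exponent rather than about wall-clock time or FLOPs. In my notation (not the paper's):

```latex
% test loss as a power law in parameter count N; "faster scaling law" = larger alpha
\ell(N) \propto N^{-\alpha}, \qquad \alpha_{\mathrm{KAN}} > \alpha_{\mathrm{MLP}}
```

If I'm reading the paper right, their theory ties the KAN exponent to the spline order (roughly alpha = k + 1, so 4 for cubic splines) independent of input dimension, whereas the MLP bounds they cite have exponents that shrink as input dimension grows. That would be the sense in which KANs win "per parameter" at scale even if each parameter costs more to evaluate.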
On the "102 vs 104 parameters" point: clearly they mean 10^2 vs 10^4. Same with the "10−7 vs 10−5 MSE". Must be some copy-paste/formatting issue with the quote, not a typo in the paper.