Vladimir_Nesov comments on Scaling Laws and Superposition

Vladimir_Nesov 10 Apr 2024 16:20 UTC
3 points

Model B has 8 times the aspect ratio [...] which falls under the reported range in Kaplan et al.

Nice, this is explained under Figure 5, in particular:

The loss varies only a few percent over a wide range of shapes. [...] an $(n_{\mathrm{layer}}, d_{\mathrm{model}}) = (6, 4288)$ reaches a loss within 3% of the (48, 1600) model

(I previously missed this point and assumed the shape had to be chosen optimally for the parameter count to fit the scaling laws.)
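
To make the two shapes in the quote concrete, here is a rough sketch that compares them. It assumes the approximation $N \approx 12\, n_{\mathrm{layer}}\, d_{\mathrm{model}}^2$ for non-embedding parameters from Kaplan et al. (which holds for the standard setup where the attention width equals $d_{\mathrm{model}}$ and the feed-forward width is $4\, d_{\mathrm{model}}$), and takes aspect ratio to mean $d_{\mathrm{model}} / n_{\mathrm{layer}}$. The function names are made up for illustration.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate non-embedding parameter count, N ~ 12 * n_layer * d_model^2.

    Assumption: standard Transformer shape (d_attn = d_model, d_ff = 4 * d_model),
    ignoring biases, layer norms, and embeddings.
    """
    return 12 * n_layer * d_model ** 2


def aspect_ratio(n_layer: int, d_model: int) -> float:
    """Aspect ratio taken as width over depth, d_model / n_layer."""
    return d_model / n_layer


# The two shapes mentioned in the quoted passage.
shapes = {"wide": (6, 4288), "deep": (48, 1600)}

for name, (n_layer, d_model) in shapes.items():
    n = non_embedding_params(n_layer, d_model)
    ar = aspect_ratio(n_layer, d_model)
    print(f"{name}: n_layer={n_layer}, d_model={d_model}, "
          f"N ~ {n / 1e9:.2f}B, aspect ratio ~ {ar:.1f}")
```

Under these assumptions the two models have parameter counts within roughly 10% of each other while their aspect ratios differ by a factor of about 20, which is the sense in which shape matters much less than parameter count.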